Anirban Roy

LLM Training Efficiency: From Throughput to Goodput

Abstract:

Goodput reframes LLM training efficiency from raw tokens/sec to the fraction of theoretical capacity converted into real training progress. It decomposes losses ("badput") across three layers: infrastructure availability, framework checkpoint/recovery overhead, and model FLOPs utilization (MFU). Multiplying these layers yields an end-to-end efficiency metric that turns measurement into actionable engineering priorities.
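The layered decomposition can be sketched as a simple product of per-layer efficiency fractions. This is an illustrative Python sketch, not the talk's actual methodology; the layer names and numbers below are hypothetical examples, not measurements.

```python
# Hypothetical sketch: composing per-layer efficiencies into end-to-end goodput.
# All figures below are illustrative placeholders, not real measurements.

def goodput(layer_efficiencies):
    """Multiply per-layer efficiency fractions (each in 0..1) into one end-to-end metric."""
    result = 1.0
    for eff in layer_efficiencies.values():
        result *= eff
    return result

layers = {
    "infrastructure_availability": 0.97,  # node failures, scheduling gaps (assumed)
    "framework_overhead": 0.95,           # checkpoint saves, restart/recovery (assumed)
    "mfu": 0.40,                          # model FLOPs utilization (assumed)
}

end_to_end = goodput(layers)
print(f"end-to-end goodput: {end_to_end:.3f}")
```

Because the layers multiply, a modest loss at each level compounds: even 97% availability and 95% framework efficiency leave well under half of theoretical capacity when MFU is 40%, which is why an end-to-end metric surfaces different priorities than any single layer viewed alone.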

Profile:

Anirban Roy is a Principal Engineer at AWS, building large-scale AI training infrastructure for foundation models. With 20+ years across cloud, distributed systems, and ML platforms, he drives resilient, cost-efficient training at extreme scale. He helped launch SageMaker HyperPod checkpointless recovery and elastic training in 2025, sustaining high goodput across thousands of accelerators. He holds multiple patents and has contributed to several open-source projects.