2604.00735 Checkpoint Overhead Dominates Fault-Tolerant LLM Training Cost Below 1,000 GPUs
Fault-tolerant LLM training requires periodic checkpointing. We analyze the cost structure for clusters of 64 to 4,096 GPUs, comparing checkpoint overhead against failure recovery cost.
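
The trade-off between checkpoint overhead and failure recovery cost can be illustrated with a minimal sketch. It assumes the classical first-order model due to Young, which is not necessarily the model used in this paper. The function names, the independent-failure assumption, the constant checkpoint cost, and all numeric inputs are hypothetical placeholders chosen for illustration, not values from the paper.

```python
import math


def system_mtbf(per_gpu_mtbf_s: float, num_gpus: int) -> float:
    """Job-level MTBF, assuming independent failures with equal per-GPU rates."""
    return per_gpu_mtbf_s / num_gpus


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fractions(checkpoint_cost_s: float, restart_cost_s: float,
                       mtbf_s: float, interval_s: float) -> tuple[float, float]:
    """Return approximate (checkpoint, recovery) overhead fractions.

    checkpoint: time spent writing checkpoints per unit of compute time.
    recovery:   expected time lost per failure (half an interval of redone
                work plus the restart cost), divided by the MTBF.
    """
    checkpoint = checkpoint_cost_s / interval_s
    recovery = (interval_s / 2.0 + restart_cost_s) / mtbf_s
    return checkpoint, recovery


if __name__ == "__main__":
    # Placeholder inputs for illustration only; none are taken from the paper.
    PER_GPU_MTBF_S = 5.0e7     # hypothetical per-GPU mean time between failures
    CHECKPOINT_COST_S = 60.0   # hypothetical time to write one checkpoint
    RESTART_COST_S = 300.0     # hypothetical time to reload and resume
    FIXED_INTERVAL_S = 1800.0  # hypothetical fixed checkpoint interval

    for n in (64, 256, 1024, 4096):
        mtbf = system_mtbf(PER_GPU_MTBF_S, n)
        ckpt, rec = overhead_fractions(
            CHECKPOINT_COST_S, RESTART_COST_S, mtbf, FIXED_INTERVAL_S
        )
        tau = young_interval(CHECKPOINT_COST_S, mtbf)
        print(f"{n:>5} GPUs  young_interval={tau:8.0f}s  "
              f"checkpoint={ckpt:.2%}  recovery={rec:.2%}")
```

In this simplified model, at a fixed checkpoint interval the checkpoint term does not depend on GPU count, while the recovery term grows linearly with it. Small clusters are therefore checkpoint-dominated, and where the two terms cross depends entirely on the assumed inputs.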