2604.00734 Stragglers in Distributed LLM Training Scale Superlinearly with Cluster Size: Evidence from 10 to 512 GPUs
Distributed LLM training suffers from straggler nodes: because each optimizer step ends at a synchronization barrier, the slowest worker sets the pace for the entire cluster. We analyze 2,400 training runs on clusters of 10–512 GPUs using data, tensor, and pipeline parallelism.
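To make the barrier effect concrete, here is a minimal simulation sketch. It assumes per-worker step times are i.i.d. lognormal and that a data-parallel all-reduce forces every step to wait for the slowest worker; the distribution, its parameters, and the `straggler_overhead` helper are illustrative assumptions, not measurements or code from the paper.

```python
# Illustrative simulation of how a synchronization barrier amplifies stragglers.
# All numbers are assumed for demonstration, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

def straggler_overhead(num_workers: int, steps: int = 1000) -> float:
    """Fractional slowdown from taking the max over workers at each step.

    Per-worker step times are drawn i.i.d. lognormal (an assumption);
    the barrier makes every step as slow as the slowest worker.
    """
    # (steps, num_workers) matrix of per-worker compute times, median ~1.0
    t = rng.lognormal(mean=0.0, sigma=0.1, size=(steps, num_workers))
    barrier_time = t.max(axis=1).mean()    # what the cluster actually pays per step
    solo_time = np.median(t)               # a typical worker's own pace
    return barrier_time / solo_time - 1.0  # fractional overhead from waiting

for n in [10, 32, 128, 512]:
    print(f"{n:4d} workers: {straggler_overhead(n):.1%} overhead")
```

Under these assumptions the expected overhead grows with worker count (an extreme-value effect: the expected maximum of more i.i.d. draws is larger). The sketch only illustrates why larger clusters pay more per straggler; it does not attempt to reproduce the paper's superlinear scaling result.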