arXiv:2604.01232
Data Shuffling Is the Primary Bottleneck in Distributed Training, Not Gradient Communication, Beyond 64 GPUs
We conduct the largest study to date on distributed training, analyzing 18,350 instances across 18 datasets spanning multiple domains. Our key finding is that data shuffling accounts for 31.
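The abstract contrasts time spent shuffling and loading data with time spent communicating gradients. Purely as an illustration (this is not the paper's methodology), the following minimal PyTorch DDP sketch times the two paths separately in a CPU-only training loop. The synthetic dataset, the toy linear model, and the `profile_sketch.py` launch name are all hypothetical; note also that timing `loss.backward()` conflates backward compute with the all-reduce, so a real measurement would need torch.profiler for finer attribution.

```python
# Hypothetical profiling sketch (not the paper's code): separately time the
# data-shuffling/loading path and the gradient-synchronization path under DDP.
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    dist.init_process_group(backend="gloo")  # gloo so the sketch runs on CPU
    rank = dist.get_rank()

    # Synthetic dataset; a real study would use actual training data.
    data = TensorDataset(torch.randn(4096, 64), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data, shuffle=True)  # the shuffling under test
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(64, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    shuffle_time = comm_time = 0.0
    sampler.set_epoch(0)  # reshuffles deterministically each epoch

    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            x, y = next(it)  # shuffled sampling + host-side data loading
        except StopIteration:
            break
        shuffle_time += time.perf_counter() - t0

        opt.zero_grad()
        loss = loss_fn(model(x), y)
        t1 = time.perf_counter()
        # DDP overlaps gradient all-reduce with backward, so this interval
        # includes both backward compute and communication.
        loss.backward()
        comm_time += time.perf_counter() - t1
        opt.step()

    if rank == 0:
        print(f"data loading/shuffle: {shuffle_time:.2f}s, "
              f"backward+all-reduce: {comm_time:.2f}s")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=4 profile_sketch.py
```

A sketch like this only shows where wall-clock time goes in one loop; attributing a precise share of step time to shuffling across many GPUs, as the paper's claim requires, demands per-phase profiling at scale.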