The Repetition Advantage in Long-CoT SFT is a Termination Effect
Recent work shows that in long chain-of-thought (CoT) supervised fine-tuning (SFT), training for many epochs on a small dataset substantially outperforms single-epoch training on a larger dataset, a counterintuitive “repetition advantage.” We investigate whether this advantage reflects improved reasoning or merely better output-termination behavior. Using a diagnostic framework that decomposes accuracy into ParseRate (the fraction of parseable outputs) and Acc|Parse (accuracy conditional on parsing), so that Accuracy = ParseRate × Acc|Parse, we demonstrate that the repetition advantage is primarily a termination effect. On AIME benchmarks, the accuracy gap between the repetition and data-scaling conditions reverses when conditioning on successful parsing, with mediation fractions exceeding 1.0, indicating that data scaling actually produces better reasoning once both models terminate properly. We propose Termination-Aware SFT, which increases the loss weight on termination tokens; it improves accuracy by 2.0 percentage points over standard SFT while recovering only 14% of the repetition advantage. Our findings suggest that apparent reasoning improvements from data repetition may largely reflect format learning rather than enhanced reasoning capabilities.
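To make the decomposition concrete: since an unparseable output scores zero, overall accuracy factors as Accuracy = ParseRate × Acc|Parse, and the mediation fraction asks how much of an accuracy gap the ParseRate channel alone would produce. The sketch below illustrates this; the record fields (`parsed`, `correct`) and this particular mediation estimator are illustrative assumptions, and the paper's exact estimator may differ.

```python
# Sketch of the ParseRate / Acc|Parse decomposition (field names assumed).

def decompose(records):
    """Return (accuracy, parse_rate, acc_given_parse) for per-problem records."""
    parsed = [r for r in records if r["parsed"]]
    parse_rate = len(parsed) / len(records)
    acc_given_parse = (
        sum(r["correct"] for r in parsed) / len(parsed) if parsed else 0.0
    )
    return parse_rate * acc_given_parse, parse_rate, acc_given_parse


def mediation_fraction(repetition_records, scaling_records):
    """Share of the repetition-vs-scaling accuracy gap explained by ParseRate.

    Holds Acc|Parse fixed at the data-scaling model's value; a value above 1.0
    means termination explains more than the whole gap, i.e. the scaling model
    reasons better conditional on parsing (one standard formulation, assumed).
    """
    acc_r, pr_r, _ = decompose(repetition_records)
    acc_s, pr_s, agp_s = decompose(scaling_records)
    gap = acc_r - acc_s
    return (pr_r - pr_s) * agp_s / gap if gap else float("nan")
```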
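Termination-Aware SFT is described only as increasing the loss weight on termination tokens. Below is a minimal PyTorch sketch of one way to implement that reweighting; the weight value (5.0) and how the termination mask is built (e.g. marking the EOS token and the closing answer-format tokens) are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def termination_aware_loss(logits, labels, term_mask, term_weight=5.0):
    """Token-level cross-entropy with extra weight on termination tokens.

    logits:    (batch, seq, vocab) model outputs
    labels:    (batch, seq) target token ids, -100 at positions to ignore
    term_mask: (batch, seq) bool, True where the target belongs to the
               termination region (EOS / answer-closing tokens; assumed)
    """
    # Per-token loss, keeping the (batch, seq) shape for reweighting.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.shape)
    # Upweight termination positions, zero out ignored positions.
    weights = torch.ones_like(labels, dtype=per_token.dtype)
    weights = weights.masked_fill(term_mask, term_weight)
    weights = weights * (labels != -100)
    return (per_token * weights).sum() / weights.sum()
```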