Diff-Aware Fine-Tuning for Repository-Scale Coding Agents
1. Introduction
State-of-the-art coding agents on benchmarks like SWE-bench [Jimenez et al. 2024] still routinely produce edits that touch unrelated files, drop trailing newlines, or rewrite unchanged regions. We argue this is partly an artifact of the dominant fine-tuning recipe: predict the entire post-edit file given the pre-edit file and an instruction.
Humans, however, mentally manipulate diffs, not full files. We propose DAFT — Diff-Aware Fine-Tuning — which makes the diff the first-class training signal.
2. Background
A unified diff $D$ over a file pair $(F, F')$ is a sequence of hunks

$$D = (h_1, \dots, h_k), \qquad h_i = (r_i, \Delta_i^-, \Delta_i^+),$$

where $r_i$ is the anchor line range and $\Delta_i^-, \Delta_i^+$ are the removed and added line blocks. Writing $F$ for the pre-edit file, $F'$ for the post-edit file, and $x$ for the instruction, the conventional fine-tuning loss

$$\mathcal{L}_{\text{file}} = -\log p_\theta(F' \mid F, x)$$

spreads its training signal across long unchanged regions of $F'$, where the model has nothing to learn.
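As a concrete (if simplified) picture of this structure, the following sketch represents a hunk and a diff in Python; the field names and the per-hunk file path are illustrative assumptions, not notation fixed by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Hunk:
    """One hunk h_i = (r_i, removed, added) of a unified diff."""
    file_path: str   # file the hunk applies to
    start_line: int  # anchor: first pre-edit line of the range r_i
    length: int      # anchor: number of pre-edit lines covered (incl. context)
    removed: list[str] = field(default_factory=list)  # lines deleted from F
    added: list[str] = field(default_factory=list)    # lines inserted into F'

# A diff D is an ordered sequence of hunks.
Diff = list[Hunk]

# The src/util.py hunk shown in Section 3, written out in this form.
example: Diff = [
    Hunk(
        file_path="src/util.py",
        start_line=42,
        length=2,
        removed=["    return x.strip()"],
        added=["    if x is None:", "        return None", "    return x.strip()"],
    )
]
```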
3. Method
DAFT replaces this with
$$\mathcal{L}_{\text{DAFT}} = -\log p_\theta(D \mid F, x) + \lambda \, \Omega(D),$$

where $\Omega(D)$ is a locality penalty discouraging hunks that span unrelated files unless the instruction explicitly authorizes them. We set $\lambda$ via grid search on a 5% dev split.
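The following is a minimal sketch of how this objective could be assembled for a single example, assuming a Hugging Face-style causal LM that returns per-token logits; the function name, tensor layout, and the way `omega_value` is obtained (e.g. from a sampled model diff) are assumptions, not the paper's implementation.

```python
import torch.nn.functional as F

def daft_loss(model, input_ids, diff_mask, omega_value, lam=0.1):
    """One-example sketch of L_DAFT = -log p_theta(D | F, x) + lam * Omega(D).

    input_ids : (1, T) prompt tokens (pre-edit file F + instruction x)
                followed by the serialized diff D.
    diff_mask : (1, T) tensor, 1 on tokens belonging to D and 0 elsewhere,
                so the negative log-likelihood covers only the diff.
    omega_value : precomputed locality penalty for this example.
    """
    logits = model(input_ids).logits              # (1, T, vocab)
    shift_logits = logits[:, :-1, :]              # predict token t+1 from t
    shift_labels = input_ids[:, 1:]
    shift_mask = diff_mask[:, 1:].float()
    nll_per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    # Average NLL over diff tokens only, then add the locality penalty.
    nll = (nll_per_token * shift_mask.reshape(-1)).sum() / shift_mask.sum()
    return nll + lam * omega_value                # lam chosen by grid search
```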
During training we serialize diffs in a canonical form (sorted by file, then by line) and tokenize at line granularity to keep hunk boundaries explicit.
```diff
--- a/src/util.py
+++ b/src/util.py
@@ -42,2 +42,4 @@
 def parse(x):
-    return x.strip()
+    if x is None:
+        return None
+    return x.strip()
```

The locality penalty counts the number of distinct files touched, normalized by an instruction-derived expected file count produced by a small auxiliary classifier.
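To pin down one plausible reading of that counting rule, here is a small sketch that reuses the `Hunk` representation from Section 2; the exact normalization is an assumption, not the paper's code.

```python
def locality_penalty(diff_hunks, expected_file_count):
    """Omega(D): distinct files touched by the diff, normalized by the
    instruction-derived expected file count from the auxiliary classifier.

    The exact normalization is not spelled out above; this takes the plain
    ratio as one plausible reading.
    """
    touched_files = {h.file_path for h in diff_hunks}
    return len(touched_files) / max(expected_file_count, 1)
```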
4. Experimental Setup
Training data. 2.4M triples mined from public GitHub PRs, filtered for a passing test suite on the post-edit state and a bound on instruction length (in tokens).
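A minimal sketch of what such a filter predicate could look like; the token cap, the direction of the length bound, the whitespace tokenization, and the function name are placeholders rather than details of the actual mining pipeline.

```python
def keep_example(instruction, tests_pass_after_edit, max_instruction_tokens=512):
    """Keep a PR-derived training triple only if the project's tests pass on
    the post-edit state and the instruction stays within a token budget.
    The 512 cap and whitespace token count are stand-ins, not paper values."""
    if not tests_pass_after_edit:
        return False
    approx_tokens = len(instruction.split())  # stand-in for a real tokenizer
    return approx_tokens <= max_instruction_tokens
```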
Base model. A 14B-parameter open-weight code LLM. We compare three settings:
- Baseline: next-file-prediction fine-tuning on the same corpus.
- Diff-only: predicts diffs but without the locality penalty $\Omega(D)$.
- DAFT: full objective.
Evaluation. SWE-bench-Verified-mini and an internal Python repository with 38 curated tasks.
5. Results
| Variant | SWE-bench-mini pass@1 | Internal pass@1 | Unrelated-file edit rate |
|---|---|---|---|
| Baseline | 31.4% | 36.8% | 17.2% |
| Diff-only | 36.2% | 40.5% | 8.9% |
| DAFT | 38.5% | 43.7% | 6.4% |
DAFT's 7.1 absolute-point gain over the baseline on SWE-bench-mini was statistically significant under a McNemar test. The reduction in unrelated-file edits, from 17.2% to 6.4%, directly targets the behavior reviewers most often complain about.
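For readers who want to reproduce the significance check: McNemar's test on paired per-task pass/fail outcomes needs only the two discordant counts. The sketch below uses the exact binomial form via SciPy; whether the exact or the chi-square variant was used above is not stated, so treat this as one reasonable choice.

```python
from scipy.stats import binomtest

def mcnemar_exact(baseline_pass: list[bool], daft_pass: list[bool]) -> float:
    """Exact McNemar test on paired pass@1 outcomes for the same task set.

    b = tasks the baseline solves but DAFT does not,
    c = tasks DAFT solves but the baseline does not.
    Under the null of no difference, b ~ Binomial(b + c, 0.5).
    """
    b = sum(1 for x, y in zip(baseline_pass, daft_pass) if x and not y)
    c = sum(1 for x, y in zip(baseline_pass, daft_pass) if y and not x)
    return binomtest(b, n=b + c, p=0.5).pvalue
```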
6. Analysis
Where DAFT helps most. Single-file, multi-hunk fixes (+9.4 points) and bug fixes embedded in a stable surrounding context.
Where DAFT helps little. Tasks requiring large structural refactors that effectively rewrite a file: here the diff and the post-file are essentially equivalent, and the locality penalty is uninformative.
7. Limitations
Our diff serialization is line-level, which under-represents very fine-grained edits (e.g., single-character fixes). We did not evaluate non-Python languages at scale; spot-checks on TypeScript suggested similar trends but smaller effect sizes (+3.8 points).
8. Conclusion
Making the diff a first-class training target — rather than a derived artifact of next-file prediction — yields measurable improvements in coding-agent quality and, just as importantly, sharply reduces the kinds of side-effects that erode reviewer trust. We release the data filter pipeline and a reference implementation.
References
- Jimenez, C. E. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- Zhang, T. et al. (2025). Edit-Aware Code Models.
- Aiken, A. (2019). A Theory of Diff Algorithms.
- Chen, M. et al. (2021). Evaluating Large Language Models Trained on Code (Codex).