Coverage-Aware Test-Case Synthesis Using Large Language Models
1. Introduction
LLMs can write plausible unit tests but tend to cover the same paths a typical human developer would: happy paths and the most obvious error case. Branches behind subtle preconditions, defensive checks, or rarely-true configuration flags remain untested. The result is high line coverage but mediocre branch coverage and weak fault detection.
We present CovSyn, a coverage-aware test-synthesis loop that closes this gap.
2. Background
2.1 Coverage criteria
Let $B$ be the set of branches in the program under test and $T$ a test suite. Branch coverage is

$$\mathrm{cov}(T) = \frac{|\{\, b \in B : T \text{ executes } b \,\}|}{|B|}.$$
Mutation score is the fraction of injected mutants killed by the test suite. Both are imperfect proxies for fault-detection power, but together they are stronger than either alone.
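Concretely, with $M$ the set of injected mutants, $K(T) \subseteq M$ those killed by suite $T$, and $E \subseteq M$ the equivalent mutants (a standard formulation; the section above does not spell the formula out):

$$\mathrm{MS}(T) = \frac{|K(T)|}{|M| - |E|}.$$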
2.2 Existing LLM test-gen pipelines
Most existing pipelines [Lemieux et al. 2023, Schäfer et al. 2024] produce tests in a single shot conditioned on the function source. They typically achieve 55-72% branch coverage on common benchmarks before plateauing.
3. Method
CovSyn is a three-stage loop:
3.1 Initial generation
The LLM is prompted with the source under test plus type information; it produces a first batch of tests.
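A minimal sketch of what this prompt assembly might look like; the prompt wording and the idea of harvesting `__annotations__` for type information are illustrative assumptions, not CovSyn's exact interface:

```python
import inspect

def initial_tests_prompt(module) -> str:
    """Build a first-round prompt from module source plus type hints.

    Illustrative sketch only: the real prompt wording is not specified
    in the paper; this shows one plausible assembly.
    """
    source = inspect.getsource(module)
    # Collect type annotations visible on module-level callables.
    annotations = {
        name: getattr(obj, "__annotations__", {})
        for name, obj in vars(module).items()
        if callable(obj)
    }
    return (
        "You are writing pytest unit tests.\n"
        "Module under test:\n" + source + "\n"
        f"Known type annotations: {annotations}\n"
        "Write a batch of tests covering normal and error behavior."
    )
```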
3.2 Coverage feedback
A coverage tool runs the tests and emits the uncovered branches with their syntactic locations and the predicates guarding them.
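One concrete way to obtain such a report with `coverage.py`, sketched under the assumption that its JSON report exposes per-file `missing_branches` arcs when branch measurement is on (true in recent versions); CovSyn's actual tooling is not specified at this level:

```python
import json
import subprocess

def uncovered_branches(test_dir="tests"):
    """Run the suite under branch coverage; return uncovered branch arcs."""
    # Tests may fail; we still want the coverage data, so check=False.
    subprocess.run(
        ["coverage", "run", "--branch", "-m", "pytest", test_dir],
        check=False,
    )
    subprocess.run(["coverage", "json", "-o", "cov.json"], check=True)
    with open("cov.json") as f:
        report = json.load(f)
    missing = []
    for path, data in report["files"].items():
        # Each entry is a (source_line, destination_line) arc that never ran.
        for src, dst in data.get("missing_branches", []):
            missing.append((path, src, dst))
    return missing
```

Mapping an uncovered arc back to its guarding predicate would additionally require an AST lookup at the source line; that step is omitted here.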
3.3 Targeted re-prompting
For each uncovered branch $b$, we construct a focused prompt:

```text
The following branch is currently uncovered:
  file:  util/parse.py, line 84
  guard: x is None and config.strict
Write ONE test that exercises this branch. Use the existing fixtures.
```

The LLM proposes a test; we run it; if it covers $b$, we accept it. We loop until either coverage plateaus or the budget is exhausted.
```python
def covsyn(module, llm, tool, max_rounds=6, target=0.95):
    """Coverage-aware synthesis loop: generate, measure, re-prompt."""
    tests = llm.initial_tests(module)          # stage 3.1: initial batch
    for _ in range(max_rounds):                # budget: max_rounds rounds
        report = tool.run(tests)               # stage 3.2: coverage feedback
        if report.branch_coverage >= target:
            break                              # close enough; stop early
        progressed = False
        for branch in report.uncovered:
            # Stage 3.3: one focused prompt per uncovered branch.
            t = llm.target_branch(module, branch)
            if tool.covers(t, branch):
                tests.append(t)
                progressed = True
        if not progressed:
            break                              # coverage plateaued
    return tests
```

4. Experimental Setup
Projects. Four open-source Python projects:
- `parselib` (4,200 LoC, 380 branches)
- `httpkit` (5,100 LoC, 510 branches)
- `dateops` (3,800 LoC, 340 branches)
- `taskq` (6,300 LoC, 590 branches)
Baselines. (a) LLM-only single-shot generation. (b) Pynguin [Lukasczyk et al. 2020], a search-based tool.
Budget. Test-generation tokens normalized to the LLM-only baseline for fair cost comparison.
5. Results
| Method | Avg. branch coverage | Mutation score (%) | Cost (rel.) |
|---|---|---|---|
| LLM-only | 67.4% | 41.2 | 1.0x |
| Pynguin | 71.1% | 44.0 | 1.4x |
| CovSyn (ours) | 89.1% | 63.8 | 1.9x |
Differences over LLM-only were statistically significant on all four projects (paired bootstrap).
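For readers who want to run this kind of test themselves, a minimal paired-bootstrap sketch; the per-unit score pairs and resample count are illustrative, as the paper does not specify them:

```python
import random

def paired_bootstrap(ours, baseline, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap test of the mean difference.

    `ours` and `baseline` are per-unit scores (e.g., per-module branch
    coverage) measured on the same units. Returns the fraction of
    resamples whose mean difference is <= 0, i.e., a one-sided p-value.
    """
    assert len(ours) == len(baseline)
    diffs = [a - b for a, b in zip(ours, baseline)]
    rng = random.Random(seed)
    n = len(diffs)
    worse = 0
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            worse += 1
    return worse / n_resamples
```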
6. Analysis of Residual Uncovered Branches
The remaining 11% of uncovered branches fell into three buckets:
- Genuinely unreachable code (3.4%). E.g., a Python `else` after an exhaustive enum match.
- State-coupled branches (4.7%). The branch requires a multi-step interaction with the module's internal state that the LLM did not infer from the source alone.
- External-resource branches (2.9%). Branches reachable only with network or filesystem failures the test environment does not provide.
7. Discussion
Coverage feedback is a strong signal precisely because it converts an open-ended task ("write good tests") into a closed-loop one ("cover this branch"). The remaining gap is not a coverage problem but an understanding problem: state-coupled branches require deeper model-of-the-program reasoning.
A potential failure mode: the LLM can satisfy a coverage target with a test that executes the branch but does not assert anything meaningful. We mitigate this by also tracking mutation score, which penalizes such trivial tests.
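A lightweight version of this gate might look like the sketch below. It is illustrative only: `mutation_tool.kills` is a hypothetical interface, and the paper does not specify how the mutation check is wired into the loop.

```python
def accept_test(t, branch, tool, mutation_tool, min_kills=1):
    """Accept a synthesized test only if it is more than a coverage probe.

    Hypothetical gate: the test must cover its target branch AND kill at
    least `min_kills` mutants, so a test that executes the branch but
    asserts nothing meaningful is rejected.
    `mutation_tool.kills` is an assumed interface, not a real library call.
    """
    if not tool.covers(t, branch):
        return False
    return mutation_tool.kills(t) >= min_kills
```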
8. Limitations
Our evaluation is Python-only. C++ and Java have richer branch semantics (e.g., template instantiations, exception paths) that may interact with the loop differently. We also did not evaluate on flaky test suites, where coverage feedback itself becomes noisy.
9. Conclusion
Coupling LLM test generation with coverage feedback meaningfully closes the gap between human-quality and machine-generated test suites at a modest cost premium. The residual uncovered branches point to a real research direction: stateful program understanding for test synthesis.
References
- Lemieux, C. et al. (2023). CodaMosa: Escaping Coverage Plateaus with LLMs.
- Schäfer, M. et al. (2024). Empirical Evaluation of LLM-Based Test Generation.
- Lukasczyk, S. et al. (2020). Pynguin: Automated Unit Test Generation for Python.
- Just, R. et al. (2014). The Major Mutation Framework.