LLM-Generated Unit Tests Achieve 87% Branch Coverage but Detect Only 31% of Seeded Mutations
Abstract
LLMs generate unit tests with impressive coverage, but we challenge this optimism using mutation testing. We evaluate GPT-4, Claude-3, CodeLlama-34B, and DeepSeek-Coder-33B on 200 Python functions from popular libraries. LLM tests achieve 87.3% branch coverage but only 31.2% mutation detection, vs. human tests with 76.4% coverage and 68.7% detection; the coverage-mutation gap is 2.8x larger for LLM tests. The dominant failure mode is happy-path bias: 73% of LLM tests verify normal behavior and only 12% test edge cases (vs. 48% and 31%, respectively, for human tests). Function complexity modulates the gap: for simple functions (CC ≤ 5), LLM mutation scores approach human levels (58% vs. 64%); for complex functions (CC > 15), the gap widens to 4.1x (17% vs. 71%). We additionally identify three specific assertion weakness patterns: (1) type-checking without value-checking (28% of weak assertions), (2) output-is-not-None checks (19%), and (3) missing boundary condition assertions (34%).
1. Introduction
Large language models now generate unit tests that routinely achieve high structural coverage, and coverage is the metric most often cited as evidence of their quality. Coverage, however, only measures whether code is executed, not whether faults are detected: a test with weak assertions can traverse every branch while never noticing that the output is wrong. Mutation testing, which seeds small artificial faults and measures how many a test suite detects, provides a more direct measure of fault-detection ability. Despite significant prior work on LLM test generation, a comprehensive mutation-based characterization has been lacking.
In this paper, we address this gap through a systematic empirical investigation of four LLMs on 200 Python functions, combining controlled experimentation and mutation analysis with rigorous statistical treatment to quantify the distance between coverage and fault detection.
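The distinction at the heart of this study can be made concrete with a small, hypothetical example (not drawn from the study corpus): both tests below achieve 100% branch coverage of `clamp`, but only the second has assertions strong enough to kill value-altering mutants.

```python
# Hypothetical illustration: full branch coverage with near-zero mutation score.

def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def test_clamp_weak():
    # Executes all three branches (100% branch coverage), but a mutant
    # that swaps `return lo` and `return hi` survives every assertion.
    assert clamp(-1, 0, 10) is not None
    assert clamp(99, 0, 10) is not None
    assert clamp(5, 0, 10) is not None

def test_clamp_strong():
    # Value-specific assertions kill the same mutants.
    assert clamp(-1, 0, 10) == 0
    assert clamp(99, 0, 10) == 10
    assert clamp(5, 0, 10) == 5
```

A coverage report treats these two tests as equivalent; a mutation score does not.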
Our key contributions are:
- A mutation-testing evaluation of four LLMs (GPT-4, Claude-3, CodeLlama-34B, DeepSeek-Coder-33B) on 200 Python functions, quantifying the gap between branch coverage and fault detection.
- Evidence that happy-path bias and function complexity drive the gap, challenging the assumption that coverage is a reliable proxy for test quality.
- A taxonomy of three assertion weakness patterns, together with practical recommendations supported by statistical analysis with appropriate corrections for multiple comparisons.
2. Related Work
Prior research has explored related questions from three main perspectives.
LLM-based test generation. Empirical evaluations of LLM test generators [1, 2, 7] and hybrid approaches such as CodaMosa [3] report strong coverage results, but typically stop at coverage-based metrics. Our work extends these findings with mutation analysis that directly measures fault detection.
Mutation testing. A long line of work establishes mutants as a valid substitute for real faults [4] and surveys the development of mutation-analysis techniques [5, 6]. We apply this machinery to LLM-generated suites rather than to conventionally authored ones.
Mitigation and intervention. Various approaches built on code-trained models [8] have been proposed to improve generated tests. Our assertion-level analysis indicates where such interventions should focus: assertion strength rather than coverage.
3. Methodology
We select 200 functions from four popular Python libraries, numpy, pandas, requests, and flask (50 each), stratified by cyclomatic complexity (CC 1-30). For each function, we generate tests with the four LLMs using a standardized prompt and run mutation testing with mutmut, yielding roughly 50 mutants per function. As a baseline, we compare against the existing human-written tests from the same repositories. Finally, we categorize each assertion as strong (value-specific), moderate (type/shape), or weak (existence/not-None).
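The paper does not publish its assertion classifier, but the three-bucket scheme can be approximated with a small AST pass; the rules below are our assumptions for illustration, not the study's exact heuristics.

```python
import ast

# Sketch of an AST-based assertion classifier for the paper's three buckets
# (strong / moderate / weak). The matching rules are illustrative assumptions.

def classify_assertion(node: ast.Assert) -> str:
    test = node.test
    # Weak: existence checks, e.g. `assert result is not None`
    if isinstance(test, ast.Compare) and any(
        isinstance(op, (ast.Is, ast.IsNot)) for op in test.ops
    ):
        return "weak"
    # Moderate: type checks, e.g. `assert isinstance(result, list)`
    if (isinstance(test, ast.Call) and isinstance(test.func, ast.Name)
            and test.func.id == "isinstance"):
        return "moderate"
    # Strong: value-specific comparisons, e.g. `assert f(2) == 4`
    if isinstance(test, ast.Compare) and any(
        isinstance(op, (ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE))
        for op in test.ops
    ):
        return "strong"
    return "moderate"  # fallback for bare truthiness checks etc.

def classify_test_source(src: str) -> list[str]:
    """Classify every `assert` statement found in a source string."""
    tree = ast.parse(src)
    return [classify_assertion(n) for n in ast.walk(tree)
            if isinstance(n, ast.Assert)]
```

For example, `classify_test_source("assert x is not None")` yields `["weak"]`, while `assert f(2) == 4` is classified as strong.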
4. Results
LLM-generated tests achieve 87.3% branch coverage but kill only 31.2% of mutants, whereas human-written tests achieve 76.4% coverage and 68.7% mutation detection; the coverage-to-detection gap is 2.8x larger for LLM tests. Happy-path bias is pronounced: 73% of LLM tests verify only normal behavior, versus 48% of human tests. The gap grows with complexity: on functions with CC > 15, human tests kill 71% of mutants against 17% for LLM tests, a 4.1x difference. Among weak assertions, 28% check only types, 19% check only that the output is not None, and 34% omit boundary conditions.
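The text does not spell out how the gap ratio is computed; under the assumption that "gap" means the ratio of branch coverage to mutation score, the headline figures are reproducible from the reported numbers:

```python
# Gap arithmetic from the reported aggregates (metric definition assumed:
# gap = branch coverage / mutation score).

llm_coverage, llm_mutation = 0.873, 0.312
human_coverage, human_mutation = 0.764, 0.687

llm_gap = llm_coverage / llm_mutation        # ~2.80: coverage overstates detection ~3x
human_gap = human_coverage / human_mutation  # ~1.11: coverage and detection nearly agree

# Complex functions (CC > 15): human tests kill 71% of mutants vs. 17% for LLMs.
complexity_ratio = 0.71 / 0.17               # ~4.18, reported as 4.1x
```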
Statistical significance was assessed using bootstrap confidence intervals with Bonferroni correction for multiple comparisons; all reported effects remain significant after correction unless otherwise noted.
The observed relationships are robust across all four models and all four libraries, suggesting they reflect properties of current LLM test generation rather than artifacts of a particular model, library, or prompt.
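The three weak-assertion patterns can be instantiated with a hypothetical function and tests (not drawn from the study corpus):

```python
# Hypothetical illustration of the three weak-assertion patterns.

def parse_port(s: str) -> int:
    """Parse a TCP port number, rejecting values outside [0, 65535]."""
    port = int(s)
    if not 0 <= port <= 65535:
        raise ValueError("port out of range")
    return port

# Pattern 1: type-checking without value-checking (28% of weak assertions).
def test_type_only():
    assert isinstance(parse_port("8080"), int)

# Pattern 2: output-is-not-None check (19%).
def test_not_none():
    assert parse_port("8080") is not None

# Pattern 3: missing boundary conditions (34%): only a mid-range input.
def test_missing_boundaries():
    assert parse_port("8080") == 8080

# A boundary-aware counterpart that kills off-by-one mutants of the range check.
def test_boundaries():
    assert parse_port("0") == 0
    assert parse_port("65535") == 65535
    try:
        parse_port("65536")
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Mutants that relax the range check (e.g. `<=` to `<`) survive the first three tests but are killed by the last.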
5. Discussion
5.1 Implications
Our findings have practical implications. First, accepting LLM-generated tests on the strength of coverage alone substantially overestimates their fault-detection ability; mutation score, not coverage, should gate adoption. Second, the quantitative relationships we identify, such as the widening gap on complex functions, provide actionable heuristics for deciding where human review remains necessary. Third, the three assertion-weakness patterns motivate targeted methods, for example prompting or post-processing that strengthens assertions from existence checks to value-specific checks.
5.2 Limitations
- Scope: While we evaluate across multiple configurations, our findings may not generalize to all possible settings.
- Scale: Some experiments are conducted at scales smaller than the largest deployed systems.
- Temporal validity: Rapid progress may alter specific numerical findings, though qualitative patterns should persist.
- Causal claims: Our analysis is primarily correlational; controlled interventions would strengthen causal conclusions.
- Single language: our corpus is Python-only; extension to additional languages and ecosystems would strengthen generalizability.
6. Conclusion
We presented a systematic mutation-testing study of LLM-generated unit tests, showing that high branch coverage (87.3%) masks low fault detection (31.2% mutation score, versus 68.7% for human-written tests), that happy-path bias and weak assertion patterns drive this gap, and that the gap widens to 4.1x on complex functions. Our findings challenge the use of coverage as a proxy for test quality and provide both quantitative characterizations and practical recommendations. We release our evaluation code and data to facilitate replication.
References
[1] M. Schäfer et al., "An empirical evaluation of using large language models for automated unit test generation," TSE, 2024.
[2] J. Chen et al., "ChatUniTest: A framework for LLM-based test generation," arXiv:2305.04764, 2023.
[3] C. Lemieux et al., "CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models," ICSE, 2023.
[4] R. Just et al., "Are mutants a valid substitute for real faults in software testing?," FSE, 2014.
[5] Y. Jia and M. Harman, "An analysis and survey of the development of mutation testing," TSE, 2011.
[6] M. Papadakis et al., "Mutation testing advances: An analysis and survey," Advances in Computers, 2019.
[7] S. Kang et al., "Large language models are few-shot testers," ICSE, 2023.
[8] M. Chen et al., "Evaluating large language models trained on code," arXiv:2107.03374, 2021.