2604.00728 LLM-Generated Unit Tests Achieve 87% Branch Coverage but Detect Only 31% of Seeded Mutations
LLMs generate unit tests with impressive branch coverage, but coverage only shows that code was executed, not that a fault in it would be caught. We challenge this optimism using mutation testing, evaluating GPT-4, Claude-3, CodeLlama-34B, and DeepSeek-Coder-33B on 200 Python functions from popular libraries.
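To make the gap between coverage and mutation detection concrete, the following is a minimal sketch, not the evaluation harness used in this work. The function `shipping_fee`, the three hand-written mutants, and both test suites are hypothetical illustrations and are not drawn from the 200-function benchmark. A mutant counts as killed when at least one assertion fails on it.

```python
from typing import Callable, List

FeeFn = Callable[[int], int]


def shipping_fee(weight_kg: int) -> int:
    """Original (hypothetical) function under test."""
    if weight_kg > 10:
        return 20
    return 5


# Hand-written mutants, each seeding one small fault into the original.
def mutant_boundary(weight_kg: int) -> int:
    if weight_kg >= 10:  # '>' replaced by '>='
        return 20
    return 5


def mutant_heavy_constant(weight_kg: int) -> int:
    if weight_kg > 10:
        return 21  # 20 replaced by 21
    return 5


def mutant_light_constant(weight_kg: int) -> int:
    if weight_kg > 10:
        return 20
    return 0  # 5 replaced by 0


MUTANTS: List[FeeFn] = [
    mutant_boundary,
    mutant_heavy_constant,
    mutant_light_constant,
]


def weak_tests(fee: FeeFn) -> None:
    """Executes both branches of the function but asserts very little."""
    assert fee(50) > 0  # takes the 'heavy' branch
    assert fee(1) > 0   # takes the 'light' branch


def strong_tests(fee: FeeFn) -> None:
    """Checks exact values, including the boundary at 10 kg."""
    assert fee(11) == 20
    assert fee(10) == 5
    assert fee(1) == 5


def mutation_score(tests: Callable[[FeeFn], None], mutants: List[FeeFn]) -> float:
    """Fraction of mutants on which the test suite raises an AssertionError."""
    tests(shipping_fee)  # the suite must pass on the unmodified function
    killed = 0
    for mutant in mutants:
        try:
            tests(mutant)
        except AssertionError:
            killed += 1
    return killed / len(mutants)


if __name__ == "__main__":
    print("weak suite:  ", mutation_score(weak_tests, MUTANTS))
    print("strong suite:", mutation_score(strong_tests, MUTANTS))
```

Both suites exercise every branch of `shipping_fee`, yet the weak suite kills only the mutant that returns 0, while the strong suite kills all three. Branch coverage cannot distinguish the two suites; mutation score can.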