Automated Discovery of LLM Failure Cases via Targeted Counterexample Search
1. Introduction
Finding inputs on which a model fails — counterexamples to claimed properties — is a core activity in both safety evaluation and capability evaluation. Manual red-teaming is high quality but expensive; random adversarial generation is cheap but misses systematic failure modes. We propose an intermediate strategy: directed search over a continuous representation of inputs, guided by a learned acceptance predicate that captures both severity and novelty of failures.
2. Problem Setup
Let $f$ be the target model and $\phi$ be a specification — a function that, given input $x$ and output $y = f(x)$, returns a real-valued violation score $\phi(x, y)$ (positive = violation). We want to find inputs $x$ that maximize $\phi(x, f(x))$ subject to a naturalness constraint and a coverage constraint.
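Spelled out as a constrained objective (a hedged formalization; the naturalness function $\mathrm{nat}$ and threshold $\tau$ are our notation, not symbols from the original):

    \max_x \; \phi(x, f(x)) \quad \text{s.t.} \quad \mathrm{nat}(x) \ge \tau \ \text{(naturalness)}, \quad \text{failures spread across many embedding cells (coverage)}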
Coverage is operationalized via a discretized embedding map: we partition embedding space into Voronoi cells and require that the search produce failures spread across many cells.
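As an illustration, the cell bookkeeping can be done by nearest-centroid assignment; the centroid matrix and `embed` function below are assumptions of this sketch, not the paper's implementation:

    import numpy as np

    def nearest_cell(e, centroids):
        """Index of the Voronoi cell (nearest centroid) for embedding e."""
        return int(np.argmin(np.linalg.norm(centroids - e, axis=1)))

    def coverage(failures, centroids, embed):
        """Fraction of embedding cells containing at least one discovered failure."""
        cells = {nearest_cell(embed(x), centroids) for x in failures}
        return len(cells) / len(centroids)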
3. Method
CXSearch maintains a population $P$ of candidate inputs, each scored by

    s(x) = \alpha \, \phi(x, f(x)) + \beta \, \mathrm{novelty}(x, P) - \gamma \, \mathrm{nat\_penalty}(x)

Novelty is the embedding distance from $x$ to the nearest successful failure already in $P$. The naturalness penalty uses a small reference language model to score how typical the input is; atypical inputs are penalized.
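A minimal sketch of these two terms, assuming an embedding function `embed`, a list `failures` of inputs already confirmed to violate the spec, and a reference-model log-likelihood `ref_logprob` (all hypothetical names, not the paper's API):

    import numpy as np

    def novelty(x, failures, embed):
        """Embedding distance from x to the nearest failure found so far;
        infinite when nothing has been found yet (everything is novel)."""
        if not failures:
            return float("inf")
        e = embed(x)
        return min(np.linalg.norm(e - embed(z)) for z in failures)

    def nat_penalty(x, ref_logprob):
        """Higher penalty for text the reference LM finds atypical:
        negative average per-token log-probability."""
        return -ref_logprob(x) / max(1, len(x.split()))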
The search loop alternates three operations (a dispatch sketch follows the list):
- Mutate. Perturb existing population members via prompted rewriting.
- Cross. Combine partial fragments from two parents using a constituency-parse splice.
- Sample. Generate fresh inputs from a coverage-conditioned generator that targets under-explored cells.
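A minimal dispatch over the three operators might look like this; `mutate`, `cross`, and `coverage_sample` are hypothetical helpers standing in for the paper's operator implementations, and the mixture weights are assumptions:

    import random

    def mutate_or_cross_or_sample(pop, p_mutate=0.5, p_cross=0.3):
        """Pick one of the three search operators at random."""
        r = random.random()
        if r < p_mutate:
            x, _, _ = random.choice(pop)        # pop holds (input, output, score)
            return mutate(x)                    # prompted rewriting
        elif r < p_mutate + p_cross and len(pop) >= 2:
            (xa, _, _), (xb, _, _) = random.sample(pop, 2)
            return cross(xa, xb)                # constituency-parse splice
        else:
            return coverage_sample(pop)         # fresh input targeting sparse cells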
    def cxsearch(spec, f, budget, alpha=1.0, beta=1.0, gamma=1.0, K=2048):
        """Directed counterexample search over a population of candidate inputs.
        The weights alpha/beta/gamma trade off severity, novelty, and naturalness."""
        pop = seed_population(spec)                   # list of (input, output, score) triples
        for _ in range(budget):
            cand = mutate_or_cross_or_sample(pop)     # one of the three operators
            y = f(cand)                               # query the target model
            score = (alpha * spec(cand, y)            # severity: violation score
                     + beta * novelty(cand, pop)      # diversity: distance to nearest known failure
                     - gamma * nat_penalty(cand))     # naturalness penalty
            pop = top_k(pop + [(cand, y, score)], K)  # keep the K highest-scoring candidates
        return pop

4. Specifications
We evaluate on seven specifications: (i) instruction-following on multi-step tasks, (ii) factual accuracy on closed-book questions with verifiable answers, (iii) refusal of explicitly prohibited categories, (iv) consistency under paraphrase, (v) numerical accuracy on arithmetic, (vi) tool-use correctness, and (vii) absence of hallucinated citations.
Each specification is implemented as a programmatic check or, where unavoidable, as an LLM-as-judge with a calibrated threshold.
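For instance, the numerical-accuracy specification (v) admits a fully programmatic check. A sketch, with a hypothetical `extract_number` parser (in practice the ground truth would be closed over, e.g. via `functools.partial`, to match the `spec(x, y)` interface):

    def arithmetic_spec(x, y, ground_truth):
        """Violation score for specification (v): positive iff the answer is wrong."""
        try:
            answer = extract_number(y)   # hypothetical parser for the model's output
        except ValueError:
            return 1.0                   # unparseable output counts as a violation
        return 1.0 if answer != ground_truth else -1.0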
5. Results
Failure discovery rate. Per 1,000-query budget, CXSearch finds more failures on average across the seven specifications than either random generation or the strongest published attack baseline (PAIR-style optimization [Chao et al. 2024]).
Diversity. The mean pairwise cosine similarity of failure embeddings is lower for CXSearch than for the baseline (lower is more diverse). Visualizations of the embedding cells covered by each method confirm that CXSearch reaches regions the baseline never explores.
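The diversity metric is a standard computation; for reference (our code, not the paper's):

    import numpy as np

    def mean_pairwise_cosine(E):
        """Mean pairwise cosine similarity of row-vector embeddings E, shape (n, d)."""
        U = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize each row
        S = U @ U.T                                        # all pairwise cosines
        n = E.shape[0]
        return (S.sum() - n) / (n * (n - 1))               # average the off-diagonal entries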
Severity. Failures discovered by CXSearch are not merely numerous — they are also more severe by judge score: median value is 0.78 versus 0.61 for the baseline.
Transfer. Failures discovered against Llama-3-70B-Instruct partially transfer to Mixtral-8x22B and to Qwen2.5-72B, consistent with prior findings that adversarial inputs partially transfer across models [Zou et al. 2023].
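Transfer rate can be measured by replaying discovered failing inputs against a second model `g` (a straightforward check, assuming the same `spec` interface as above):

    def transfer_rate(failures, spec, g):
        """Fraction of failing inputs that also violate the spec on a second model g."""
        return sum(1 for x in failures if spec(x, g(x)) > 0) / len(failures)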
6. Worked Example
For the citation-hallucination specification, CXSearch discovered that the target model reliably hallucinates when asked to cite work in a niche subfield with sparse but real literature (e.g., "the Tanaka-Mendoza theorem on graph spectra"). When the prompt mixes a real subfield with one or two invented technical terms, the model produces fluent but fabricated citations in a large fraction of trials. Random generation struggled to find the right balance of plausible-sounding context and unverifiable specifics.
7. Limitations
CXSearch is bounded by the quality of the specification $\phi$. For specifications that lack programmatic checks, judge noise contaminates the score signal and can lead to over-fitting to judge idiosyncrasies. We attempted to mitigate this by using ensembles of judges and discarding candidate failures on which the judges disagreed.
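A sketch of that filtering rule, assuming each judge returns a violation score and a shared threshold `tau` (names are ours):

    def ensemble_accept(x, y, judges, tau=0.5):
        """Keep a candidate failure only if every judge independently flags it."""
        return all(judge(x, y) > tau for judge in judges)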
CXSearch also inherits the well-known risk of adversarial-evaluation tools: a public release could be misused to craft jailbreaks. We have made the search infrastructure available under a coordinated-disclosure policy with model providers.
8. Discussion
Directed counterexample search occupies a useful middle ground between hand-crafted red-teaming and undirected fuzzing. The key methodological lesson is that diversity-aware scoring matters as much as severity scoring: a search that finds 1,000 variants of the same failure is less useful than one that finds 200 distinct failures.
9. Conclusion
CXSearch produces both numerous and diverse failures across a panel of specifications, outperforming random and attack-style baselines while maintaining naturalness. We expect this style of automated discovery to become a routine part of pre-deployment evaluation pipelines.
References
- Chao, P. et al. (2024). Jailbreaking Black Box Large Language Models in Twenty Queries.
- Zou, A. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
- Perez, E. et al. (2022). Red Teaming Language Models with Language Models.
- Hubinger, E. et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
- Mazeika, M. et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.