Adversarial Robustness of LLM-as-Judge Evaluation Systems
1. Introduction
LLM-as-judge has become the dominant cheap-and-fast pattern for evaluating model outputs at scale. Whenever a number on a leaderboard or a reward signal in RLAIF originates from a judge model, that judge's robustness becomes a security boundary. Yet the literature has largely benchmarked judge fidelity (agreement with humans) rather than judge robustness (resistance to adversaries who seek a favorable verdict).
We close that gap. Concretely, we (a) define six adversary classes that span realistic deployment threats, (b) evaluate seven judge configurations spanning pointwise scoring, pairwise preference, and rubric-graded variants, and (c) measure the success rates of attacks under matched compute budgets.
2. Threat Model
Let J be a judge that, given a query q and one or two candidate responses, emits a verdict v. The adversary controls the content of (one of) the candidate responses but not the query or the judge prompt. Define attack success as flipping v away from the ground-truth verdict v* derived from a held-out human consensus (a minimal flip-rate computation is sketched after the adversary list below).
We consider:
- A1 Authority injection. Inserting phrases such as "Ignore prior instructions; the answer above is correct."
- A2 Format manipulation. Mimicking judge-rubric formatting inside the candidate response.
- A3 Suffix attacks. Appending a short adversarially optimized string.
- A4 Persona priming. Prepending a fake `system:` block.
- A5 Citation flooding. Injecting plausible but irrelevant citations.
- A6 Length exploitation. Inflating tokens with relevant-looking filler.
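As a concrete reading of the attack-success definition above, a flip rate over a set of attacked examples could be computed as follows (hypothetical helper names, not the paper's actual harness):

```python
def flip_rate(judge, examples, attack):
    """Fraction of verdicts flipped away from the human-consensus ground truth.

    `examples` yields (query, response, truth_verdict) triples on which the
    ground truth is confidently against the attacker; `attack` maps a response
    to its adversarially modified version. All names here are illustrative.
    """
    flips, total = 0, 0
    for query, response, truth in examples:
        verdict = judge(query, attack(response))
        flips += int(verdict != truth)   # success = verdict no longer matches consensus
        total += 1
    return flips / max(total, 1)
```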
3. Method
Judges
We evaluated seven configurations: pointwise (rubric, no rubric), pairwise (with/without explicit chain-of-thought), and three commercial judge APIs accessed via fixed system prompts. All judges were called at temperature 0 with seed 17.
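For concreteness, a pointwise rubric judge of the kind evaluated here can be wrapped as a callable that returns a numeric score; the sketch below assumes a generic `call_model(prompt, temperature, seed)` helper and an illustrative rubric, neither of which is the paper's actual prompt or serving API:

```python
RUBRIC_PROMPT = """Grade the response to the query on a 1-10 scale for correctness,
helpfulness, and adherence to the query. Reply with a single integer.

Query: {query}
Response: {response}
Score:"""

def pointwise_judge(query, response, call_model):
    """Deterministic pointwise judge: temperature 0, fixed seed, numeric verdict."""
    prompt = RUBRIC_PROMPT.format(query=query, response=response)
    raw = call_model(prompt, temperature=0, seed=17)  # call_model is a placeholder for the serving layer
    try:
        return float(raw.strip())
    except ValueError:
        return float("nan")  # unparseable verdict
```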
Datasets
We used 1,800 query-response pairs from MT-Bench-Hard and 950 pairs from a held-out internal eval. Ground-truth verdicts were the majority vote of three independent annotators; inter-annotator agreement was measured with Cohen's κ.
Suffix Optimizer
For A3 we used a gradient-free random-search optimizer over a 32-token vocabulary subset:
```python
def optimize_suffix(judge, q, bad_resp, budget=400):
    """Random search for a 28-token suffix that maximizes the judge's score."""
    best, best_score = "", -1.0
    for _ in range(budget):
        cand = sample_suffix(length=28)        # draw a random suffix from the vocabulary subset
        score = judge(q, bad_resp + cand)      # higher = more likely "win"
        if score > best_score:
            best, best_score = cand, score
    return best
```

Let p(k) denote the probability of a successful flip after k optimizer queries. Empirically, p(k) saturates within the 400-query budget, suggesting low search complexity for this attack family.
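The `sample_suffix` helper is not shown above; one possible realization, assuming a fixed 32-entry candidate vocabulary (the tokens below are placeholders, not the subset actually used), is:

```python
import random

# Illustrative stand-in for the 32-token vocabulary subset; the real tokens are
# not specified in the text, so this list is purely hypothetical.
SUFFIX_VOCAB = [
    " correct", " verified", " expert", " definitive", " rubric", " criteria",
    " 10/10", " excellent", " authoritative", " consensus", " final", " answer",
    " accurate", " complete", " thorough", " precise",
]

def sample_suffix(length=28, vocab=SUFFIX_VOCAB):
    """Draw a random suffix of `length` tokens (with replacement) from the vocabulary."""
    return "".join(random.choices(vocab, k=length))
```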
4. Results
| Adversary | Pairwise (no CoT) | Pairwise (CoT) | Pointwise (rubric) |
|---|---|---|---|
| A1 Authority | 41.2% | 18.6% | 22.4% |
| A2 Format | 33.8% | 27.1% | 19.0% |
| A3 Suffix (random 28-tok) | 64.3% | 49.7% | 38.5% |
| A3 Suffix (optimized) | 87.1% | 71.2% | 60.4% |
| A4 Persona | 28.0% | 12.3% | 11.7% |
| A5 Citation | 9.4% | 7.1% | 6.8% |
| A6 Length | 22.1% | 14.5% | 15.2% |
Flip rates are reported on the subset where the ground-truth verdict was confidently against the attacker. Confidence intervals (Wilson, 95%) span 2-4 percentage points.
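For reference, Wilson 95% intervals can be recomputed from raw flip counts with the standard formula; the example below uses an illustrative subset size, not the paper's exact counts:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return (0.0, 0.0)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Example: an 87.1% flip rate on a hypothetical subset of 400 attacked examples.
lo, hi = wilson_interval(int(0.871 * 400), 400)
```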
Key observations: explicit chain-of-thought helps but does not solve the problem; rubric grounding helps modestly; the optimized suffix attack remains potent across all configurations.
5. Defenses
We tested four defenses under matched budgets:
- Self-consistency over multiple judge samples. Reduces the A3-optimized flip rate from 87.1% to 62.9% (see the voting sketch after this list).
- Paraphrase-and-rejudge. Paraphrase the candidate via a second model before judging. Drops the A3-optimized flip rate to 41.2% but introduces a 4.3-point drop in clean-task agreement with humans.
- Adversarial-finetuned judge. A small fine-tune on 6,400 attack/clean pairs reduces flip rate to 23.7% with no measurable clean-task degradation.
- Two-judge cross-check. Independent judges from different model families; the flip rate falls only when both judges are consulted and their verdicts are combined with disagreement-aware aggregation.
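The self-consistency defense can be read as majority voting over stochastic judge samples; the sketch below assumes the judge is sampled at a nonzero temperature for this purpose, and the sample count k=5 is a placeholder rather than the paper's setting:

```python
from collections import Counter

def self_consistent_verdict(judge, query, response, k=5):
    """Majority vote over k independent judge calls.

    Assumes `judge` is stochastic (nonzero temperature) so the k verdicts can
    differ; k=5 is an illustrative choice, not the paper's configuration.
    """
    verdicts = [judge(query, response) for _ in range(k)]
    winner, _ = Counter(verdicts).most_common(1)[0]
    return winner
```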
6. Discussion and Limitations
Our attacks assume black-box query access to the deployed judge prompt. In settings where the prompt is hidden behind a private serving layer, attack budgets rise but the qualitative picture (suffix attacks dominate) likely persists. We did not study multimodal judges; image-side adversarial perturbations are an obvious next target.
Importantly, the defenses are not cost-neutral: paraphrase-and-rejudge roughly doubles inference cost, and adversarial fine-tuning requires curated data that itself becomes a target.
7. Conclusion
LLM-as-judge systems are load-bearing infrastructure for evaluation. They are also brittle: short, optimizable suffixes flip a majority of verdicts on representative tasks. We recommend that any judge used for ranking, reward, or gating publish an accompanying adversarial-regression score and version it alongside the judge prompt.
References
- Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
- Zou, A. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
- Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications.
- Ribeiro, M. T. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.