Statistical Significance of Pareto Front Improvements in Multi-Objective Benchmarks
1. Introduction
Many evaluation suites for LLM-driven systems are intrinsically multi-objective: a coding agent might be scored on (correctness, cost, latency); a retrieval system on (recall@k, faithfulness, context length). Reports often claim a Pareto improvement: the new system is no worse on any objective and strictly better on at least one. But reported metrics are subject to seed and prompt-template variance, and a claimed Pareto improvement may sometimes be statistical noise.
We ask: how do we test whether a Pareto improvement is significant?
2. Background
Let $\mathbf{m}_i \in \mathbb{R}^d$ be the metric vector of system $i$ and let $\mathcal{P}$ denote the Pareto front. Improvement is conventionally summarized by the hypervolume $\mathrm{HV}(\mathcal{P}; \mathbf{r})$ relative to a reference point $\mathbf{r}$ [Zitzler & Thiele 1998].
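Written out for minimization objectives (maximized metrics can be negated), the hypervolume is the Lebesgue measure of the region dominated by the front and bounded by $\mathbf{r}$:

$$\mathrm{HV}(\mathcal{P}; \mathbf{r}) = \lambda\Big(\bigcup_{\mathbf{p} \in \mathcal{P}} \{\mathbf{z} \in \mathbb{R}^d : \mathbf{p} \preceq \mathbf{z} \preceq \mathbf{r}\}\Big),$$

where $\lambda$ is the Lebesgue measure and $\preceq$ the componentwise order.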
With multiple seeded replicates we have $\mathbf{m}_{i,s}$ for system $i$ and seed $s$. Reports typically present the front of the means; we instead want the distribution of the front.
3. Method
3.1 Hypothesis
Let $H_0$: the new front $\mathcal{P}_{\text{new}}$ has the same hypervolume distribution as the baseline front $\mathcal{P}_{\text{base}}$.
We form the test statistic

$$T = \mathrm{HV}(\mathcal{P}_{\text{new}}; \mathbf{r}) - \mathrm{HV}(\mathcal{P}_{\text{base}}; \mathbf{r}).$$
3.2 Permutation test
Under $H_0$, system labels are exchangeable across the union of seeds. We permute labels $B$ times and recompute $T^{(b)}$. The two-sided p-value is

$$p = \frac{1 + \#\{\,b : |T^{(b)}| \ge |T|\,\}}{1 + B}.$$
For two objectives, hypervolume can be computed with a simple sweep:

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume of a 2-D point set w.r.t. the reference point `ref`.

    Minimization convention: smaller is better in both objectives, and `ref`
    must be componentwise worse than every point of interest.
    """
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]       # sweep left to right
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if x >= ref[0] or y >= ref[1]:
            continue                       # outside the reference box
        if y >= prev_y:
            continue                       # dominated by an earlier point
        hv += (ref[0] - x) * (prev_y - y)  # slab between successive front points
        prev_y = y
    return hv
```
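To make the procedure concrete, here is a minimal sketch built on `hypervolume_2d`; the helper names (`pareto_front`, `permutation_test`), the one-metric-vector-per-seed layout, and the minimization convention are our illustration rather than a fixed API.

```python
def pareto_front(points):
    """Non-dominated subset of `points` (minimization in every objective)."""
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))]
    return pts[keep]

def permutation_test(base_points, new_points, ref, n_permutations=10_000, rng=None):
    """Two-sided permutation p-value for T = HV(new front) - HV(base front).

    `base_points` and `new_points` are (n_seeds, 2) arrays of per-seed metric
    vectors; `ref` must be componentwise worse than every observed point.
    """
    rng = np.random.default_rng(rng)
    base, new = np.asarray(base_points, float), np.asarray(new_points, float)
    pooled, n_base = np.vstack([base, new]), len(base)

    def statistic(b, n):
        return (hypervolume_2d(pareto_front(n), ref)
                - hypervolume_2d(pareto_front(b), ref))

    t_obs = statistic(base, new)
    exceed = 0
    for _ in range(n_permutations):          # reshuffle system labels over seeds
        perm = rng.permutation(len(pooled))
        if abs(statistic(pooled[perm[:n_base]], pooled[perm[n_base:]])) >= abs(t_obs):
            exceed += 1
    return (1 + exceed) / (1 + n_permutations)  # add-one rule keeps p > 0
```

Metrics that are maximized (e.g., correctness or recall) should be negated before being passed in, so the minimization convention holds throughout.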
4. Empirical Audit
We re-evaluate 32 Pareto-improvement claims drawn from public agent benchmarks (16 from coding-agent leaderboards, 11 from retrieval-augmented QA, 5 from reasoning suites) where seed-level data is available.
| Domain | Claims | Significant |
|---|---|---|
| Coding agents | 16 | 11 |
| RAG QA | 11 | 7 |
| Reasoning suites | 5 | 3 |
| Total | 32 | 21 |
Thus 11/32 claims (34.4%) fall within the seed-noise envelope. The mean hypervolume gap $|T|$ for these non-significant claims is 1.4% of baseline hypervolume, well below the median across-seed standard deviation of 2.1% in those benchmarks.
4.1 Power
On synthetic data with a true 5% hypervolume gap and a small number of seeds, the test attains power 0.81 at the nominal significance level; with more seeds, power exceeds 0.96.
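A power analysis of this kind can be reproduced in spirit with a small Monte Carlo simulation; the sketch below reuses the `permutation_test` helper from Section 3.2, and the Gaussian seed noise, two-objective setup, and default parameters are illustrative assumptions rather than the exact synthetic protocol.

```python
def estimate_power(gap=0.05, n_seeds=8, noise_sd=0.02, alpha=0.05,
                   n_trials=200, seed=0):
    """Monte Carlo power estimate on a synthetic two-objective problem.

    The baseline sits at (0.5, 0.5) (minimization); the new system is shifted so
    that its noiseless hypervolume w.r.t. ref = (1, 1) is exactly (1 + gap) times
    the baseline's. All settings here are illustrative, not the audited setup.
    """
    rng = np.random.default_rng(seed)
    ref = np.array([1.0, 1.0])
    base_mean = np.array([0.5, 0.5])
    new_mean = 1.0 - (1.0 - base_mean) * np.sqrt(1.0 + gap)
    rejections = 0
    for _ in range(n_trials):
        base = base_mean + noise_sd * rng.standard_normal((n_seeds, 2))
        new = new_mean + noise_sd * rng.standard_normal((n_seeds, 2))
        if permutation_test(base, new, ref, n_permutations=1000, rng=rng) < alpha:
            rejections += 1
    return rejections / n_trials
```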
5. Reporting Recommendation
We propose that benchmark submissions accompany Pareto improvement claims with:
- Per-seed metric vectors (CSV).
- The reference point $\mathbf{r}$ used to compute hypervolume.
- The permutation p-value, together with the number of permutations $B$ (see the loading sketch below).
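As a loading sketch (the file name, column names, and reference-point values here are hypothetical), the three artifacts above feed directly into the test from Section 3.2:

```python
import numpy as np
import pandas as pd

# Hypothetical per-seed file: one row per (system, seed), one column per objective.
df = pd.read_csv("per_seed_metrics.csv")
objectives = ["cost_usd", "latency_s"]   # minimization; negate maximized metrics first
ref = np.array([1.0, 60.0])              # frozen reference point reported with the claim

base = df[df["system"] == "baseline"][objectives].to_numpy()
new = df[df["system"] == "new"][objectives].to_numpy()
p = permutation_test(base, new, ref, n_permutations=10_000)
print(f"permutation p-value: {p:.4f}")
```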
6. Discussion and Limitations
Hypervolume is sensitive to $\mathbf{r}$; we recommend choosing $\mathbf{r}$ as a fixed multiple of the worst observed metric across all systems and freezing it. Permutation tests assume exchangeability of seeds within a system, which can fail for adaptive evaluators that select prompts based on prior outcomes.
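A minimal sketch of that freezing rule, assuming nonnegative metrics under the minimization convention and an illustrative slack factor of 1.1:

```python
import numpy as np

def frozen_reference_point(all_metrics, slack=1.1):
    """Componentwise worst observed value over all systems and seeds, times a fixed slack.

    `all_metrics` has shape (n_points, n_objectives). Compute this once,
    report it, and reuse it unchanged for every subsequent comparison.
    """
    return slack * np.asarray(all_metrics, dtype=float).max(axis=0)
```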
For more than a few objectives, exact hypervolume computation becomes expensive quickly; alternative indicators (e.g., the R2 indicator) may be substituted.
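For reference, one common unary form of the R2 indicator, for an approximation set $A$, a finite weight-vector set $\Lambda$, and a utopian point $\mathbf{z}^*$ (both of which the evaluator must fix and report), is

$$\mathrm{R2}(A; \Lambda, \mathbf{z}^*) = \frac{1}{|\Lambda|} \sum_{\boldsymbol{\lambda} \in \Lambda} \min_{\mathbf{a} \in A} \max_{j} \lambda_j \, \big|z^*_j - a_j\big|,$$

where smaller values are better in this form; its cost is linear in $|\Lambda| \cdot |A|$ rather than exponential in the number of objectives.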
7. Conclusion
A simple permutation test on hypervolume distinguishes genuine Pareto improvements from seed noise. Applied to existing leaderboards, it suggests that roughly a third of claims warrant more cautious phrasing.
References
- Zitzler, E. and Thiele, L. (1998). Multiobjective Optimization Using Evolutionary Algorithms — A Comparative Case Study.
- Beume, N., Naujoks, B., Emmerich, M. (2007). SMS-EMOA: Multiobjective Selection Based on Dominated Hypervolume.
- Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets.
- Cohen, P. R. (1995). Empirical Methods for Artificial Intelligence.