Cost-Per-Solved-Problem as a Unified Inference Metric for Reasoning Agents
1. Motivation
A leaderboard row that reads "Model X: 64.2% on MATH" hides much of what a deployer actually needs to know. Was that score obtained with greedy decoding (one sample) or with majority-of-256? Did it use a verifier? At what dollar cost per problem? Two systems with identical headline accuracy can differ by 10-30x in inference compute, and recent work has shown that scaling inference compute is a substantive lever in its own right [Snell et al. 2024, Brown et al. 2024].
We propose that reasoning-agent papers report Cost-Per-Solved-Problem (CPSP) as a primary metric alongside accuracy, and we develop the formalism to make CPSP comparisons rigorous.
2. Definition
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a benchmark with $N$ problems and ground-truth answers $\{y_i\}$. Let $\pi$ be an inference policy that produces an output $\pi(x)$ at expected cost $c(\pi, x)$ in USD. Let $A(\pi) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\mathbf{1}\{\pi(x) = y\}\big]$ be the accuracy and $C(\pi) = \mathbb{E}_{x \sim \mathcal{D}}\big[c(\pi, x)\big]$ the mean cost. Then

$$\mathrm{CPSP}(\pi) = \frac{C(\pi)}{A(\pi)}$$

is the expected dollar cost to obtain one correct solution.
This form has two attractive properties. First, it admits a meaningful comparison between systems that operate at different accuracy regimes. Second, under an additive cost model $C(\pi) = \sum_s C_s(\pi)$ over pipeline stages $s$, CPSP decomposes additively across stages, $\mathrm{CPSP}(\pi) = \sum_s C_s(\pi) / A(\pi)$, simplifying attribution.
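As a toy illustration of the definition and the stage-wise decomposition (the stage names and all numbers below are hypothetical, not measured values):

```python
# Minimal sketch of the CPSP definition and its per-stage decomposition.
stage_costs = {"generation": 0.018, "verification": 0.004}  # mean USD per problem
accuracy = 0.62  # fraction of problems solved

mean_cost = sum(stage_costs.values())
cpsp_total = mean_cost / accuracy
# Additive decomposition: each stage's contribution to total CPSP.
cpsp_by_stage = {stage: c / accuracy for stage, c in stage_costs.items()}

print(f"CPSP = ${cpsp_total:.4f} per solved problem")
print(cpsp_by_stage)  # stage contributions sum to cpsp_total
```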
3. CPSP for Common Inference Policies
3.1 Greedy / Single Sample
$$\mathrm{CPSP}_{\text{greedy}} = \frac{c_1}{a_1},$$

where $c_1$ is the mean cost of a single greedy sample and $a_1$ the single-sample accuracy.
3.2 Majority Vote of $K$ Samples
Let $p$ be the per-sample correctness probability, assuming an infinite candidate answer set whose mode is the true answer (so incorrect samples almost never coincide, and the true answer wins the plurality whenever it is drawn at least twice). The majority-vote accuracy with $K$ samples is approximated by

$$a_{\text{maj}}(K) \approx \Pr\big[\mathrm{Binomial}(K, p) \geq 2\big] = 1 - (1-p)^K - Kp(1-p)^{K-1},$$

and the cost scales as $C(K) = K c_1$. Thus

$$\mathrm{CPSP}_{\text{maj}}(K) = \frac{K c_1}{a_{\text{maj}}(K)}.$$
The minimizing $K^\ast$ depends on $p$: for example, $p = 0.2$ yields $K^\ast = 9$ under the approximation above. Note, however, that because $a_{\text{maj}}(K) \leq 1 - (1-p)^K < Kp$ for $K \geq 2$, no finite $K$ improves over the greedy CPSP $c_1 / a_1$: majority voting buys accuracy, not cost-efficiency per solved problem.
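A short sketch that reproduces this optimization under the approximation above (the value of $p$ and the unit cost $c_1$ are illustrative):

```python
# Locate the CPSP-minimizing sample count K* for majority voting, using
# a_maj(K) = 1 - (1-p)^K - K p (1-p)^(K-1) from above.
def a_maj(k, p):
    return 1 - (1 - p) ** k - k * p * (1 - p) ** (k - 1)

def cpsp_maj(k, p, c1=1.0):
    acc = a_maj(k, p)
    return float("inf") if acc <= 0 else k * c1 / acc

p = 0.2  # illustrative per-sample correctness probability
k_star = min(range(2, 257), key=lambda k: cpsp_maj(k, p))
print(k_star)                      # -> 9
print(cpsp_maj(k_star, p), 1 / p)  # majority CPSP (~16.0) vs greedy c1/p = 5.0
```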
3.3 Best-of-N with Verifier
If a verifier with sensitivity $s$ (probability of accepting a correct candidate) and specificity $t$ (probability of rejecting an incorrect one) scores candidates, and the policy returns the first accepted of $N$ candidates (abstaining, and thus failing, when none is accepted), the accuracy becomes

$$a_{\text{BoN}}(N) = \frac{ps}{ps + (1-p)(1-t)} \left(1 - \big(1 - ps - (1-p)(1-t)\big)^{N}\right).$$

The verifier itself contributes a per-problem cost of up to $N c_v$, where $c_v$ is the cost of scoring one candidate, so

$$\mathrm{CPSP}_{\text{BoN}}(N) = \frac{N(c_1 + c_v)}{a_{\text{BoN}}(N)}.$$
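A minimal sketch of this calculation under the first-accepted selection rule; the parameter values ($p$, $s$, $t$, and both costs) are illustrative, not measured:

```python
# Best-of-N with a binary verifier that accepts correct candidates w.p. s
# and (incorrectly) accepts wrong candidates w.p. 1 - t.
def cpsp_bon(n, p, s, t, c1=1.0, cv=0.1):
    alpha = p * s                # P(candidate correct and accepted)
    beta = (1 - p) * (1 - t)     # P(candidate incorrect but accepted)
    acc = alpha / (alpha + beta) * (1 - (1 - alpha - beta) ** n)
    return n * (c1 + cv) / acc   # upper bound: assumes all N candidates scored

for n in (1, 4, 16, 64):
    # Accuracy rises with N toward the verifier's precision, but under this
    # rule CPSP rises too: extra candidates buy accuracy, not cost-efficiency.
    print(n, round(cpsp_bon(n, p=0.3, s=0.9, t=0.8), 2))
```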
4. Empirical Study
We evaluated 14 publicly available reasoning systems across MATH, GPQA, ARC-AGI, BIG-Bench-Hard, and HumanEval, using vendor-published per-token prices and sweeping $K \in \{1, 4, 16, 64, 256\}$.
4.1 Headline Results
| System | Accuracy | CPSP (USD) at reported config | CPSP (USD) at CPSP-optimal config |
|---|---|---|---|
| LM-A | 0.622 | 0.034 | |
| LM-B | 0.605 | 0.010 | |
| LM-C | 0.681 | 0.118 | |
| Agent-D | 0.704 | 0.205 (verifier) | |
LM-B's accuracy is 1.7 points lower than LM-A's, but its CPSP is 3.4x lower. On a leaderboard sorted by accuracy, LM-B ranks 6th; sorted by CPSP, it ranks 1st.
Of the 14 systems, 6 changed rank by three or more positions when sorted by CPSP versus accuracy.
5. Implementation
```python
def cpsp(samples_per_problem, cost_per_sample, verdicts):
    """Cost-per-solved-problem over a benchmark run.

    samples_per_problem: number of samples drawn for each problem.
    cost_per_sample: mean USD cost of a single sample.
    verdicts: 0/1 correctness flag per problem.
    """
    n_correct = sum(verdicts)
    total_cost = sum(samples_per_problem) * cost_per_sample
    if n_correct == 0:
        return float("inf")  # nothing solved: CPSP diverges
    return total_cost / n_correct
```

We release a `cpsp` package that wraps a benchmark runner with a cost-model adaptor for major API providers and produces the accuracy-CPSP frontier with a single command.
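For example, calling the function above directly (the numbers are illustrative):

```python
# Three problems, four samples each at $0.002 per sample; two problems solved.
print(cpsp(samples_per_problem=[4, 4, 4],
           cost_per_sample=0.002,
           verdicts=[1, 0, 1]))  # -> 0.012 USD per solved problem
```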
6. Discussion
CPSP can be misleading at very low accuracies, since division by a small $\hat{A}$ amplifies noise. We recommend reporting confidence bands using a delta-method approximation for the ratio $\widehat{\mathrm{CPSP}} = \hat{C}/\hat{A}$:

$$\mathrm{Var}\big[\widehat{\mathrm{CPSP}}\big] \approx \frac{1}{N}\left(\frac{\sigma_C^2}{A^2} - \frac{2C\,\sigma_{CA}}{A^3} + \frac{C^2 \sigma_A^2}{A^4}\right),$$

where $\sigma_C^2$ and $\sigma_A^2$ are the per-problem variances of cost and correctness and $\sigma_{CA}$ is their covariance.
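A plug-in sketch of this band, assuming i.i.d. per-problem costs and 0/1 verdicts (the function name and the normal-approximation choice are ours):

```python
import math

def cpsp_ci(costs, verdicts, z=1.96):
    """Delta-method confidence band for CPSP = mean(cost) / mean(correct)."""
    n = len(verdicts)
    c = sum(costs) / n                            # mean per-problem cost
    a = sum(verdicts) / n                         # accuracy
    if a == 0:
        return float("inf"), float("inf")         # CPSP undefined: nothing solved
    var_c = sum((x - c) ** 2 for x in costs) / n  # per-problem cost variance
    var_a = a * (1 - a)                           # Bernoulli correctness variance
    cov = sum((x - c) * (v - a) for x, v in zip(costs, verdicts)) / n
    var = (var_c / a**2 - 2 * c * cov / a**3 + c**2 * var_a / a**4) / n
    half = z * math.sqrt(max(var, 0.0))
    return c / a - half, c / a + half
```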
Vendor pricing changes also affect rankings. We propose archiving the price table snapshot alongside reported CPSP values.
Finally, CPSP captures only the inference-time cost; training cost amortization is out of scope.
7. Limitations
We do not address latency, which is a separate axis some applications care about. Nor do we address cost variance across users — heavy-tail token distributions can make tail CPSP much higher than mean CPSP.
8. Conclusion
Reporting accuracy without cost paints an incomplete picture of the design choices that produced a leaderboard score. CPSP is one principled way to surface the trade-off, and the formalism extends naturally to multi-call agentic pipelines.
References
- Snell, C. et al. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
- Brown, B. et al. (2024). Large language monkeys: Scaling inference compute with repeated sampling.
- Hendrycks, D. et al. (2021). Measuring mathematical problem solving with the MATH dataset.
- Rein, D. et al. (2023). GPQA: A graduate-level Google-proof Q&A benchmark.
- Chollet, F. (2019). On the measure of intelligence.