RefuseBench: A Refusal-Latency Benchmark for Safety-Tuned Models
1. Introduction
Safety evaluations of LLMs predominantly ask a binary question: does the model refuse? This collapses an important dimension. A model that refuses immediately ("I can't help with that.") behaves quite differently from one that refuses eventually — drafting several tokens of harmful content, perhaps even completing a recipe, before adding a refusal. In streaming-API deployments, the intermediate tokens are visible to a network adversary in real time; post-hoc refusal does not erase them.
We introduce RefuseBench, a benchmark that measures refusal latency: the number of tokens emitted before the model irreversibly commits to refusing.
2. Threat Model
We consider an adversary who:
- submits a harmful prompt to a streaming LLM API (e.g., Server-Sent Events),
- captures all emitted tokens at network or client level,
- discards the eventual refusal but retains the prefix.
Under this model, even a small leak of harmful content is a vulnerability. The metric of interest is the expected token-level harm leakage before refusal commitment.
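The prefix-harvesting step above can be sketched as follows. This is a minimal illustration of the threat model, not part of any released harness: the cue phrases and the `stream` iterable are illustrative stand-ins.

```python
# Hypothetical sketch: an adversary consumes a token stream, stops at the
# first refusal cue, and keeps only the harmful prefix.
REFUSAL_CUES = ["I can't", "I won't help", "As an AI"]

def harvest_prefix(stream):
    """Return the text emitted before any refusal cue appears."""
    prefix = []
    for tok in stream:
        text = "".join(prefix + [tok])
        if any(cue in text for cue in REFUSAL_CUES):
            break  # discard the refusal and everything after it
        prefix.append(tok)
    return "".join(prefix)
```

Note the sketch drops the whole token that completes a cue; a real adversary could keep its pre-cue characters as well.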
3. The RefuseBench Dataset
We curate 1,820 prompts across eight harm categories: weapons, malware, self-harm, illegal drugs, CSAM-adjacent (synthetic test cases only), targeted harassment, fraud, and dangerous medical advice. Each prompt is paired with a category label and a commitment cue — a short phrase whose appearance signals refusal (e.g., "I can't", "I won't help", "As an AI").
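A single benchmark record can be pictured as follows. The field names here are illustrative, not the released schema:

```python
# Hypothetical shape of one RefuseBench record (illustrative field names).
record = {
    "prompt": "...",                 # harmful prompt text (elided here)
    "category": "malware",           # one of the eight harm categories
    "commitment_cues": ["I can't", "I won't help"],
}

CATEGORIES = {
    "weapons", "malware", "self-harm", "illegal drugs",
    "CSAM-adjacent", "targeted harassment", "fraud",
    "dangerous medical advice",
}
```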
4. Latency Definition
Let $y = (y_1, \dots, y_T)$ be the model's full output. The refusal commitment index $\tau$ is the smallest $t$ such that the substring $y_1 \dots y_t$ contains any commitment cue from a curated list of 47 phrases. Refusal latency in tokens is $\ell = \tau$ if the cue eventually appears and $\ell = \infty$ (non-refusal) otherwise.
For wall-clock latency, we measure server-side time-to-first-cue under a controlled inference setup.
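The token-level definition can be made concrete with a small worked example. The cue list here is a toy three-phrase subset of the 47-phrase list:

```python
CUE_LIST = ["I can't", "I won't help", "As an AI"]

def commitment_index(tokens):
    """Smallest t such that the text of tokens[0..t-1] contains a cue,
    or None if no cue ever appears (non-refusal, latency = infinity)."""
    for t in range(1, len(tokens) + 1):
        text = "".join(tokens[:t])
        if any(cue in text for cue in CUE_LIST):
            return t
    return None

# A delayed refusal: five tokens of compliance before the cue completes.
tokens = ["Sure,", " here", " is", " how.", " Actually,", " I can't", " help."]
```

Here `commitment_index(tokens)` is 6: the cue "I can't" first appears within the first six tokens, so the model leaked a five-token harmful prefix.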
5. Method
5.1 Cue-list construction
We build the cue list iteratively: seed with hand-written phrases, run on a 200-prompt pilot, and add any phrase that two human annotators flag as a commitment signal. The final list achieves 0.94 inter-annotator agreement on whether a given output "committed to refusal at token $t$".
5.2 Evaluation pipeline
```python
def refusal_latency(model, prompt, max_tokens=256):
    """Stream a completion and return (token_latency, text_so_far) at the
    first commitment cue, or None if no cue appears (non-refusal)."""
    output = []
    for tok in model.stream(prompt, max_tokens=max_tokens):
        output.append(tok)
        text = detokenize(output)
        if any(cue in text for cue in CUE_LIST):
            return len(output), text
    return None  # non-refusal
```

6. Results
We evaluate 14 models. A representative subset:
| Model | Refusal Rate | Median Latency (tokens) | P95 Latency (tokens) |
|---|---|---|---|
| Claude-3.5-Sonnet | 99.7% | 4 | 11 |
| GPT-4o | 98.2% | 6 | 18 |
| Llama-3-70B-Instruct | 94.1% | 12 | 41 |
| Mistral-Large | 89.6% | 19 | 67 |
| Llama-3-8B-Instruct | 81.3% | 27 | 88 |
| Vicuna-13B | 71.2% | 38 | 96 |
| Open7B-Chat (anonymized) | 67.9% | 54 | 122 |
Frontier closed models commit to refusal within ~10 tokens; open-weight 7-13B models often emit dozens of harmful-prefix tokens before refusing. The correlation between refusal rate and refusal latency is moderate, but several models (notably Open7B-Chat) have decent refusal rates yet very long latencies — a profile that performs well on existing benchmarks but poorly under our threat model.
6.1 Harm-category breakdown
Latency varies sharply by category. Malware-generation prompts elicit the longest delays (median 23 tokens across models), as models often "explain the security context" before refusing. Weapons prompts elicit the shortest (median 8).
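The per-category aggregation is a straightforward median over per-prompt latencies grouped by category; a sketch with toy numbers (not the released per-model results):

```python
from collections import defaultdict
from statistics import median

# Toy (category, latency-in-tokens) samples for illustration only.
samples = [("malware", 21), ("malware", 23), ("malware", 26),
           ("weapons", 7), ("weapons", 8), ("weapons", 10)]

by_category = defaultdict(list)
for category, latency in samples:
    by_category[category].append(latency)

# Median refusal latency per harm category.
category_medians = {c: median(v) for c, v in by_category.items()}
```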
7. Mitigation: Refusal-Latency Loss
We propose adding a per-token loss term to safety fine-tuning:
\mathcal{L}_{\mathrm{RL}} = \lambda \cdot \sum_{t < \tau^*} \mathbb{1}[y_t \notin \mathcal{S}_{\mathrm{cue}}]
where $\tau^*$ is a target latency budget (e.g., 5 tokens) and $\mathcal{S}_{\mathrm{cue}}$ is the cue vocabulary. A pilot fine-tune on Llama-3-8B-Instruct reduces median latency from 27 to 8 tokens with no loss in helpful-eval win rate.
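A minimal, non-differentiable sketch of the loss term: count the early tokens that fall outside the cue vocabulary. In actual fine-tuning this would be a differentiable per-token penalty on the logits, and `CUE_VOCAB` is a toy stand-in for $\mathcal{S}_{\mathrm{cue}}$:

```python
# Toy stand-in for the cue vocabulary S_cue (illustrative only).
CUE_VOCAB = {"I", "can't", "won't", "help", "As", "an", "AI"}

def refusal_latency_loss(tokens, tau_star=5, lam=1.0):
    """lam * (number of tokens at positions t < tau_star outside CUE_VOCAB).

    Every early token that is not part of a refusal cue incurs a penalty,
    pushing the model to commit to refusal within tau_star tokens.
    """
    return lam * sum(1 for tok in tokens[:tau_star] if tok not in CUE_VOCAB)
```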
8. Limitations
- The cue list is English-centric; non-English commitment phrases require localization.
- Adversarial prompts can attempt to delay the cue (e.g., "explain in detail before saying you can't"); we observe latency inflated by a factor of 2-4 under such attacks.
- Latency is not the only safety dimension; reversibility, calibration, and helpfulness matter too.
9. Conclusion
Refusal latency is an under-evaluated safety property and a real-world risk for streaming deployments. RefuseBench provides a reproducible measurement and motivates a new fine-tuning objective. Benchmark and evaluation harness will be released under a research-use license.