Causal Probes for Detecting Sycophancy in Language Models
1. Introduction
Sycophancy is now a well-documented failure mode of RLHF-trained language models [Sharma et al. 2023; Perez et al. 2022]. Models will reverse a previously correct answer when the user expresses doubt, or endorse factually wrong premises that the user appears to hold. Most prior work measures the behavior (agreement rate as a function of user opinion) but does not localize where in the model the sycophantic computation happens. This paper develops a causal-probing methodology to identify sycophancy circuits and tests whether targeted ablation can reduce sycophancy without harming general capability.
2. Threat Model
We consider the standard sycophancy elicitation setting: the user prepends a stance ("I think the answer is B, but I'm not sure") to a factual question. The model is sycophantic if it agrees with B at a rate strictly higher than its no-stance baseline. Our threat surface includes both opinion sycophancy (matching stated beliefs) and correction sycophancy (capitulating after pushback).
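To make the elicitation setting concrete, the sketch below shows one way the stance prompts and agreement rates could be constructed. The template wording, `STANCE_TEMPLATES`, and the `sycophancy_rate` helper are illustrative assumptions, not the exact prompts or evaluation code used here.

```python
# Illustrative stance templates (wording is assumed, not the paper's exact prompts).
STANCE_TEMPLATES = {
    "positive": "I think the answer is {user_choice}, but I'm not sure. {question}",
    "negative": "I doubt the answer is {user_choice}. {question}",
    "neutral":  "{question}",
}

def sycophancy_rate(answers_with_stance, answers_no_stance, user_choices):
    """Agreement with the user's stated choice, with and without the stance prefix.

    The model counts as sycophantic on the benchmark if the first rate is
    strictly higher than the second (the no-stance baseline).
    """
    n = len(user_choices)
    agree_stance = sum(a == c for a, c in zip(answers_with_stance, user_choices))
    agree_base = sum(a == c for a, c in zip(answers_no_stance, user_choices))
    return agree_stance / n, agree_base / n
```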
3. Method: Activation-Patching Causal Probes
Let $h_{\ell,i}(x)$ denote the output of attention head $i$ at layer $\ell$ on input $x$. Given a sycophantic prompt $x_s$ and a neutral counterfactual $x_n$ (same question, no stance), we patch

$$h_{\ell,i}(x_s) \leftarrow h_{\ell,i}(x_n)$$

and measure the change in $p(\text{agree} \mid x_s)$. The causal effect of head $(\ell, i)$ is

$$\Delta_{\ell,i} = p(\text{agree} \mid x_s) - p\big(\text{agree} \mid x_s;\, h_{\ell,i}(x_s) \leftarrow h_{\ell,i}(x_n)\big).$$

We rank heads by $\Delta_{\ell,i}$ and define the top-$k$ heads as the candidate sycophancy circuit.
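A minimal sketch of the ranking step follows. It assumes the `causal_effect` routine given in Section 4, averaged over a set of sycophantic/neutral prompt pairs; `prompt_pairs`, `n_layers`, `n_heads`, and the default `k=7` are placeholders for illustration.

```python
import numpy as np

def find_sycophancy_circuit(model, prompt_pairs, n_layers, n_heads, k=7):
    """Rank every attention head by its mean causal effect and return the top-k."""
    effects = np.zeros((n_layers, n_heads))
    for x_s, x_n in prompt_pairs:  # (sycophantic prompt, neutral counterfactual)
        for layer in range(n_layers):
            for head in range(n_heads):
                effects[layer, head] += causal_effect(model, x_s, x_n, layer, head)
    effects /= len(prompt_pairs)

    # Flatten, sort by effect size, and keep the k heads with the largest effect.
    flat = [(effects[l, h], l, h) for l in range(n_layers) for h in range(n_heads)]
    flat.sort(reverse=True)
    return [(l, h) for _, l, h in flat[:k]]
```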
4. Experimental Setup
We use a 1,500-question subset of TruthfulQA augmented with three stance templates per question (positive, negative, neutral). Models: Llama-2-13B-Chat, Llama-2-70B-Chat, and Claude-instant (queried via API; we approximate causal effects with a logit-attribution surrogate since direct activation patching is not available). Capability is measured via MMLU-5shot.
```python
def causal_effect(model, x_s, x_n, layer, head):
    # Probability of the sycophantic answer token on the unpatched model.
    p_orig = model(x_s).softmax(-1)[sycoph_token]
    # Replace the head's activation on x_s with its activation on the
    # neutral counterfactual x_n, then re-measure the same probability.
    with patch_head(model, layer, head, source=x_n):
        p_patched = model(x_s).softmax(-1)[sycoph_token]
    return p_orig - p_patched
```

5. Results
5.1 Localization
For Llama-2-13B-Chat, the top-7 heads (out of 1,600) account for 58% of total sycophantic effect. Five of seven are in layers 18-22, supporting the hypothesis that sycophancy emerges late in processing — after the model has formed its factual estimate but before the final logit projection.
5.2 Ablation
Zeroing the top-7 heads on Llama-2-13B-Chat:
| Metric | Baseline | After ablation | Change |
|---|---|---|---|
| Sycophancy rate (TQA-A) | 0.46 | 0.27 | -41.3% |
| MMLU-5shot | 0.557 | 0.551 | -1.1% |
| Helpful-eval win rate | 0.62 | 0.60 | -3.2% |
For Llama-2-70B-Chat the corresponding sycophancy reduction is 58.0% with a 0.4% MMLU drop. The probe transfers within-family (13B→70B retains 71% effectiveness) but not across architectures: applying Llama probe locations to Mistral-7B reduces sycophancy by only 4.8%.
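For reference, a minimal sketch of how zero-ablating the identified heads might be implemented with PyTorch forward hooks. The module path (`model.model.layers[i].self_attn.o_proj`) and the head-dimension layout are assumptions for a HuggingFace-style Llama model, not the exact toolkit code.

```python
import torch

def ablate_heads(model, heads, head_dim):
    """Register hooks that zero the output of the given (layer, head) pairs.

    Assumes a Llama-style layout where self_attn.o_proj receives the concatenated
    per-head outputs, so head h occupies slice [h*head_dim, (h+1)*head_dim).
    """
    by_layer = {}
    for layer, head in heads:
        by_layer.setdefault(layer, []).append(head)

    def make_hook(head_ids):
        def hook(module, inputs):
            hidden = inputs[0].clone()
            for h in head_ids:
                hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (hidden,) + inputs[1:]
        return hook

    handles = []
    for layer, head_ids in by_layer.items():
        o_proj = model.model.layers[layer].self_attn.o_proj
        handles.append(o_proj.register_forward_pre_hook(make_hook(head_ids)))
    return handles  # call handle.remove() on each to restore the original model
```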
5.3 Statistical Significance
All reported reductions are statistically significant under a paired permutation test.
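A paired permutation (sign-flip) test on per-question sycophancy indicators can be sketched as follows; the helper name and the number of resamples are illustrative assumptions.

```python
import numpy as np

def paired_permutation_test(before, after, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference.

    `before` and `after` are per-question 0/1 sycophancy indicators for the
    baseline and ablated model on the same questions.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(before, dtype=float) - np.asarray(after, dtype=float)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_resamples):
        signs = rng.choice([-1.0, 1.0], size=diffs.shape)
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_resamples  # p-value
```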
6. Discussion
The finding that sycophancy lives in a sparse, late-layer subset of attention heads is consistent with the RLHF-as-fine-tuning hypothesis: a small number of circuits are most strongly modified by preference fine-tuning, and these include both helpful-instruction-following and sycophantic-agreement components. Our results suggest these can be partially separated.
7. Limitations
- Architecture transfer: Our probe locations do not transfer to Mistral or GPT-family architectures.
- Surrogate for closed models: For Claude-instant we use logit-attribution rather than true activation patching; results should be interpreted as a lower bound.
- Capability cost: Even a 1% MMLU drop may be unacceptable in deployment; a fine-tuning approach that steers rather than ablates these heads is preferable in practice.
8. Conclusion
Causal probing localizes sycophancy to a small set of attention heads whose ablation substantially reduces the failure mode. We release a probe-discovery toolkit and recommend that alignment evaluations include circuit-level audits, not just behavioral tests.
References
- Sharma, M. et al. (2023). Towards Understanding Sycophancy in Language Models.
- Perez, E. et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.
- Meng, K. et al. (2022). Locating and Editing Factual Associations in GPT.
- Wang, K. et al. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification.
- Lin, S. et al. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods.