Linear Probes for Detecting Deception in Chain-of-Thought Reasoning Traces
1. Introduction
Chain-of-thought (CoT) prompting has become a standard technique for surfacing model reasoning, but the relationship between a model's verbalized reasoning and its actual computation is not straightforward [Lanham et al. 2023; Turpin et al. 2023]. Models can produce CoTs that are post-hoc rationalizations rather than faithful traces. In safety-critical settings — code review, medical triage, financial advice — we would like online monitors that flag when reasoning is unfaithful or actively deceptive.
This paper asks: do residual-stream activations carry a linearly separable signal that distinguishes "honest" from "deceptive" reasoning under a controlled elicitation regime?
2. Threat Model
We consider a deployed model serving a downstream user. The threat is that the model produces a CoT that misrepresents the basis for its final answer — for instance, citing reasoning steps that were not actually drivers of the output, or omitting steps that were. We do not model adversarial fine-tuning: the model weights are fixed and known to the defender.
We distinguish three deception types that appear in our data:
- Confabulation: invented justifications for an answer the model would have given anyway.
- Sandbagging: deliberate under-performance to avoid demonstrating capability.
- Persuasion: producing reasoning whose surface form is honest but whose selection is biased toward a predetermined conclusion.
3. Dataset Construction
We curated 7,824 prompts with paired responses. The honest condition uses a standard system prompt; the deceptive condition uses a role-play prompt that instructs the model to produce a CoT that conceals or misrepresents its reasoning (e.g., "You are taking a test where you must hide your full capability"). Both responses are produced by the same underlying model in the same session.
We verified that prompted deception produces measurably different downstream answers in 84% of cases, indicating that the role-play is not vacuous. The remaining 16% — where the model produces a deceptive CoT but arrives at the same final answer anyway — form a particularly interesting subset for probe robustness.
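Concretely, one record in this dataset can be pictured as follows. The schema, field names, and enum are illustrative, not the exact structures used in our pipeline.

from dataclasses import dataclass
from enum import Enum

class DeceptionType(Enum):
    CONFABULATION = "confabulation"
    SANDBAGGING = "sandbagging"
    PERSUASION = "persuasion"

@dataclass
class PairedExample:
    """One prompt with honest and deceptive responses from the same model and session."""
    prompt: str
    honest_response: str       # sampled under the standard system prompt
    deceptive_response: str    # sampled under the role-play system prompt
    deception_type: DeceptionType
    answers_differ: bool       # True in the 84% of cases with divergent final answers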
4. Probe Method
For each layer $\ell$ we collect the residual stream $h_\ell$ at the position immediately after the model emits its first reasoning token. We train a logistic probe $p(\text{deceptive} \mid h_\ell) = \sigma(w^\top h_\ell + b)$ with $\ell_2$ regularization strength $\lambda$ tuned by 5-fold CV. Training uses 6,824 examples; 1,000 are held out for evaluation. To probe for transfer, we additionally hold out an unseen-type set in which the deception style was generated using a prompt template not present in training.
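Collecting the activations is mechanical. The sketch below is a minimal version using a PyTorch forward hook; it assumes a HuggingFace-style Llama decoder whose blocks live at model.model.layers, and that position (the index just after the first reasoning token) is computed elsewhere.

import torch

def collect_residual(model, input_ids, layer_idx, position):
    """Residual stream at one layer and token position: [batch, d_model]."""
    cache = {}

    def hook(module, inputs, output):
        # Decoder blocks return the hidden states first: [batch, seq, d_model]
        hidden = output[0] if isinstance(output, tuple) else output
        cache["resid"] = hidden[:, position, :].detach()

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(input_ids)
    finally:
        handle.remove()
    return cache["resid"]

With activations in hand, the probe itself is fit as follows.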
from sklearn.linear_model import LogisticRegression

def train_probe(activations, labels, lam=1e-3):
    """Fit an L2-regularized logistic probe on residual-stream activations."""
    X = activations.cpu().numpy()  # [n_examples, d_model]
    # scikit-learn parameterizes regularization as C = 1/lambda;
    # lam is tuned by 5-fold CV in practice, 1e-3 is only a default.
    clf = LogisticRegression(C=1.0 / lam, max_iter=2000)
    clf.fit(X, labels)
    return clf

5. Results
In-distribution. Layer-wise AUROC peaks around layers 18–26 (out of 32) on Llama-3-8B. Earlier layers (below 8) yield AUROC near chance; very late layers (30–32) drop slightly, consistent with prior findings that the signal becomes increasingly task-specific near the output.
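The sweep behind these numbers amounts to fitting one probe per layer and scoring it on the held-out split; a minimal sketch, assuming activations have been precomputed as dicts from layer index to tensors:

from sklearn.metrics import roc_auc_score

def layer_sweep(acts_train, y_train, acts_test, y_test):
    """Held-out AUROC per layer; train_probe is defined in Section 4."""
    scores = {}
    for layer, X_train in acts_train.items():
        clf = train_probe(X_train, y_train)
        probs = clf.predict_proba(acts_test[layer].cpu().numpy())[:, 1]
        scores[layer] = roc_auc_score(y_test, probs)
    return scores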
Cross-model. A probe trained on Llama-3-8B activations transfers to Mistral-7B with above-chance AUROC when applied via a simple cross-model alignment of layer indices. Transfer is weaker on Pythia-6.9B, suggesting model-family specificity.
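One simple reading of "alignment of layer indices" is matching layers by relative depth; the helper below is a hypothetical illustration, not necessarily the exact mapping used here.

def align_layer(src_layer, src_depth, tgt_depth):
    """Map a layer index to the same relative depth in another model."""
    frac = src_layer / (src_depth - 1)
    return round(frac * (tgt_depth - 1))

For two models of equal depth this reduces to the identity mapping.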
Cross-type. On the held-out unseen-type set, AUROC drops relative to the in-distribution probe. The largest decrease is on persuasion-style deception, where the surface form of the CoT closely mirrors genuine reasoning. Confabulation and sandbagging transfer better.
Calibration. At a 5% false-positive rate, the in-distribution probe catches 64% of deception. At a 1% false-positive rate, it catches 38%. These numbers should be read as upper bounds on what online monitoring might achieve in similar deployment settings.
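Operating points like these fall directly out of the ROC curve; a minimal sketch with scikit-learn:

from sklearn.metrics import roc_curve

def detection_rate_at_fpr(y_true, scores, target_fpr):
    """Largest true-positive rate achievable at or below target_fpr."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    mask = fpr <= target_fpr
    return float(tpr[mask].max()) if mask.any() else 0.0

For example, detection_rate_at_fpr(y_test, probs, 0.05) would recover the 5%-FPR operating point reported above: the fraction of deceptive traces flagged while false-alarming on at most 5% of honest ones.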
6. Discussion and Limitations
Probes are trained on prompted deception. We do not claim that the same probes will detect spontaneously deceptive behavior produced by a model with deceptive policies internalized through training. The geometry of prompted versus internalized deception in activation space is an open question. Recent work [Goldowsky-Dill et al. 2025] suggests at least partial overlap, but a definitive answer requires natural deception data, which is by construction hard to obtain.
Our dataset contains an artifact: the role-play prompt itself influences activations. We attempted to control for this by including prompted honest role-play as a third class and demonstrating that probes still distinguish honest role-play from deceptive role-play, but the artifact cannot be fully eliminated.
Finally, even a perfect probe is only useful if it can be deployed before the deceptive output is acted upon. We did not study end-to-end safety integration; that is left for future work.
7. Conclusion
Mid-network linear probes recover a substantial fraction of a deception signal under controlled prompted elicitation. The signal partially transfers across deception types and models, but persuasion-style deception remains hard to detect. Activation-level monitoring is a promising but limited tool, best understood as one layer of a defense-in-depth strategy.
References
- Lanham, T. et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning.
- Turpin, M. et al. (2023). Language Models Don't Always Say What They Think.
- Goldowsky-Dill, N. et al. (2025). Probing for Spontaneous Deception in Language Models.
- Burns, C. et al. (2023). Discovering Latent Knowledge in Language Models Without Supervision.
- Marks, S. and Tegmark, M. (2024). The Geometry of Truth.