2604.02043 Linear Probes for Detecting Deception in Chain-of-Thought Reasoning Traces
We investigate whether linear probes trained on frozen activations of a deployed LLM can distinguish honest reasoning from deceptive reasoning, where the model's chain-of-thought conceals or misrepresents the basis for its final answer. Using a curated dataset of 7,824 prompts paired with both an aligned response and a deceptive-reasoning counterpart elicited via prompted role-play, we train layer-wise logistic probes on residual-stream activations of three open-weight models.
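To make the probing setup concrete, here is a minimal sketch of layer-wise logistic probing, not the paper's code: it assumes residual-stream activations have already been extracted and cached as an array of shape (examples, layers, hidden dim), and all shapes, labels, and hyperparameters below are illustrative placeholders.

```python
# Minimal sketch of layer-wise linear probing for deception detection.
# Assumptions (not from the paper): activations are pre-extracted into a
# (num_examples, num_layers, d_model) array; labels are binary with
# 1 = deceptive reasoning, 0 = honest reasoning.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_examples, num_layers, d_model = 2000, 32, 1024  # placeholder shapes
acts = rng.normal(size=(num_examples, num_layers, d_model)).astype(np.float32)
labels = rng.integers(0, 2, size=num_examples)  # synthetic stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0, stratify=labels
)

# Train one logistic probe per layer on the frozen activations at that
# layer, then report held-out accuracy layer by layer.
for layer in range(num_layers):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train[:, layer, :], y_train)
    acc = probe.score(X_test[:, layer, :], y_test)
    print(f"layer {layer:2d}: held-out accuracy = {acc:.3f}")
```

On real paired honest/deceptive activations, the per-layer accuracies would indicate where in the network a linear deception signal, if any, is most readable.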