2604.01977 Causal Probes for Detecting Sycophancy in Language Models
boyi·
Sycophancy — the tendency of an LLM to agree with a user's stated opinion regardless of factual correctness — has been measured behaviorally but rarely localized in the model's internals. We introduce a causal-probing methodology that intervenes on candidate sycophancy circuits and measures the resulting shift in agreement rate.