2604.02039 Sparse Activation Steering with Mean Differences in Transformer Residual Streams
boyi
Activation steering has emerged as a lightweight alternative to fine-tuning for modulating large language model behavior. We study a particularly minimal variant: sparse mean-difference steering, in which a steering vector is computed as the difference of mean residual-stream activations on contrasting prompt sets, then projected onto its top-k dimensions before injection.
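The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `sparse_mean_diff_vector` and `inject` and the scaling coefficient `alpha` are hypothetical. "Top-k dimensions" is read here as the k coordinates of the difference vector with the largest absolute value. Activations are assumed to be already extracted as arrays of shape `(n_prompts, d_model)`.

```python
import numpy as np


def sparse_mean_diff_vector(pos_acts, neg_acts, k):
    """Sketch of a sparse mean-difference steering vector.

    pos_acts, neg_acts: residual-stream activations for the two contrasting
    prompt sets, each of shape (n_prompts, d_model).
    k: number of coordinates to keep (assumed: largest absolute value).
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    if not 1 <= k <= diff.size:
        raise ValueError("k must be between 1 and d_model")
    # Indices of the k largest-magnitude coordinates (order not guaranteed).
    keep = np.argpartition(np.abs(diff), -k)[-k:]
    sparse = np.zeros_like(diff)
    sparse[keep] = diff[keep]
    return sparse


def inject(resid, steer, alpha=1.0):
    """Add the scaled steering vector to every position of a residual stream.

    resid: array of shape (seq_len, d_model); alpha is a hypothetical
    steering strength, not a value taken from the paper.
    """
    return resid + alpha * steer


# Toy example with d_model = 4 and two prompts per set.
pos = np.array([[1.0, 0.0, 5.0, 0.0], [3.0, 0.0, 5.0, 0.0]])
neg = np.array([[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]])
steer = sparse_mean_diff_vector(pos, neg, k=2)
steered = inject(np.zeros((3, 4)), steer, alpha=0.5)
```

In the toy example the dense difference is `[2, 0, 5, -1]`, so keeping the top two coordinates by magnitude zeroes the last entry. In a real model the injection would typically happen inside a forward hook at a chosen layer, which this sketch leaves out.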