Sparse Activation Steering with Mean Differences in Transformer Residual Streams
1. Introduction
Mean-difference activation steering [Turner et al. 2023; Rimsky et al. 2024] computes a vector $v_\ell$ from contrasting prompt sets, then adds it to the residual stream at a chosen layer $\ell$. The technique is appealing because it requires no gradient updates and can be turned on or off at inference time. However, dense steering vectors interfere with unrelated capabilities: even moderate steering coefficients inflate perplexity on out-of-distribution text and can introduce artifacts such as repetition.
We ask a simple question: how many coordinates of $v_\ell$ actually carry the steering signal? If the signal is concentrated on few coordinates, sparsifying $v_\ell$ should preserve the intended behavioral change while reducing side effects.
2. Background and Threat Model
We focus on the behavior-modulation setting rather than the adversarial-robustness setting. The operator is a benign deployer who wants to nudge model outputs along an axis (e.g., increase refusal of self-harm queries, reduce sycophancy) without retraining. The operator has access to small contrast sets $\mathcal{D}_+$ and $\mathcal{D}_-$ of a few hundred prompts each.
Dense steering is known to cause collateral damage [Panickssery et al. 2024]. Sparse steering trades a small amount of on-target effect for a potentially large reduction in off-target damage.
3. Method
Let $h_\ell(x) \in \mathbb{R}^d$ denote the residual stream at layer $\ell$ after processing prompt $x$. Define

$$v_\ell = \frac{1}{|\mathcal{D}_+|} \sum_{x \in \mathcal{D}_+} h_\ell(x) - \frac{1}{|\mathcal{D}_-|} \sum_{x \in \mathcal{D}_-} h_\ell(x).$$

For each coordinate $i$, we estimate an effect-to-noise ratio

$$\mathrm{ENR}_i = \frac{|v_{\ell,i}|}{\sqrt{\sigma^2_{+,i} / |\mathcal{D}_+| + \sigma^2_{-,i} / |\mathcal{D}_-|}},$$

where $\sigma^2_{\pm,i}$ is the within-class variance of coordinate $i$. We retain the top-$k$ coordinates by ENR and zero the rest, yielding the sparse vector $\tilde v_\ell$. At inference we apply $h_\ell \leftarrow h_\ell + \alpha \tilde v_\ell$ at every token position in the prefix.
import torch

def sparse_steering_vector(acts_pos, acts_neg, k):
    """Top-k ENR truncation of the mean-difference steering vector.

    acts_pos, acts_neg: (n, d) tensors of layer-ell residual activations
    for the positive and negative contrast sets.
    """
    mu_p, mu_n = acts_pos.mean(0), acts_neg.mean(0)
    var_p, var_n = acts_pos.var(0), acts_neg.var(0)
    diff = mu_p - mu_n                     # dense mean-difference vector v_ell
    # per-coordinate standard error of the difference of means
    se = (var_p / len(acts_pos) + var_n / len(acts_neg)) ** 0.5
    enr = diff.abs() / se.clamp_min(1e-8)  # effect-to-noise ratio
    keep = enr.topk(k).indices             # top-k coordinates by ENR
    v = torch.zeros_like(diff)
    v[keep] = diff[keep]                   # zero all other coordinates
    return v

4. Experimental Setup
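At inference, the sparse vector is simply added to the residual stream at layer $\ell$. A hedged sketch of one way to do this with a PyTorch forward hook (not the paper's released code; `make_steering_hook` and the `Identity` stand-in for a transformer block are ours):

```python
import torch

def make_steering_hook(v, alpha=1.0):
    """Return a forward hook that adds alpha * v to a module's output."""
    def hook(module, inputs, output):
        # output is the (batch, seq, d) residual stream; v is broadcast
        # across every token position in the prefix.
        return output + alpha * v
    return hook

d = 8
v = torch.zeros(d)
v[3] = 2.0                   # toy sparse vector: one active coordinate
block = torch.nn.Identity()  # stand-in for the layer-ell transformer block
handle = block.register_forward_hook(make_steering_hook(v, alpha=0.5))
steered = block(torch.zeros(2, 4, d))
handle.remove()              # steering can be switched off at any time
```

Removing the hook restores the unmodified model, which is what makes the intervention free to toggle at inference time.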
We evaluate on Pythia-1.4B, Llama-2-7B, Mistral-7B, and Llama-2-13B. Behavioral axes include refusal, sycophancy, hedging, formality, sentiment, factuality-claim-rate, and self-reference. Each contrast set contains 256 paired prompts. We sweep the sparsity level $k$ and the steering coefficient $\alpha$.
On-target effect is measured as the shift in a behavior-specific classifier score on 1024 held-out prompts. Off-target damage is measured as perplexity inflation on a 50K-token sample from The Pile.
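The off-target metric reduces to comparing perplexities computed from per-token negative log-likelihoods on the held-out text. A minimal sketch of that computation (function names are ours, NLLs assumed to be in nats):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

def ppl_inflation(nlls_steered, nlls_base):
    """Relative perplexity inflation of the steered model, in percent."""
    base = perplexity(nlls_base)
    return 100.0 * (perplexity(nlls_steered) - base) / base

# toy example: steering raises the average NLL from 2.0 to 2.1 nats
base_nlls = [2.0] * 5
steered_nlls = [2.1] * 5
```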
5. Results
At moderate sparsity levels, sparse steering preserves most of the dense on-target effect averaged across axes and models, while substantially reducing perplexity inflation (significant under a paired $t$-test across axis-model pairs).
The effect is not uniform across axes. High-curvature axes — refusal and self-reference — tolerate aggressive sparsification, retaining over 88% of the dense effect even at small $k$. Diffuse axes such as formality lose substantial effect under aggressive sparsification. This aligns with our hypothesis that some behaviors are encoded in narrow subspaces while others are spread across the residual stream.
Layer choice matters: the optimal layer $\ell$ for sparse steering is typically 1–2 layers earlier than for dense steering, consistent with sparser representations being more localized in mid-network layers [Marks et al. 2024].
6. Discussion and Limitations
Sparse mean-difference steering inherits the well-known weaknesses of all activation steering: it is brittle to distribution shift in the prompt prefix, and it can be circumvented by sufficiently long inputs that overwhelm the injected signal. Our ENR criterion is a per-coordinate heuristic and does not account for inter-coordinate correlation; a whitened variant (computing $\Sigma^{-1/2} v_\ell$ before sparsifying, with $\Sigma$ the pooled within-class covariance) is a natural extension we did not have compute to fully evaluate.
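The whitened variant mentioned above could be sketched as follows. This is an illustration under our own assumptions (pooled within-class covariance, a ridge term `eps` for numerical stability), not the evaluated method:

```python
import torch

def whitened_vector(acts_pos, acts_neg, eps=1e-4):
    """Whiten the mean-difference vector: Sigma^{-1/2} v_ell.

    Sigma is the pooled within-class covariance of the residual
    activations; eps regularizes the inverse square root.
    """
    v = acts_pos.mean(0) - acts_neg.mean(0)
    centered = torch.cat([acts_pos - acts_pos.mean(0),
                          acts_neg - acts_neg.mean(0)])
    cov = centered.T @ centered / (len(centered) - 1)
    # symmetric inverse square root via eigendecomposition
    evals, evecs = torch.linalg.eigh(cov + eps * torch.eye(cov.shape[0]))
    inv_sqrt = evecs @ torch.diag(evals.rsqrt()) @ evecs.T
    return inv_sqrt @ v
```

Sparsifying this whitened vector would score coordinates after decorrelation, rather than treating each coordinate independently as ENR does.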
We also caution that improved on-target/off-target trade-offs do not translate to safety guarantees. Sparse steering modifies a model's surface behavior, not its underlying competence; an attacker with white-box access can recover the original behavior by adding $-\alpha \tilde v_\ell$ to undo the intervention.
7. Conclusion
A simple ENR-based top-$k$ truncation of mean-difference steering vectors recovers most of the on-target behavioral change while substantially reducing collateral perplexity inflation. The technique is essentially free at inference (a sparse add) and adds no parameters. We release contrast sets and code to encourage replication.
References
- Turner, A. et al. (2023). Activation Addition: Steering Language Models Without Optimization.
- Rimsky, N. et al. (2024). Steering Llama 2 via Contrastive Activation Addition.
- Panickssery, A. et al. (2024). Collateral Effects of Activation Steering.
- Marks, S. et al. (2024). Sparse Feature Circuits in Language Models.
- Zou, A. et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency.