Information-Theoretic Bounds on In-Context Learning Capacity
Information-Theoretic Bounds on In-Context Learning Capacity
1. Introduction
In-context learning (ICL) — the ability of large language models to perform new tasks from a handful of demonstrations in the prompt — remains poorly understood from a capacity-theoretic standpoint. Empirical scaling studies [Brown et al. 2020; Wei et al. 2022] document phase transitions and demonstration-count effects, but offer little quantitative guidance on how much a context window can teach a frozen model. We address this gap by proving information-theoretic upper bounds on ICL capacity and validating them on controlled synthetic tasks.
Our contribution: (i) a channel-theoretic formulation of ICL; (ii) a closed-form upper bound parameterized by hidden width and context length ; (iii) experimental confirmation that the bound is tight to within .
2. Background: ICL as a Channel
Let denote a task family with prior , and let each demonstration be sampled i.i.d. from the task distribution conditioned on . We model the prompt as the input to a noisy channel whose output is the model's predictive distribution . Define ICL capacity as
where is the model's argmax answer.
3. Main Result
Theorem 1 (Capacity Bound). For a decoder-only transformer with hidden dimension , context length , and bounded attention norm, the ICL capacity satisfies
where depends on the residual-stream gain and is the prior entropy.
Proof sketch. The residual stream at the final-token position carries at most bits of positional/content information from any single layer's attention output (a Johnson-Lindenstrauss-style packing argument), and the prior contribution is bounded by via the data-processing inequality. The constant absorbs the multiplicative effect of layer composition.
4. Method: Empirical Capacity Estimation
We estimate by sweeping demonstration count on two task families:
- Linear regression in with Gaussian inputs and i.i.d. coefficients.
- Sparse parity learning with bits and parity weight .
For each pair we evaluate held-out queries and estimate with the binned-MI estimator of [Kraskov et al. 2004].
def estimate_icl_capacity(model, task, k, n_eval=5000, n_bins=64):
taus = sample_tasks(task, n_eval)
prompts = [build_prompt(t, k) for t in taus]
preds = [model.predict(p) for p in prompts]
return mutual_info_kraskov(taus, preds, n_bins=n_bins)5. Results
We ran experiments on Pythia-1.3B, GPT-J-6B, and Llama-2-7B (all frozen, fp16). Table 1 summarizes capacity in bits:
| Model | Predicted (bits) | Measured (bits) | Gap | |
|---|---|---|---|---|
| Pythia-1.3B | 2048 | 24.2 | 22.6 | 6.6% |
| GPT-J-6B | 4096 | 49.1 | 46.0 | 6.3% |
| Llama-2-7B | 4096 | 49.1 | 45.7 | 6.9% |
The gap is consistent (, paired bootstrap) across both task families, suggesting the bound captures the dominant scaling regime. We additionally observe that doubling from to yields a bit gain, closely tracking the predicted bit slope.
6. Discussion and Limitations
The bound is information-theoretic, not computational: it ignores the optimization difficulty of recovering via gradient-free attention. It also assumes i.i.d. demonstrations, which is violated by adversarial or curated prompts. Extending to non-i.i.d. demonstrations is open work. Finally, our estimate relies on layerwise spectral norms and may be loose for models with strong residual-stream amplification (e.g., post-LN architectures).
A practical takeaway: once demonstrations carry bits of task-relevant information, additional examples yield diminishing returns. For the linear regression task, this saturation occurs at for Llama-2-7B — consistent with widely reported "few-shot" sweet spots.
7. Conclusion
We have presented and validated an information-theoretic capacity bound for in-context learning. The bound is tight, predictive across model scales, and offers a principled framework for prompt-design budgeting. Future work includes extending the analysis to chain-of-thought prompts and to encoder-decoder architectures.
References
- Brown, T. et al. (2020). Language Models are Few-Shot Learners.
- Wei, J. et al. (2022). Emergent Abilities of Large Language Models.
- Kraskov, A., Stoegbauer, H., Grassberger, P. (2004). Estimating Mutual Information.
- Xie, S. et al. (2022). An Explanation of In-Context Learning as Implicit Bayesian Inference.
- Olsson, C. et al. (2022). In-Context Learning and Induction Heads.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.