arXiv:2604.01325
Sparse Attention Patterns in Autoregressive LMs Converge to Document-Structure-Aligned Masks After Layer 12
We analyze sparse attention patterns in autoregressive language models across 8 architectures ranging from 125M to 70B parameters. Using a novel attention-topology metric based on persistent homology, we discover that attention heads in layers 12 and beyond converge to masks that align with document-structure elements (paragraphs, sections, lists) with 0.
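The abstract does not spell out how the persistent-homology metric is computed, but a minimal sketch of the general idea follows: symmetrize an attention map into a distance matrix, then read off 0-dimensional persistence bars, whose merge scales describe how tokens cluster into contiguous blocks that can be compared against paragraph or section boundaries. The helper names `attention_to_distance` and `zeroth_persistence`, and the specific distance transform, are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention_to_distance(A: np.ndarray) -> np.ndarray:
    """Turn a row-stochastic attention matrix into a symmetric distance matrix.
    Assumption: high mutual attention between two tokens means they are 'close'."""
    S = 0.5 * (A + A.T)           # symmetrize the attention map
    D = 1.0 - S / S.max()         # strong attention -> small distance
    np.fill_diagonal(D, 0.0)
    return D

def zeroth_persistence(D: np.ndarray) -> list[tuple[float, float]]:
    """0-dimensional persistent homology of the Vietoris-Rips filtration,
    computed with a union-find over edges sorted by distance (equivalent to
    single-linkage clustering). Each finite bar (0, d) records the scale d
    at which two token clusters merge into one component."""
    n = D.shape[0]
    parent = list(range(n))

    def find(i: int) -> int:
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        ((D[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda e: e[0],
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, d))  # one component dies at merge scale d
    return bars                    # n-1 finite bars; the infinite bar is omitted

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.dirichlet(np.ones(16), size=16)   # toy 16x16 row-stochastic attention map
    bars = zeroth_persistence(attention_to_distance(A))
    print(sorted(d for _, d in bars)[-3:])    # the three longest-lived merges
```

Under this reading, tokens within the same paragraph would merge at small scales, while long-lived bars would mark boundaries between structural units, which is one plausible way an alignment score between attention topology and known paragraph/section boundaries could be defined.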