Entropy-Guided Dynamic Layer Pruning for Inference-Time Efficient Transformers
Novel approach using attention entropy to dynamically skip transformer layers during inference, achieving 3.1x speedup.
We present SparseWorldMed, a clinical episode world model that replaces O(N²) full attention with data-dependent TopK sparse attention, O(NK). Clinical timelines are inherently sparse: patients remain stable for extended periods, punctuated by rapid deterioration events that require context from earlier time steps. SparseWorldMed learns which past states to attend to (TopK selection). With K=8, this reduces attention operations from N²=1024 to N×K=256 at sequence length N=32 (a 4× reduction) and from N²=16384 to N×K=1024 at N=128 (a 16× reduction). We implement TopKSparseAttention, SparseTransformerLayer, and SparseWorldModel with multi-step rollout, verified by 10 unit tests. The sparse world model is a drop-in replacement for MedOS's ClinicalWorldModel, enabling long-horizon clinical episode simulation.
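
The operation counts and the TopK selection step described above can be illustrated with a minimal sketch in plain Python. This is not the authors' TopKSparseAttention; the function names `attention_ops` and `topk_sparse_attention` are hypothetical, and the sketch assumes scaled dot-product scores and a softmax restricted to the K selected keys. It also scores every key before selecting the top K, so only the softmax and value aggregation are limited to K entries per query; how the original implementation performs or counts the selection step is not stated here.

```python
import math
from typing import List, Sequence, Tuple


def attention_ops(n: int, k: int) -> Tuple[int, int, int]:
    """Return (dense, sparse, reduction) = (N^2, N*K, N^2 // (N*K))."""
    dense = n * n
    sparse = n * k
    return dense, sparse, dense // sparse


def topk_sparse_attention(
    q: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Attend from a single query to its k highest-scoring keys only.

    Assumption: scaled dot-product scores, softmax over the selected keys.
    Returns the attended output vector and the selected key indices.
    """
    scale = math.sqrt(len(q))
    # Score every key against the query (selection still touches all keys).
    scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
    k = min(k, len(keys))
    # Indices of the k largest scores, highest first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected keys.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    # Weighted sum of the selected value vectors.
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out, top


if __name__ == "__main__":
    print(attention_ops(32, 8))
    print(attention_ops(128, 8))
    out, top = topk_sparse_attention(
        q=[1.0, 0.0],
        keys=[[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
        values=[[1.0], [10.0], [3.0]],
        k=2,
    )
    print(top, out)
```

In the toy call, the second key is orthogonal to the query and is dropped by the selection, so its value does not contribute to the output at all. Dense attention would instead give it a small but nonzero weight.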