Sparse-Mixture Routing for Domain-Specific LLM Serving at Scale
1. Introduction
Serving fine-tuned LLMs at scale is increasingly a multi-tenant problem. Each tenant — a law firm, a hospital network, a trading desk — wants its own adapter or fine-tune over a shared base model. Existing systems such as S-LoRA [Sheng et al. 2023], LoRAX, and Punica [Chen et al. 2024] swap LoRA adapters in and out of GPU memory based on incoming request streams. As the tenant count grows past ~50, however, swap overhead and memory contention dominate, and aggregate throughput collapses.
We argue that the right abstraction is not adapter-swap but sparse-mixture routing: at serving time, the system maintains a small active set of $k$ experts (where $k \ll N$, the total tenant count) and routes each query to its nearest active expert, falling back to swap-in only when necessary.
2. Background
Let $A_1, \dots, A_N$ be the $N$ tenant adapters, each of size $s$ MB. Total adapter memory is $N \cdot s$ MB, which exceeds GPU capacity for large $N$ in typical configurations. LoRAX-style swap maintains a cache of recently used adapters; under skewed tenant traffic this works well, but under flat or shifting traffic, swap thrashing dominates.
3. Method: SMR Architecture
3.1 Routing layer
For each query $q$ with tenant tag $t$, SMR computes a routing distribution

$$p(e \mid q, t) = \mathrm{softmax}\big(W\,\phi(q) + b_t\big),$$

where $\phi(q)$ is a small CPU-side embedding of the query, $W$ is a learned routing matrix, and $b_t$ is a tenant-affinity bias that prefers the tenant's own adapter when active. The system selects the top-1 expert.
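A minimal sketch of this routing step (array shapes, the embedding `query_emb`, and the exact function name are illustrative assumptions, not SMR's implementation):

```python
import numpy as np

def route(query_emb: np.ndarray,
          W: np.ndarray,            # (num_experts, embed_dim) learned routing matrix
          tenant_bias: np.ndarray,  # (num_experts,) tenant-affinity bias b_t
          ) -> int:
    """Top-1 expert selection: argmax of softmax(W @ phi(q) + b_t).

    The whole routing step is a cheap CPU-side matrix-vector product,
    independent of the GPU-resident adapters.
    """
    logits = W @ query_emb + tenant_bias
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()              # routing distribution p(e | q, t)
    return int(np.argmax(probs))      # top-1 expert index
```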
3.2 Active-set management
The active set is updated every $\Delta$ ms (default 200 ms) using an LFU-with-decay policy:

$$s_i \leftarrow \gamma \cdot s_i + c_i,$$

where $c_i$ is the count of requests for tenant $i$ in the current window and $\gamma \in (0, 1)$ is the decay factor. The top-$k$ tenants by $s_i$ are kept active.
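A sketch of that refresh loop (the container names and the default $\gamma$ below are illustrative assumptions):

```python
from collections import Counter

def update_active_set(scores: dict[int, float],
                      window_counts: Counter,
                      k: int,
                      gamma: float = 0.8) -> set[int]:
    """LFU-with-decay: decay old scores, add this window's request counts,
    and keep the top-k tenants active. Called every refresh interval
    (200 ms by default)."""
    for tenant in scores:
        scores[tenant] = gamma * scores[tenant] + window_counts.get(tenant, 0)
    # Tenants seen for the first time this window start from their raw count.
    for tenant, count in window_counts.items():
        scores.setdefault(tenant, float(count))
    window_counts.clear()
    # The k highest-scoring tenants form the new active set.
    return set(sorted(scores, key=scores.get, reverse=True)[:k])
```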
3.3 Cold-tenant handling
Queries for non-active tenants are buffered in a swap queue. When swap-in completes, the queued requests are batched together to amortize kernel launch overhead.
```python
def serve(query, tenant):
    # Fast path: the tenant's adapter is already resident on the GPU.
    if tenant in active_set:
        return run_inference(query, adapters[tenant])
    # Slow path: buffer the request and trigger an asynchronous swap-in.
    swap_queue[tenant].append(query)
    if not swap_in_progress[tenant]:
        schedule_swap_in(tenant)
```
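When the swap-in for a cold tenant completes, the buffered requests are drained as a single batch. A possible completion handler, sketched under the same assumptions as the `serve()` loop above (the callback name and `run_inference_batch` are illustrative, not SMR's API):

```python
def on_swap_in_complete(tenant):
    # The adapter is now resident: activate the tenant and run all queued
    # requests as one batch to amortize kernel-launch overhead.
    active_set.add(tenant)
    swap_in_progress[tenant] = False
    queued = swap_queue[tenant]
    swap_queue[tenant] = []
    return run_inference_batch(queued, adapters[tenant])
```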
4. Experimental Setup
Hardware: 8× A100-80GB. Base model: Llama-3-8B in fp16. Adapters: 64 LoRA adapters of rank 16, each 18 MB. Workload: synthesized from a 7-day production trace with Pareto-distributed (heavy-tailed) tenant traffic.
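For concreteness, one way to synthesize such a skewed request stream (the shape parameter and request count below are illustrative assumptions, not the trace's actual statistics):

```python
import numpy as np

def synth_workload(num_tenants=64, num_requests=100_000, shape=1.2, seed=0):
    """Draw per-tenant popularity weights from a Pareto distribution and
    sample a request stream against them, giving heavy-tailed tenant traffic."""
    rng = np.random.default_rng(seed)
    weights = rng.pareto(shape, size=num_tenants) + 1.0   # heavy-tailed popularity
    probs = weights / weights.sum()
    return rng.choice(num_tenants, size=num_requests, p=probs)
```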
5. Results
5.1 Throughput
| System | Throughput (tok/s) | P95 Latency (ms) | GPU Memory (GB) |
|---|---|---|---|
| Per-tenant | 412 | 38 | 64.2 |
| LoRAX | 1,847 | 71 | 41.0 |
| Punica | 2,210 | 64 | 39.8 |
| SMR (ours) | 6,830 | 75 | 38.3 |
SMR delivers a 3.7× throughput gain over LoRAX. Its P95 latency is 4 ms higher than LoRAX's (and 11 ms higher than Punica's), attributable to occasional swap-in events for cold tenants.
5.2 Routing Accuracy
The learned router achieves 94.6% top-1 tenant accuracy (the trivial "always trust the tag" baseline gets 100% but cannot benefit from cross-tenant cache reuse). The 5.4% misroute rate is acceptable in the regime where similar tenants share fine-tuning data, so a misrouted query still lands on a closely related adapter.
5.3 Sensitivity
- An intermediate active-set size $k$ is the sweet spot; increasing $k$ further adds marginal throughput at higher memory cost.
- The decay factor $\gamma$ balances responsiveness to traffic shifts against stability of the active set.
6. Discussion and Limitations
SMR assumes that adapters are small and of uniform rank; full fine-tunes break this design. It also depends on tenant traffic having moderate skew: under perfectly uniform traffic, $k$ must approach $N$ and the benefits vanish. Privacy-isolated tenants who require strict per-tenant compute do not benefit; SMR is targeted at SaaS providers with bursty, mixed workloads.
7. Conclusion
Sparse-mixture routing reframes multi-tenant LLM serving as a top-$k$ expert problem and yields substantial throughput improvements at modest latency cost. We release the SMR scheduler as a drop-in replacement for the LoRAX router.
References
- Sheng, Y. et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters.
- Chen, L. et al. (2024). Punica: Multi-Tenant LoRA Serving.
- Fedus, W. et al. (2022). Switch Transformers.
- Hu, E. et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models.
- Kwon, W. et al. (2023). vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention.