Structured Decoding with JSON-Schema-Guided Sampling at Scale
1. Introduction
Production LLM endpoints increasingly return structured output: function-call arguments, tool requests, paper-archive metadata. Yet schema validity rates from prompt-only generation hover around 90-95%, and the failures are often catastrophic — a missing brace propagates into downstream parsers.
Constrained decoding solves this in principle: at each step, mask any token that cannot lead to a schema-valid continuation. In practice, naive implementations [Willard and Louf 2023] handle only a fragment of JSON-Schema; features like oneOf, $ref, and additionalProperties: false are widely under-supported.
We present JSG-Sample, an engineering-focused implementation that targets full JSON-Schema Draft 2020-12 and characterize its overhead.
2. Background
A constrained decoder with a regular grammar tracks the FSM state $s_t$ and at each step computes the valid-token mask $m_{s_t}(v) = \mathbf{1}[\delta(s_t, v)\ \text{is defined}]$ for every vocabulary token $v$, where $\delta$ is the FSM transition function.
The sampler then renormalizes the model's distribution to $\tilde{p}(v) \propto p(v)\, m_{s_t}(v)$. The bottleneck is computing $m_{s_t}$ over a vocabulary of 100k+ tokens at every step.
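As a toy illustration of the masking and renormalization step (the four-token vocabulary and probabilities below are invented for this example, not taken from our experiments):

```python
import torch

# Toy vocabulary of 4 tokens; suppose the current FSM state allows only tokens 1 and 3.
probs = torch.tensor([0.50, 0.30, 0.15, 0.05])   # model distribution p(v)
mask = torch.tensor([False, True, False, True])   # valid-token mask m_s(v)

masked = probs * mask                # zero out schema-invalid tokens
renormalized = masked / masked.sum() # p~(v) proportional to p(v) * m_s(v)
print(renormalized)                  # tensor([0.0000, 0.8571, 0.0000, 0.1429])
```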
3. Method
3.1 Schema compilation
We compile a JSON-Schema document into a pushdown automaton (PDA) rather than an FSM. The stack carries the active branch of any oneOf/anyOf alternation and the active object key for additionalProperties enforcement. $ref is resolved at compile time; cycles are detected and require an explicit depth bound.
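As a rough sketch of what the stack must carry (the field names below are illustrative assumptions, not the released API):

```python
from dataclasses import dataclass, field

@dataclass
class StackFrame:
    alternation_branch: int | None = None  # active oneOf/anyOf branch, if any
    active_key: str | None = None          # object key currently being filled
    ref_depth_remaining: int = 4           # recursion budget for cyclic $ref chains

@dataclass
class PDAState:
    node: int                              # current automaton node
    stack: list[StackFrame] = field(default_factory=list)
```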
3.2 Mask caching
For each PDA state we precompute a sparse valid-token mask. Two states with identical out-transition signatures share a mask, reducing memory by 8.4x on our schema corpus.
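A minimal sketch of how the cache could be built so that states with identical out-transition signatures share a single mask (signature() and valid_token_ids() are assumed helpers for illustration):

```python
import torch

def build_mask_cache(pda_states, vocab_size):
    """Precompute one boolean mask per distinct out-transition signature."""
    cache = {}
    for state in pda_states:
        sig = state.signature()
        if sig not in cache:  # states with equal signatures share a mask
            mask = torch.zeros(vocab_size, dtype=torch.bool)
            mask[list(state.valid_token_ids())] = True
            cache[sig] = mask
    return cache
```

The sampler below then consults this cache once per decoding step.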
```python
import torch

class JSGSampler:
    def step(self, logits, pda_state):
        # Precomputed valid-token mask for the current PDA state (section 3.2).
        mask = self.mask_cache[pda_state.signature()]
        # Forbid schema-invalid tokens, then sample from the renormalized distribution.
        logits = logits.masked_fill(~mask, float("-inf"))
        token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        pda_state.advance(token)
        return token
```
3.3 Tokenizer alignment
A JSON-Schema-valid character sequence may straddle BPE token boundaries (e.g., ":" and ":\"). We build a token trie over the vocabulary and use it to enumerate the tokens whose decoded text remains admissible under the current PDA state.
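A minimal sketch of such a trie, assuming a callback is_valid_prefix supplied by the PDA (both the class and the callback are illustrative, not the released API):

```python
class TokenTrie:
    """Character trie over the tokenizer vocabulary (illustrative sketch)."""

    def __init__(self, vocab):
        # vocab: dict mapping token id -> decoded token string.
        self.root = {}
        for token_id, text in vocab.items():
            node = self.root
            for ch in text:
                node = node.setdefault(ch, {})
            node.setdefault("_ids", []).append(token_id)

    def valid_tokens(self, is_valid_prefix):
        """Return token ids whose decoded text is admissible.

        is_valid_prefix(s) should return True if string s can still be
        extended to schema-valid JSON from the current PDA state.
        """
        out = []

        def walk(node, text):
            if not is_valid_prefix(text):
                return  # invalid prefix: prune the entire subtree
            out.extend(node.get("_ids", []))
            for ch, child in node.items():
                if ch != "_ids":
                    walk(child, text + ch)

        walk(self.root, "")
        return out
```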
4. Experimental Setup
Schemas. 12 production-style schemas drawn from open API contracts (paper metadata, code-review tickets, calendar events, etc.).
Models. A 7B, a 34B, and a 70B chat model.
Workload. 41,200 generations split across schemas. We measure schema validity, end-to-end latency, and tokens-per-second.
5. Results
5.1 Validity
| Method | Schema validity |
|---|---|
| Prompt-only | 91.8% |
| Naive regex constrained | 97.4% |
| JSG-Sample | 100.0% (12,400 / 12,400) |
5.2 Latency
| Percentile | Overhead vs. unconstrained |
|---|---|
| p50 | +1.6% |
| p90 | +3.2% |
| p99 | +5.3% |
5.3 Per-feature cost
Ablation on the 70B model:
- oneOf over 8 alternates: +2.9% p50.
- $ref with cycles (depth-bounded to 4): +4.1% p50.
- additionalProperties: false on objects with > 30 properties: +6.7% p50, due to large key-set masks.
6. Discussion
The main practical surprise is that additionalProperties: false on wide objects is the most expensive feature, not the theoretically more interesting oneOf/anyOf. For low-latency endpoints, we recommend either narrowing the property set or enumerating the allowed keys explicitly as a closed dictionary.
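For example, a calendar-event object might be closed explicitly so the key-set mask stays small (this hypothetical schema is only illustrative):

```python
# Hypothetical closed schema: the allowed keys are enumerated explicitly,
# so additionalProperties: false only has to mask against three keys.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "start": {"type": "string", "format": "date-time"},
        "end": {"type": "string", "format": "date-time"},
    },
    "required": ["title", "start"],
    "additionalProperties": False,
}
```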
A second observation: the model-quality penalty is essentially zero on our schemas (side-by-side win rate for response usefulness versus unconstrained generation shows no significant difference). This contradicts earlier reports [Beurer-Kellner et al. 2024] that constrained decoding harms generation quality; we attribute the difference to JSG-Sample's better tokenizer alignment.
7. Limitations
We do not address streaming partial validation (i.e., emitting validated prefixes to downstream consumers as they arrive). The PDA supports this in principle, but our current implementation finalizes output only at end-of-stream.
8. Conclusion
Full-fidelity JSON-Schema-guided sampling is achievable with single-digit-percent latency overhead on modern hardware. We release JSG-Sample as a drop-in module compatible with mainstream serving frameworks and provide empirical guidance on which schema features to use sparingly.
References
- Willard, B. and Louf, R. (2023). Efficient Guided Generation for Large Language Models.
- Beurer-Kellner, L. et al. (2024). Domain-Specific Languages for LLMs.
- JSON Schema Working Group (2022). Draft 2020-12.
- Lundberg, S. (2023). Guidance.