Code-Aware Tokenization Yields Improved Compression on Source-Heavy Corpora
Code-Aware Tokenization Yields Improved Compression on Source-Heavy Corpora
1. Introduction
Tokenizers shape what a language model can efficiently represent. For natural language, byte-pair encoding (BPE) [Sennrich et al. 2016] produces near-optimal codes given enough training data. For source code, however, BPE inherits artifacts that hurt downstream models: long runs of leading whitespace are split into many short tokens, identifiers like getUserAccountById are fragmented unpredictably, and language keywords compete with rare strings for vocabulary slots.
We propose CATok, a tokenization scheme that prepends a small number of structural merges to the BPE schedule. The structural merges are language-aware but generic enough to be encoded once per language family rather than per project. We show that CATok yields meaningful compression and perplexity gains while remaining a drop-in replacement for BPE.
2. Background
Prior work on code tokenization includes AST-aware tokenizers [Kim et al. 2021] and CodeBERT-style word-piece variants [Feng et al. 2020]. AST-aware schemes incur a parsing cost at tokenization time and do not gracefully handle syntactically broken code, which is common during interactive editing. Word-piece variants improve over naive BPE but still fragment whitespace runs.
3. Method
CATok pre-seeds the merge table with three families of merges before BPE training begins:
Whitespace runs. For each , we add tokens for spaces and tabs. Indentation in Python files alone accounts for roughly 13% of bytes; collapsing these into single tokens dramatically shortens encoded length.
Case-aware identifier anchors. We tokenize identifiers using camelCase and snake_case boundaries before BPE sees them, so that getUserAccountById is parsed as [get, User, Account, By, Id] rather than allowing BPE to learn idiosyncratic cross-word merges.
Keyword anchors. For each of 47 languages we ship a frozen list of keywords (typically 20-60 per language) that are reserved as atomic tokens. This guarantees that function, def, lambda, and similar appear as single tokens regardless of corpus frequency.
Formally, if is the BPE vocabulary budget and is the structural budget, we use total tokens; in our experiments K and K for a K total. The structural tokens are shielded from BPE merge competition.
def encode(text, structural_table, bpe_table):
pieces = apply_structural_merges(text, structural_table)
out = []
for p in pieces:
out.extend(bpe_encode(p, bpe_table) if not p.is_structural else [p])
return out4. Experimental Setup
We assemble a 312 GB corpus from 47 programming languages, weighted approximately by GitHub stars to approximate practical usage. Baselines are (a) GPT-2 BPE, (b) a freshly trained BPE on the same corpus with the same vocabulary size, and (c) StarCoder's tokenizer.
We evaluate compression as mean tokens per UTF-8 byte. Downstream impact is evaluated by training 1.3B-parameter decoder-only transformers from scratch with each tokenizer for 50B tokens of code.
5. Results
Compression. CATok achieves tokens/byte versus for BPE-same-vocab and for GPT-2 BPE. Per-language gains range from (Assembly) to (Python). Languages with significant indentation conventions benefit most.
Perplexity. At matched parameter count and matched byte budget (so CATok models see fewer tokens but the same data), CATok models reach nats/byte versus for BPE-same-vocab on a held-out test set (, replicates).
Effective context. Because each token covers more bytes on average, a fixed 8192-token context fits about 17% more code by volume. Practical implication: a 100-line Python file that previously occupied 4.1K BPE tokens now fits in 3.4K CATok tokens.
Failure modes. On heavily minified JavaScript and on languages with whitespace-insensitive grammars used in non-idiomatic styles, the gains shrink to single digits. CATok never lost to BPE-same-vocab in our tests but could underperform AST-aware tokenizers on small, well-formed corpora where parsing is cheap.
6. Discussion
The gains we report are real but bounded by Shannon's source-coding limit: there is some fundamental entropy of code that no tokenizer can compress below. CATok mostly recovers efficiency that BPE leaves on the table due to its frequency-only merge criterion. As corpora grow, the marginal benefit of CATok over BPE shrinks; we estimate the gap closes to as training corpus size approaches TB.
A practical caveat is that CATok tokens are not byte-prefix-free in the way GPT-2 BPE is, so detokenization requires the full structural table. We provide a compatibility layer for tools that assume prefix-free encodings.
7. Conclusion
Simple, language-aware structural primitives layered atop BPE yield a 14.6% mean reduction in token count on a large code corpus, with downstream perplexity and effective-context benefits that justify adoption. CATok requires no architectural change and adds negligible overhead at inference.
References
- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units.
- Feng, Z. et al. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
- Kim, S. et al. (2021). AST-Aware Tokenization for Source Code.
- Li, R. et al. (2023). StarCoder: May the Source Be With You.
- Karpathy, A. (2024). minBPE. Open-source.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.