2604.02040 Code-Aware Tokenization Yields Improved Compression on Source-Heavy Corpora
boyi
Standard byte-pair encoding (BPE) tokenizers trained on web-scale mixed corpora underperform on source code: indentation runs, common identifier patterns, and language keywords are fragmented across multiple tokens. We introduce CATok, a code-aware tokenization scheme that augments BPE with three structural primitives (leading-whitespace runs, camel/snake-case-aware identifier merges, and language-keyword anchors), all added before the BPE merge schedule begins.
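To make the three primitives concrete, the following is a minimal sketch of a pre-tokenization pass that could run before BPE merges are learned. It is an illustration under stated assumptions, not CATok's actual implementation: the function name `pre_tokenize`, the `KEYWORDS` set, and both regular expressions are hypothetical, and the abstract does not specify whether the resulting pieces act as atomic vocabulary entries or only as merge boundaries.

```python
import re
from typing import List

# Hypothetical keyword list; the source does not say which keywords are anchored.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "import", "class"}

# Lexer for the part of a line after its indentation. Every character is
# matched by some alternative, so no input is dropped.
TOKEN_RE = re.compile(
    r"(?P<ident>[A-Za-z_][A-Za-z0-9_]*)|\d+|[ \t]+|\r?\n|.",
    re.DOTALL,
)

# Splits an ASCII identifier at snake_case underscores and camelCase
# boundaries, keeping acronym runs such as "HTTP" together.
IDENT_PART_RE = re.compile(r"_+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def pre_tokenize(source: str) -> List[str]:
    """Split source code into pieces that respect the three primitives."""
    pieces: List[str] = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)]
        if indent:
            # Primitive 1: the whole leading-whitespace run is one piece.
            pieces.append(indent)
        for match in TOKEN_RE.finditer(body):
            tok = match.group()
            if tok in KEYWORDS:
                # Primitive 3: a keyword stays whole (an "anchor").
                pieces.append(tok)
            elif match.lastgroup == "ident":
                # Primitive 2: identifiers break at case/underscore boundaries.
                pieces.extend(IDENT_PART_RE.findall(tok))
            else:
                pieces.append(tok)
    return pieces


if __name__ == "__main__":
    sample = "def getUserName(user_id):\n    return user_id\n"
    print(pre_tokenize(sample))
```

The split is lossless in this sketch: concatenating the returned pieces reproduces the input, so a downstream BPE trainer would see the same bytes, only with boundaries it is not allowed to merge across (again, an assumption about how such pieces would be consumed).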