Tokenizer Fingerprints: How Subword Segmentation Shapes Embedding Similarity
1. Introduction
When practitioners deploy embedding models for semantic search, retrieval-augmented generation (RAG), or document clustering, they typically focus on the model itself — its architecture, training data, and dimensionality. The tokenizer, the component that transforms raw text into the discrete units the model actually processes, is treated as an implementation detail. Yet the tokenizer is the first transformation applied to every input, and its choices about how to segment words into subwords fundamentally constrain what the model can represent.
This paper investigates how tokenization shapes embedding similarity through two complementary experiments. First, we compare three major subword tokenization algorithms — WordPiece, Byte Pair Encoding (BPE), and SentencePiece (Unigram) — to characterize how they segment the same text differently, particularly for rare, technical, and out-of-vocabulary (OOV) words. Second, we test four production embedding models that share the same WordPiece tokenizer but differ in architecture and training to isolate the contribution of learned representations versus tokenization to downstream similarity behavior.
Our findings reveal several counterintuitive patterns. We confirm that the dominant embedding models (MiniLM, BGE, Nomic, GTE) all share an identical BERT WordPiece tokenizer with the same 30,522-token vocabulary — a fact that is documented but rarely emphasized in practice, and whose implications for model comparison are underappreciated. This shared tokenizer creates a natural controlled experiment: because tokenization differences between these models are exactly zero, every difference in their similarity scores arises purely from learned representations, whose contribution we can therefore isolate precisely. Separately, by comparing genuinely different tokenization algorithms (WordPiece, BPE, SentencePiece) on the same text, we characterize the space of tokenization variation that would arise if embedding models adopted different tokenizer families. We find that BPE (GPT-2 style) produces the most compact representations for novel words (mean 3.9 subtokens for nonsense words vs. 4.7 for SentencePiece), while SentencePiece shows the highest variance in token counts across sentences (standard deviation 3.0 vs. 2.2 for BPE).
More critically, we find that token-level overlap is a surprisingly strong predictor of embedding-level similarity, with Pearson correlations ranging from 0.70 to 0.77 across models. This means that roughly half the variance in cosine similarity between sentence pairs can be predicted from tokenization alone, without any learned representations. We also find dramatic differences in OOV sensitivity: replacing a known entity with a nonsense word reduces similarity by an average of 0.12 in MiniLM but only 0.04 in GTE, a more than threefold difference between models sharing the same tokenizer.
These results matter for anyone building production systems on embeddings. If you are retrieving documents about rare entities, specialized terminology, or multilingual content, the tokenizer's handling of OOV words directly impacts your system's behavior — and the model you choose determines how gracefully (or not) it degrades.
2. Background
2.1 Subword Tokenization Algorithms
Modern language models do not operate on raw characters or whole words. Instead, they use subword tokenization algorithms that learn to segment text into a vocabulary of frequently occurring character sequences. Three algorithms dominate the field:
WordPiece (Schuster and Nakajima, 2012), adopted by BERT (Devlin et al., 2019), builds its vocabulary greedily by maximizing the likelihood of the training data. Given a word, WordPiece splits it into the longest matching vocabulary pieces, using a "##" prefix to indicate continuation tokens. For example, "cryptocurrency" becomes ["crypt", "##oc", "##ur", "##ren", "##cy"]. WordPiece operates at the word level: it first splits text into words, then applies subword segmentation to each word independently.
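WordPiece's longest-match-first rule can be sketched in a few lines. The toy vocabulary below is purely illustrative (it is not BERT's actual 30,522-entry vocabulary), but the segmentation logic mirrors the algorithm described above:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, BERT-style.

    Continuation pieces carry a "##" prefix in the vocabulary.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # non-initial pieces are looked up with "##"
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary piece matches: whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary chosen to reproduce the "cryptocurrency" example:
vocab = {"crypt", "##oc", "##ur", "##ren", "##cy", "play", "##ing"}
print(wordpiece_tokenize("cryptocurrency", vocab))
# -> ['crypt', '##oc', '##ur', '##ren', '##cy']
```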
Byte Pair Encoding (BPE) (Sennrich et al., 2016) iteratively merges the most frequent pair of adjacent tokens in the training corpus. Starting from individual characters, it builds up longer units through statistical co-occurrence. The GPT family of models uses BPE. Importantly, BPE operates on byte-level representations, meaning it can encode any UTF-8 string without an explicit unknown token. The same "cryptocurrency" under GPT-2's BPE becomes ["crypt", "oc", "urrency"] — note how BPE preserves "urrency" as a single unit because it is frequent in training data.
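The merge-learning loop at the heart of BPE can be sketched as follows. This is a simplified character-level version; GPT-2's actual tokenizer operates on bytes and adds pre-tokenization rules, but the core idea — repeatedly fusing the most frequent adjacent pair — is the same:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge everywhere it occurs.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # first merges build up the frequent stem "low"
```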
SentencePiece/Unigram (Kudo and Richardson, 2018) takes a probabilistic approach, treating tokenization as a latent variable and selecting the segmentation that maximizes the probability under a unigram language model. T5 and several multilingual models use this approach. SentencePiece operates directly on raw text without requiring pre-tokenization into words, using a special "▁" character to mark word boundaries. Under T5's SentencePiece, "cryptocurrency" becomes a single token ["▁cryptocurrency"] because the unigram model determined it is frequent enough to keep whole.
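The unigram model's segmentation step reduces to a Viterbi search over all ways of splitting a word, scored by per-piece log-probabilities. A minimal sketch, with invented toy probabilities (real SentencePiece learns these via EM over a large corpus):

```python
import math

def unigram_segment(word, logprob):
    """Viterbi search for the max-probability segmentation under a unigram LM."""
    n = len(word)
    best = [(-math.inf, 0)] * (n + 1)  # (best score ending here, backpointer)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob and best[start][0] + logprob[piece] > best[end][0]:
                best[end] = (best[start][0] + logprob[piece], start)
    if best[n][0] == -math.inf:
        return None  # word cannot be segmented with this vocabulary
    pieces, i = [], n
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return pieces[::-1]

# Toy log-probabilities: keeping the whole word beats splitting it.
lp = {"cryptocurrency": -8.0, "crypto": -6.0, "currency": -5.0}
print(unigram_segment("cryptocurrency", lp))
# -> ['cryptocurrency']  (-8.0 beats "crypto" + "currency" at -11.0)
```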
2.2 Embedding Models
Sentence embedding models map variable-length text into fixed-dimensional vectors where geometric relationships (typically cosine similarity) correspond to semantic relationships. The Sentence-BERT framework (Reimers and Gurevych, 2019) established the paradigm of fine-tuning BERT-like models with siamese and triplet network structures to produce semantically meaningful sentence embeddings.
The current generation of production embedding models — including all-MiniLM-L6-v2, BGE-large-en-v1.5, nomic-embed-text-v1.5, and GTE-large — are descendants of this approach, using contrastive learning on large-scale text pair datasets. A critical but underappreciated fact is that all of these models inherit BERT's WordPiece tokenizer with its 30,522-token vocabulary. Despite being trained on different data with different architectures (from 22M to 335M parameters), they all see the same subword decomposition of every input.
2.3 The Tokenization-Embedding Gap
There is an implicit assumption in the embedding literature that the model's learned representations can overcome any artifacts of tokenization. If "cryptocurrency" is split into five subwords by WordPiece, surely the model learns to compose those pieces into a meaningful whole-word representation. Our experiments test this assumption directly by measuring how much of the variance in embedding similarity can be predicted from token-level overlap alone.
3. Experimental Setup
3.1 Tokenizer Comparison (Experiment 1)
We compare three tokenizers representing each major algorithm:
| Tokenizer | Algorithm | Vocabulary Size | Source Model |
|---|---|---|---|
| BERT-WordPiece | WordPiece | 30,522 | all-MiniLM-L6-v2 |
| GPT2-BPE | Byte Pair Encoding | 50,257 | GPT-2 |
| T5-SentencePiece | Unigram/SentencePiece | 32,100 | T5-small |
For each tokenizer, we measure:
- Token count statistics across 50 test sentences spanning common language, technical terminology, and rare/nonsense words
- Subword boundary placement — which words get split and where
- OOV handling — how each algorithm decomposes 20 fabricated nonsense words (e.g., "Xylophrix", "Quarbitone", "flobnaxitol")
- Domain word handling — segmentation of 20 technical terms from medicine, science, and technology
- Cross-sentence token overlap — Jaccard similarity between token sets of sentence pairs
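The cross-sentence overlap metric is the standard Jaccard index over token sets:

```python
def token_jaccard(tokens_a, tokens_b):
    """Jaccard index of two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty sequences are trivially identical
    return len(a & b) / len(a | b)

print(token_jaccard(["the", "dog", "ran"], ["the", "cat", "ran"]))  # 2/4 = 0.5
```

Note that set semantics discard token order and repetition, which is exactly why order-inverting pairs (Section 4.4's entity swaps) score a perfect 1.0 on this metric.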
3.2 Embedding Model Comparison (Experiment 2)
We evaluate four production embedding models that share the same BERT WordPiece tokenizer:
| Model | Parameters | Dimension | Training Focus |
|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | General semantic similarity |
| BGE-large-en-v1.5 | 335M | 1024 | Retrieval-oriented |
| nomic-embed-text-v1.5 | 137M | 768 | Long-context embedding |
| GTE-large | 335M | 1024 | General text embedding |
For each model, we compute:
- Token overlap vs. embedding similarity correlation: For 100 sentence pairs drawn from eight categories (negation, numerical, entity swap, temporal, quantifier, hedging, positive paraphrase, and negative/unrelated controls), we compute both the Jaccard index of their token sets and the cosine similarity of their embeddings, then measure Pearson and Spearman correlations.
- Per-category similarity profiles: Mean cosine similarity within each pair category, revealing how each model handles different types of semantic relationships.
- OOV sensitivity: For 20 sentence pairs, we replace a known entity (e.g., "Shakespeare") with a nonsense word (e.g., "Frondlebard") and measure the change in cosine similarity. This tests how gracefully each model degrades when encountering unknown tokens.
All experiments run on CPU with sequential model loading and garbage collection between models to manage memory. Embeddings use mean pooling over last hidden states with L2 normalization.
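The pooling step can be sketched as follows (pure Python for clarity; production code would operate on the model's output tensors, masking padding positions via the attention mask):

```python
import math

def mean_pool_normalize(token_vecs, mask):
    """Mean-pool token vectors over unmasked positions, then L2-normalize."""
    kept = [v for v, m in zip(token_vecs, mask) if m]
    dim = len(kept[0])
    pooled = [sum(v[d] for v in kept) / len(kept) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

def cosine(u, v):
    # Embeddings are unit-length after normalization, so cosine is a dot product.
    return sum(a * b for a, b in zip(u, v))

# Third "token" is a padding position and is excluded by the mask.
emb = mean_pool_normalize([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
print(emb)  # unit vector along the diagonal of the two kept tokens
```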
3.3 Test Data
Our test data comprises:
- 50 test sentences covering everyday language ("The dog chased a ball across the park"), technical terms ("Electroencephalography measures brain electrical activity"), and extreme cases ("Pneumonoultramicroscopicsilicovolcanoconiosis is a lung disease")
- 100 sentence pairs drawn from a hand-crafted evaluation set spanning negation pairs (15), numerical pairs (15), entity swap pairs (10), temporal inversion pairs (10), quantifier sensitivity pairs (10), hedging pairs (5), true paraphrases (20), and unrelated pairs (15)
- 20 OOV test pairs where known entities are replaced with fabricated nonsense words
4. Results
4.1 Tokenizer Segmentation Patterns
The three tokenization algorithms produce meaningfully different segmentations of the same text. Table 1 summarizes the key statistics.
Table 1: Tokenizer Statistics Across 50 Test Sentences
| Tokenizer | Mean Tokens | Std Dev | Min | Max | Words Split | OOV Mean Tokens | Domain Mean Tokens |
|---|---|---|---|---|---|---|---|
| BERT-WordPiece | 9.3 | 2.4 | 6 | 21 | 43 | 4.2 | 4.5 |
| GPT2-BPE | 8.8 | 2.2 | 6 | 19 | 159 | 3.9 | 4.2 |
| T5-SentencePiece | 9.8 | 3.0 | 6 | 23 | 63 | 4.7 | 4.2 |
Several patterns emerge:
BPE is the most compact for novel words. GPT-2's BPE produces the fewest subtokens for nonsense words (mean 3.9 vs. 4.2 for WordPiece and 4.7 for SentencePiece). This stems from BPE's larger vocabulary (50,257 tokens) and its byte-level foundation, which allows it to find longer matching subsequences even for invented words. For example, "zmorphitis" becomes just 3 BPE tokens ["z", "morph", "itis"] because BPE has learned "morph" and "itis" as common byte sequences, while WordPiece shatters it into 5 pieces ["z", "##mo", "##rp", "##hiti", "##s"].
SentencePiece has the highest variance. T5's SentencePiece tokenizer shows a standard deviation of 3.0 tokens per sentence, compared to 2.2 for BPE and 2.4 for WordPiece. This reflects its unigram approach: common words like "cryptocurrency" are kept as single tokens (["▁cryptocurrency"]), while rare words are aggressively decomposed. The word "Xylophrix" becomes 7 SentencePiece tokens ["▁", "X", "y", "l", "oph", "r", "ix"] versus only 4 BPE tokens ["X", "yl", "oph", "rix"].
BPE reports the most subword splits but this is a counting artifact. GPT-2's BPE reports 159 words split into subwords compared to 43 for WordPiece and 63 for SentencePiece. This initially surprising result arises because BPE treats capitalized words differently from lowercase — "The" and "the" may tokenize differently — and because BPE's lack of a word-boundary prefix means even common words may include leading space tokens.
Morphological awareness varies dramatically. Consider "antidisestablishmentarian" — a morphologically rich English word. BPE segments it as ["ant", "idis", "establishment", "arian"], capturing the key morpheme "establishment" intact. WordPiece produces ["anti", "##dis", "##est", "##ab", "##lish", "##ment", "##arian"], preserving prefix morphemes but shattering the root. SentencePiece yields ["▁anti", "d", "is", "est", "abl", "ish", "ment", "arian"] — 8 tokens with the most aggressive decomposition. The BPE segmentation is arguably the most linguistically meaningful.
Table 2: Selected Word Tokenizations Across Algorithms
| Word | WordPiece | BPE | SentencePiece |
|---|---|---|---|
| Electroencephalography | electro·ence·pha·log·raphy (5) | Elect·ro·ence·phal·ography (5) | Electro·ence·phal·ography (4) |
| cryptocurrency | crypt·oc·ur·ren·cy (5) | crypt·oc·urrency (3) | cryptocurrency (1) |
| semiconductor | semiconductor (1) | se·mic·onductor (3) | semiconductor (1) |
| Neurodegenerative | ne·uro·de·gen·erative (5) | Ne·uro·deg·ener·ative (5) | Neuro·de·generative (3) |
| deoxyribonucleic | de·ox·yr·ib·on·uc·lei·c (8) | de·oxy·rib·on·ucle·ic (6) | de·oxy·rib·on·u·cle·ic (7) |
| Xylophrix | x·yl·op·hri·x (5) | X·yl·oph·rix (4) | ▁·X·y·l·oph·r·ix (7) |
| zmorphitis | z·mo·rp·hiti·s (5) | z·morph·itis (3) | ▁·z·morph·it·is (5) |
4.2 The Hidden Tokenizer Monoculture
A central finding of this study is that the four embedding models we tested — MiniLM, BGE, Nomic, and GTE — all use an identical BERT WordPiece tokenizer with the same 30,522-token vocabulary. While each of these models is documented as using a BERT-family tokenizer, the practical implication of this shared tokenizer is rarely emphasized: every model produces byte-identical tokenizations for every input.
This has profound implications for model comparison. It means that any difference in embedding similarity between these models arises entirely from their learned representations — the transformer layers that compose subword embeddings into sentence representations. The tokenizer contributes zero differentiating information. Practitioners who switch between these models hoping for complementary "perspectives" on text should understand that all models see exactly the same input decomposition.
We verified this by comparing tokenizations of all 50 test sentences, all 20 OOV words, and all 20 domain terms across all four models. In every case, the token sequences were character-for-character identical: same vocabulary IDs, same subword boundaries, same continuation tokens. The vocabulary size, special tokens, and tokenizer class (BertTokenizerFast) are shared across all four.
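The verification reduces to an identity check over tokenizer outputs. The sketch below accepts arbitrary tokenize callables; in our setting each would wrap one model's `BertTokenizerFast.tokenize`, and the lowercase/uppercase stubs shown in the usage are hypothetical stand-ins:

```python
def verify_identical_tokenization(tokenize_fns, texts):
    """Return True iff every tokenizer yields the same token sequence
    for every input text."""
    for text in texts:
        outputs = [fn(text) for fn in tokenize_fns]
        if any(out != outputs[0] for out in outputs[1:]):
            return False
    return True

# Stand-in tokenizers for illustration only:
lower = lambda t: t.lower().split()
upper = lambda t: t.upper().split()
print(verify_identical_tokenization([lower, lower], ["The dog ran"]))  # True
print(verify_identical_tokenization([lower, upper], ["The dog ran"]))  # False
```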
4.3 Token Overlap as a Predictor of Embedding Similarity
Despite the tokenizer monoculture, token-level overlap remains a strong predictor of embedding-level similarity across all models. Table 3 shows the correlation between Jaccard token overlap and cosine embedding similarity for 100 sentence pairs.
Table 3: Correlation Between Token Overlap (Jaccard) and Embedding Similarity (Cosine)
| Model | Pearson r | p-value | Spearman ρ | p-value |
|---|---|---|---|---|
| MiniLM (22M, 384d) | 0.766 | 1.51e-20 | 0.832 | 7.91e-27 |
| BGE (335M, 1024d) | 0.703 | 3.71e-16 | 0.663 | 5.59e-14 |
| Nomic (137M, 768d) | 0.755 | 1.08e-19 | 0.811 | 1.66e-24 |
| GTE (335M, 1024d) | 0.709 | 1.49e-16 | 0.673 | 1.74e-14 |
All correlations are highly significant (p < 1e-13). The key observation is that smaller models show stronger token-similarity coupling. MiniLM (22M parameters) has the highest correlations (Pearson 0.766, Spearman 0.832), while the larger models BGE and GTE (both 335M parameters) show weaker correlations (Pearson ~0.70, Spearman ~0.67). This suggests that larger models have learned to partially decouple their representations from surface-level token overlap — they can assign similar embeddings to sentences that share few tokens (paraphrases) and different embeddings to sentences that share many tokens (negations).
The Spearman correlation is consistently higher than Pearson for the smaller models (MiniLM: 0.832 vs. 0.766; Nomic: 0.811 vs. 0.755), indicating a monotonic but non-linear relationship: token overlap is a good predictor of the ranking of similarity scores even when it does not precisely predict the magnitude.
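The two statistics differ only in a rank transform, which is what lets Spearman capture monotonic-but-nonlinear coupling. A self-contained sketch (in practice `scipy.stats.pearsonr`/`spearmanr` also supply the p-values reported in Table 3):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(x):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rho is simply Pearson applied to the rank transforms.
    return pearson(ranks(x), ranks(y))

# Monotonic but nonlinear data: Spearman is exactly 1, Pearson falls short.
print(pearson([1, 2, 3], [1, 4, 9]), spearman([1, 2, 3], [1, 4, 9]))
```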
4.4 Per-Category Similarity Profiles
Breaking down by pair category reveals how each model handles different types of semantic relationships. Table 4 shows the mean Jaccard token overlap and mean cosine similarity per category.
Table 4: Mean Jaccard Token Overlap and Cosine Similarity by Pair Category
| Category | Mean Jaccard | MiniLM Cosine | BGE Cosine | Nomic Cosine | GTE Cosine |
|---|---|---|---|---|---|
| Entity Swap | 1.000 | 0.987 | 0.993 | 0.988 | 0.992 |
| Temporal | 0.719 | 0.965 | 0.956 | 0.962 | 0.972 |
| Negation | 0.756 | 0.889 | 0.921 | 0.931 | 0.941 |
| Numerical | 0.672 | 0.882 | 0.945 | 0.929 | 0.954 |
| Quantifier | 0.543 | 0.819 | 0.893 | 0.879 | 0.922 |
| Hedging | 0.348 | 0.813 | 0.885 | 0.858 | 0.926 |
| Positive (paraphrase) | 0.237 | 0.765 | 0.931 | 0.875 | 0.946 |
| Negative (unrelated) | 0.032 | 0.015 | 0.599 | 0.470 | 0.711 |
Several striking patterns emerge:
Entity swap blindness is universal. When subject and object are swapped ("Google acquired YouTube" vs. "YouTube acquired Google"), token overlap is perfect (Jaccard 1.0 — the same tokens in different order) and all models assign near-identical similarity (0.987-0.993). Mean-pooled embeddings are inherently insensitive to word order; this is a known limitation but the magnitude (>0.98 cosine for semantically opposite relationships) deserves emphasis.
Negation remains problematic. Despite high token overlap (0.756), negation pairs ("The patient has diabetes" vs. "The patient does not have diabetes") receive cosine similarities of 0.889-0.941. This means models cannot reliably distinguish affirmative from negative statements. MiniLM shows the lowest negation similarity (0.889), making it paradoxically better at detecting negation despite being the smallest model.
Model scale dramatically affects the similarity floor. The most revealing difference is in the "negative" (unrelated) category, where sentences share almost no tokens (Jaccard 0.032). MiniLM assigns a mean similarity of 0.015, essentially zero. GTE assigns 0.711 — meaning unrelated sentences are rated as 71% similar. BGE gives 0.599 and Nomic gives 0.470. This "similarity floor" effect means that larger models compress the usable similarity range. A system using GTE must distinguish "relevant" (cosine > 0.71) from "irrelevant" (cosine ≈ 0.71), while MiniLM has the full [0, 1] range available.
The positive paraphrase gap reveals representation quality. True paraphrases (low token overlap at 0.237 since they use different words) show the widest spread across models: MiniLM at 0.765, GTE at 0.946. Larger models are dramatically better at recognizing semantic equivalence despite surface-level lexical divergence. This is precisely where learned representations transcend tokenization.
4.5 OOV Sensitivity
When a known entity is replaced with a fabricated nonsense word, how much does similarity change? This test directly measures robustness to unknown vocabulary. Table 5 summarizes the results.
Table 5: OOV Sensitivity — Similarity Change When Entity Is Replaced with Nonsense Word
| Model | Mean Δ Similarity | Std Dev | Max Δ | Most Affected Word |
|---|---|---|---|---|
| MiniLM (22M) | 0.123 | 0.062 | 0.246 | Fibonacci → Zragnacci |
| Nomic (137M) | 0.104 | 0.034 | 0.191 | Shakespeare → Frondlebard |
| BGE (335M) | 0.056 | 0.020 | 0.091 | DNA → Glorphenex |
| GTE (335M) | 0.036 | 0.014 | 0.064 | Tokyo → Quonzaville |
Model scale strongly predicts OOV robustness. There is a clear inverse relationship between model size and OOV sensitivity: MiniLM (22M parameters) shows a mean similarity change of 0.123, while GTE (335M) shows only 0.036 — a 3.4x difference. This suggests that larger models distribute semantic information more evenly across token positions, making any single token's identity less critical to the overall sentence representation.
The worst cases are revealing. The largest similarity drops occur when the replaced word is semantically important and the replacement shares no subword overlap with the original. "Fibonacci → Zragnacci" causes a 0.246 drop in MiniLM because "Fibonacci" tokenizes to ["fi", "##bon", "##ac", "##ci"] while "Zragnacci" becomes ["z", "##rag", "##nac", "##ci"] — they share only the final continuation token "##ci". The model cannot recover the semantic content from the fragmented nonsense subwords.
Even shared subword fragments do not help. One might expect that if the replacement word shares some subtokens with the original, the similarity drop would be smaller. But "Fibonacci" and "Zragnacci" share "##ci" yet show the largest drop. The model has learned that specific subword sequences signal specific meanings; random combinations of valid subtokens do not approximate the original meaning.
Table 6: Detailed OOV Examples from MiniLM (Largest Sensitivity)
| Original | Replacement | Orig Tokens | Repl Tokens | Δ Similarity |
|---|---|---|---|---|
| Fibonacci | Zragnacci | fi·bon·ac·ci | z·rag·nac·ci | 0.246 |
| Einstein | Wompelfritz | einstein | wo·mp·el·fr·itz | 0.219 |
| Hamlet | Grizzelwick | hamlet | gr·iz·zel·wick | 0.191 |
| Shakespeare | Frondlebard | shakespeare | fr·ond·le·bard | 0.190 |
| DNA | Glorphenex | dna | g·lor·ph·ene·x | 0.179 |
An interesting pattern: words that tokenize as single whole tokens ("einstein", "hamlet", "shakespeare", "dna") show large drops when replaced with multi-subword nonsense, because the model's single, dedicated representation for the whole word is destroyed outright. Words already split into subwords tend to degrade more gracefully, since the model's representation is already distributed across positions — though "Fibonacci" (itself four subtokens) is a notable exception, producing the largest drop of all.
4.6 Subword Boundary Consistency Across Tokenizers
How consistently do different tokenizers place subword boundaries? We examined the 20 domain-specific technical terms and found that boundary placement is highly algorithm-dependent.
For the word "deoxyribonucleic", WordPiece produces 8 tokens with boundaries at ["de", "ox", "yr", "ib", "on", "uc", "lei", "c"], BPE produces 6 tokens ["de", "oxy", "rib", "on", "ucle", "ic"], and SentencePiece produces 7 tokens ["de", "oxy", "rib", "on", "u", "cle", "ic"]. The morphologically meaningful boundary "deoxy-ribo-nucleic" is not perfectly captured by any algorithm, but BPE comes closest by preserving "oxy" and "ucle" (close to "nuclei").
For "psychopharmacological", WordPiece produces ["psycho", "pha", "rma", "col", "ogical"] (5 tokens), BPE produces ["psych", "oph", "armac", "ological"] (4 tokens), and SentencePiece produces ["psycho", "pharma", "c", "ological"] (4 tokens). Here, SentencePiece best captures the meaningful morpheme "pharma", while WordPiece's "psycho" prefix is preserved but the rest is fragmented.
The practical implication is that cross-model comparisons for domain-specific text depend heavily on which tokenizer is used. A RAG system indexing pharmacological literature would get different subword representations depending on the tokenizer, and these differences propagate into the embedding space.
5. Analysis
5.1 Why Tokenization Predicts Embedding Similarity
The strong correlation between token overlap and embedding similarity (r = 0.70–0.77) has a straightforward explanation: mean-pooled sentence embeddings are weighted averages of token embeddings. When two sentences share tokens, they share embedding components, directly increasing their cosine similarity. The relationship is not perfect because:
- Position and context matter. The same token in different positions and surrounded by different context tokens will produce different contextualized embeddings through the transformer's attention mechanism.
- Non-shared tokens contribute. Tokens unique to one sentence pull the mean embedding in different directions, reducing similarity even when overlap is high.
- Semantic learning transcends tokens. Models learn that "dog" and "canine" should have similar representations despite sharing zero tokens.
The fact that larger models show weaker token-similarity coupling (r = 0.70 for BGE/GTE vs. 0.77 for MiniLM) confirms that additional capacity allows the model to learn more abstract representations that are less determined by surface token overlap.
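This coupling can be demonstrated with a toy mean-pooling "model" over random but fixed per-token vectors — no learned semantics at all, yet token overlap alone produces a similarity gradient. All vectors and sentences here are invented for illustration:

```python
import math, random

def embed(tokens, table, dim=16):
    """Mean-pool fixed random per-token vectors, then L2-normalize."""
    vecs = []
    for t in tokens:
        if t not in table:
            rng = random.Random(t)  # deterministic vector per token string
            table[t] = [rng.gauss(0, 1) for _ in range(dim)]
        vecs.append(table[t])
    pooled = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

def cos(u, v):
    return sum(a * b for a, b in zip(u, v))

table = {}
s1 = embed("the model embeds the sentence".split(), table)
s2 = embed("the model embeds the document".split(), table)  # 4/5 tokens shared
s3 = embed("completely different words here entirely".split(), table)  # none shared
print(cos(s1, s2), cos(s1, s3))  # high overlap yields the higher cosine
```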
5.2 The Similarity Floor and Embedding Anisotropy
Our most practically significant finding is the dramatic variation in similarity floor across models. MiniLM assigns essentially zero similarity to unrelated pairs (0.015), while GTE assigns 0.711. This is not a tokenization effect — all models see the same tokens. It is a representation geometry effect closely related to the well-documented phenomenon of embedding anisotropy (Ethayarajh, 2019; Li et al., 2020). Anisotropy — the tendency of neural network hidden states to occupy a narrow cone in the embedding space rather than being uniformly distributed on the unit sphere — directly causes elevated baseline cosine similarity between arbitrary sentence pairs.
The anisotropy explanation accounts for our observation that larger models (BGE, GTE at 1024d) show higher similarity floors than smaller models (MiniLM at 384d). Higher-dimensional spaces with anisotropic distributions concentrate embeddings in a smaller fraction of the available space, compressing the effective similarity range. Ethayarajh (2019) showed that BERT's contextualized representations become increasingly anisotropic in later layers, and this property is inherited by models fine-tuned from BERT checkpoints.
This has direct consequences for retrieval thresholds. A system using MiniLM can set a similarity threshold of 0.5 to filter irrelevant documents and expect strong precision. The same threshold with GTE would include most of the corpus as "relevant." Practitioners must calibrate thresholds per model, and this calibration depends on understanding the base similarity distribution — something rarely discussed in model documentation. Post-hoc whitening (Su et al., 2021) or other distribution normalization techniques can mitigate anisotropy, but these are rarely applied in production RAG pipelines.
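Per-model calibration against labeled pairs can be as simple as sweeping candidate cutoffs for the best separation. The score distributions below are illustrative stand-ins echoing the floor effect from Table 4, not measured values:

```python
def calibrate_threshold(relevant_scores, irrelevant_scores):
    """Pick the cosine cutoff that best separates known relevant from
    known irrelevant pairs (maximizes classification accuracy)."""
    candidates = sorted(set(relevant_scores) | set(irrelevant_scores))
    best_t, best_acc = 0.0, -1.0
    total = len(relevant_scores) + len(irrelevant_scores)
    for t in candidates:
        correct = sum(s >= t for s in relevant_scores) \
                + sum(s < t for s in irrelevant_scores)
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical distributions: a MiniLM-like model with a near-zero floor
# vs. a GTE-like model whose "irrelevant" scores cluster around 0.7.
minilm_t, _ = calibrate_threshold([0.75, 0.80, 0.88], [0.01, 0.05, 0.10])
gte_t, _ = calibrate_threshold([0.93, 0.95, 0.96], [0.70, 0.71, 0.73])
print(minilm_t, gte_t)  # the GTE-like cutoff sits far higher
```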
5.3 The Tokenizer Monoculture and Its Consequences
The fact that four major embedding model families share an identical tokenizer, while individually documented, has underappreciated aggregate consequences. If a word is poorly tokenized by BERT's WordPiece vocabulary — which was built from English Wikipedia and BookCorpus — then every embedding model inherits that limitation. Technical terminology from specialized domains, slang, code identifiers, non-Latin scripts, and domain-specific vocabulary may all suffer from suboptimal tokenization, and no amount of model scaling will fix this because the input representation is constrained at the tokenizer level.
This is particularly concerning for multilingual and code-related applications. BERT's WordPiece vocabulary was trained predominantly on English text; its coverage of other languages' morphology is incidental rather than designed. Yet embedding models built on this tokenizer are increasingly used for multilingual retrieval.
5.4 What the Three Tokenizer Algorithms Tell Us
Our comparison of WordPiece, BPE, and SentencePiece reveals that the choice of tokenization algorithm creates systematic biases:
BPE favors compositional recognition. By preserving longer meaningful subsequences (e.g., "morph", "itis", "establishment"), BPE creates token representations that carry more semantic information per token. This could benefit downstream similarity by providing richer building blocks for composition.
SentencePiece favors whole-word recognition. Its unigram model keeps frequent technical terms as single tokens (e.g., "cryptocurrency" as one token), which gives the model a direct, unambiguous representation. However, unfamiliar words are more aggressively decomposed (7 tokens for "Xylophrix" vs. 4 for BPE).
WordPiece is the conservative middle ground. It produces neither the most nor the least fragmented representations, but its word-level pre-tokenization means it never captures cross-word patterns. The "##" continuation prefix provides a clear signal of subword status, which the model can learn to use.
6. Practical Implications for RAG Practitioners
Based on our findings, we offer the following recommendations:
1. Check your tokenizer, not just your model. Before deploying an embedding model, inspect how it tokenizes your domain-specific vocabulary. If critical terms are fragmented into many subtokens, the model's ability to represent them semantically is compromised. Run our tokenization analysis on a sample of your actual documents.
2. Calibrate similarity thresholds per model. Do not assume that a cosine similarity of 0.8 means the same thing across different models. Our results show that MiniLM's 0.8 corresponds to strong semantic similarity, while GTE's 0.8 may only slightly exceed its floor for unrelated pairs. Test with known relevant and irrelevant pairs from your domain.
3. Prefer smaller models for tasks requiring discrimination. If your application needs to distinguish between subtly different texts (e.g., negated vs. affirmed medical statements), smaller models like MiniLM may actually outperform larger models despite lower benchmark scores. MiniLM's wider similarity range (0.015 to 0.987) provides more room for discrimination than GTE's compressed range (0.711 to 0.992).
4. Test OOV robustness for entity-heavy domains. If your corpus contains many rare entities (product names, gene symbols, chemical compounds), test how replacing them with nonsense words affects retrieval. Models with lower OOV sensitivity (GTE: 0.036 mean Δ) will be more robust, while models with higher sensitivity (MiniLM: 0.123 mean Δ) will treat unknown entities as more disruptive.
5. Consider tokenizer diversity for ensemble approaches. Since all major BERT-family embedding models share the same tokenizer, ensembling them provides no tokenization diversity. For true diversity, consider including a model with a different tokenizer (e.g., a BPE-based or SentencePiece-based embedding model) in your ensemble or re-ranking pipeline.
6. Monitor for tokenizer drift. As language evolves and new terminology emerges, a frozen tokenizer vocabulary becomes increasingly mismatched with the text it processes. Consider periodic evaluation of tokenizer coverage on your domain's vocabulary and track the proportion of out-of-vocabulary subword fragmentation over time.
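Recommendation 1 can be operationalized as a small fragmentation audit over your domain vocabulary. The `chunk4` stub below is a hypothetical stand-in; in practice you would pass your model tokenizer's `tokenize` method:

```python
def fragmentation_report(terms, tokenize):
    """Rank domain terms by subtoken count; heavily split terms are the
    ones whose representations depend most on subword composition."""
    report = sorted(((len(tokenize(t)), t) for t in terms), reverse=True)
    return [(term, count) for count, term in report]

# Hypothetical tokenizer stub that splits every 4 characters:
chunk4 = lambda w: [w[i:i + 4] for i in range(0, len(w), 4)]
for term, n in fragmentation_report(["DNA", "pharmacokinetics"], chunk4):
    print(f"{term}: {n} subtokens")
```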
7. Limitations
Sample size. Our experiments use 50 test sentences, 100 sentence pairs, and 20 OOV substitutions. While the statistical significance of our correlation results is robust (all p-values < 1e-13), the absolute magnitudes of effects may shift with larger samples. In particular, the OOV sensitivity estimates (based on 20 pairs) should be treated as indicative rather than definitive. A production-scale evaluation would use thousands of sentence pairs and hundreds of OOV substitutions with controlled linguistic properties.
Experimental design gap. Our two experiments are complementary but not fully integrated: Experiment 1 compares tokenizers without embedding models, while Experiment 2 compares embedding models that share a single tokenizer. This means we cannot directly measure how BPE or SentencePiece tokenization affects end-to-end embedding similarity. We designed the study this way deliberately — the shared tokenizer creates a controlled experiment for isolating representation effects — but the missing cross between different tokenizers and trained embedding models is a significant gap that future work should address, for example by evaluating embedding models built on GPT-NeoX (BPE) or T5 (SentencePiece) architectures.
Embedding anisotropy. Our similarity floor observations connect to the well-studied phenomenon of embedding anisotropy (Ethayarajh, 2019), but we do not measure anisotropy directly (e.g., via intrinsic dimensionality or eigenvalue spectrum analysis). A fuller treatment would decompose the similarity floor into components attributable to anisotropy, mean shift, and effective dimensionality.
CPU-only evaluation. All experiments were conducted on CPU, which limited us to four models with sequential processing. GPU-based evaluation would allow testing more models and larger batch sizes, though our core findings about tokenizer identity are independent of hardware.
English-only evaluation. Our test sentences are exclusively in English. Tokenization effects are likely much more pronounced for morphologically rich languages (Turkish, Finnish), agglutinative languages (Japanese), and languages poorly represented in BERT's training data.
Mean pooling only. We used mean pooling for all models, though some (e.g., BGE) recommend CLS token pooling for certain tasks. Different pooling strategies may moderate the relationship between token overlap and embedding similarity.
No fine-tuned models. We tested only general-purpose pre-trained embedding models. Domain-specific fine-tuned models might show different tokenization sensitivity patterns, particularly if fine-tuning adapts the model to handle domain-specific subword patterns.
8. Conclusion
This paper has demonstrated that tokenization is both more important and less diverse than commonly assumed in the embedding model ecosystem. Three key findings emerge:
First, the major production embedding models (MiniLM, BGE, Nomic, GTE) share an identical WordPiece tokenizer, creating a monoculture where every model sees the same subword decomposition. This means that tokenization cannot explain any differences between these models' similarity scores — all variation comes from learned representations.
Second, despite this shared tokenizer, token-level overlap explains 49-59% of the variance in embedding similarity (r² = 0.49-0.59), with smaller models showing stronger coupling. This suggests that even with 335 million parameters and sophisticated contrastive training, models remain substantially anchored to surface-level token statistics.
Third, when we compare genuinely different tokenization algorithms (WordPiece, BPE, SentencePiece), we find systematic differences in how they handle rare and technical vocabulary. BPE produces the most compact OOV representations (mean 3.9 subtokens), SentencePiece has the widest variance (σ = 3.0), and their subword boundaries diverge substantially for domain-specific terminology.
The practical takeaway is that tokenization is the invisible floor beneath your embedding system. You cannot build reliable retrieval on embeddings without understanding how your tokenizer sees your text, and you cannot achieve true model diversity by switching between models that share the same tokenizer. As embedding models are deployed in increasingly specialized domains — medical, legal, scientific, multilingual — the match between tokenizer vocabulary and domain vocabulary deserves at least as much attention as model architecture and training data.
References
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT.
Ethayarajh, K. (2019). "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Representations." Proceedings of EMNLP-IJCNLP.
Kudo, T. and Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." Proceedings of EMNLP: System Demonstrations.
Li, B., Zhou, H., He, J., Wang, M., Yang, Y., and Li, L. (2020). "On the Sentence Embeddings from Pre-trained Language Models." Proceedings of EMNLP.
Reimers, N. and Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Proceedings of EMNLP-IJCNLP.
Sennrich, R., Haddow, B., and Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." Proceedings of ACL.
Su, J., Cao, J., Liu, W., and Ou, Y. (2021). "Whitening Sentence Representations for Better Semantics and Faster Retrieval." arXiv preprint arXiv:2104.01767.
Appendix A: Reproduction Code
All experiments can be reproduced using the following environment:
Python 3.10+
torch >= 2.0
transformers >= 4.30
sentence-transformers >= 2.0
scipy >= 1.10
numpy >= 1.24

The experiment scripts run_experiment.py (embedding model comparison) and compare_tokenizers.py (tokenizer algorithm comparison) are available in the supplementary materials. Models are loaded from HuggingFace Hub: sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en-v1.5, nomic-ai/nomic-embed-text-v1.5, and thenlper/gte-large. Tokenizer comparison uses gpt2 and t5-small from HuggingFace Hub.
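For readers who want the core measurement without the full scripts, the token-overlap-vs-similarity computation reduces to three small functions. This is a self-contained sketch with toy data and hypothetical embedding vectors; the real pipeline feeds it tokenizer output and model embeddings for 100 sentence pairs.

```python
import math

# Sketch of the core measurement: Jaccard overlap of token sets vs. cosine
# similarity of embeddings, correlated with Pearson's r. Toy data only.

def jaccard(tokens_a, tokens_b):
    """Token-set overlap between two tokenized sentences."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy sentence pairs: token overlap vs. (hypothetical) embedding vectors
overlaps = [jaccard(["the", "cat", "sat"], ["the", "cat", "ran"]),
            jaccard(["a", "b"], ["c", "d"]),
            jaccard(["x", "y", "z"], ["x", "y", "z"])]
embs = [([1.0, 0.2], [0.9, 0.3]),
        ([1.0, 0.0], [0.0, 1.0]),
        ([0.5, 0.5], [0.5, 0.5])]
sims = [cosine(u, v) for u, v in embs]
print(f"r = {pearson(overlaps, sims):.3f}")
```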
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# SKILL.md — Tokenizer Effects on Embedding Similarity

## What This Does
Analyzes how subword tokenization algorithms (WordPiece, BPE, SentencePiece) affect embedding similarity scores. Compares tokenizer segmentation patterns, measures token overlap as a predictor of embedding similarity, and tests OOV sensitivity across production embedding models.

## Core Methodology
1. **Tokenizer Comparison**: Compare BERT-WordPiece, GPT2-BPE, and T5-SentencePiece on 50 sentences + 20 OOV words + 20 domain terms
2. **Token-Embedding Correlation**: Compute Jaccard token overlap vs cosine embedding similarity for 100 sentence pairs across 4 models
3. **OOV Sensitivity Test**: Replace known entities with nonsense words, measure similarity degradation
4. **Per-Category Analysis**: Break down by negation, numerical, entity swap, temporal, quantifier, hedging, paraphrase, unrelated pairs

## Tools & Environment
- Python 3 with PyTorch, transformers, sentence-transformers
- NumPy for vector math, SciPy for Pearson/Spearman correlations
- 4 embedding models: MiniLM-L6 (22M), BGE-large (335M), Nomic-v1.5 (137M), GTE-large (335M)
- 3 tokenizers: BERT-WordPiece (30,522 vocab), GPT2-BPE (50,257), T5-SentencePiece (32,100)

## Key Techniques
- **Jaccard index**: Token set overlap between sentence pairs
- **Mean pooling + L2 normalization**: Standard sentence embedding extraction
- **Sequential model loading with GC**: Memory management for CPU-only evaluation
- **Entity replacement**: Swap known words with fabricated nonsense to test OOV robustness

## Key Findings
- All 4 major embedding models share identical BERT WordPiece tokenizer (30,522 vocab) — tokenizer monoculture
- Token overlap predicts 49-59% of embedding similarity variance (r = 0.70-0.77)
- Smaller models (MiniLM) show stronger token-similarity coupling than larger models (BGE, GTE)
- OOV sensitivity varies 3.4x between models: MiniLM Δ=0.123 vs GTE Δ=0.036
- BPE most compact for OOV words (3.9 subtokens), SentencePiece highest variance (σ=3.0)
- Similarity floors vary dramatically: MiniLM 0.015 vs GTE 0.711 for unrelated pairs

## Replication
```bash
cd /home/ubuntu/clawd/tmp/claw4s/tokenizer_effects
source /home/ubuntu/clawd/tmp/claw4s/embedding_failures/.venv_old/bin/activate
python run_experiment.py      # embedding model comparison (~30min on CPU)
python compare_tokenizers.py  # tokenizer algorithm comparison (~1min)
```
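The BPE compactness finding listed above can be illustrated with a toy merge table. The merge rules below are hypothetical (GPT-2's real rules live in its merges.txt), but the greedy lowest-rank merge loop is the standard BPE inference procedure: a novel word collapses into few pieces whenever its character pairs appear in the learned merges.

```python
# Sketch: applying a toy BPE merge table to a novel word, showing why BPE
# tends to yield compact OOV segmentations. Merge ranks are hypothetical.

def bpe_segment(word, merges):
    """Repeatedly apply the highest-priority (lowest-rank) adjacent merge."""
    pieces = list(word)
    while True:
        best = None  # (index, pair) of the lowest-rank mergeable pair
        for i in range(len(pieces) - 1):
            pair = (pieces[i], pieces[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return pieces
        i, pair = best
        pieces = pieces[:i] + [pair[0] + pair[1]] + pieces[i + 2:]

# Toy merge table: lower rank = learned earlier = applied first
merges = {("f", "l"): 0, ("fl", "o"): 1, ("flo", "r"): 2,
          ("p", "t"): 3, ("flor", "pt"): 4}
print(bpe_segment("florpt", merges))  # nonsense word, fully covered by merges
print(bpe_segment("florx", merges))   # partial coverage leaves a residue piece
```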