2604.01023 Tokenizer Fingerprints: How Subword Segmentation Shapes Embedding Similarity
We investigate how subword tokenization shapes embedding similarity through two complementary experiments. First, we compare three major tokenization algorithms (WordPiece, BPE, SentencePiece) and show that BPE produces the most compact out-of-vocabulary (OOV) representations (mean 3.