Negation Blindness in Sentence Embeddings: A Systematic Analysis of How Neural Models Process Semantic Reversal
Abstract
Negation is among the most fundamental operations in natural language—a single word like "not" can completely reverse the truth value of a proposition. Yet sentence embedding models, which form the backbone of modern retrieval and semantic search systems, are remarkably blind to negation. We present an empirical analysis of negation processing across four bi-encoder models (MiniLM, BGE-large, Nomic-embed, GTE-large) and four cross-encoder models (STS-B RoBERTa, MS-MARCO MiniLM, BGE-reranker, Quora RoBERTa), evaluated on 15 negation pairs in the bi-encoder evaluation and 55 negation pairs in the cross-encoder evaluation, spanning medical, legal, financial, and general domains. Our results reveal that bi-encoders assign cosine similarities of 0.889–0.941 to sentence pairs with opposite meanings, with 73–100% of negation pairs scoring above the 0.85 threshold commonly used in production retrieval systems. We introduce the token dilution hypothesis to explain this failure: in a sentence of N tokens processed through mean pooling, a single negation token contributes only 1/N of the embedding signal, causing it to be overwhelmed by shared content tokens. We empirically validate this hypothesis by computing predicted dilution bounds for each pair and showing that the mean prediction (0.904) closely tracks observed cosine scores (0.889–0.941). Cross-encoder architectures partially address this blindness—BGE-reranker and Quora RoBERTa correctly assign low similarity to negation pairs—but the MS-MARCO cross-encoder, trained for passage relevance rather than semantic similarity, actually rates negated pairs as highly relevant (mean score 8.27/10). We discuss how these findings extend to alternative pooling strategies ([CLS] token, max pooling) and provide practical mitigation recommendations. 
While prior work has documented negation insensitivity in masked language modeling (Ettinger, 2020; Kassner and Schütze, 2020), we provide the first systematic quantification of this failure in the sentence embedding and retrieval setting, showing that production retrieval systems are vulnerable to returning results with the exact opposite meaning of the query.
1. Introduction
Consider a clinical information system processing the query "The patient has diabetes." A competent retrieval system should not return "The patient does not have diabetes" as a top match. Yet this is precisely what modern sentence embedding models do. In our experiments, the cosine similarity between these two sentences ranges from 0.839 (MiniLM) to 0.947 (GTE-large)—scores that would place the negated sentence among the top results in any nearest-neighbor search.
This failure is not a minor edge case. Negation is one of the most common semantic operations in natural language. In medical records, the distinction between "patient has condition X" and "patient does not have condition X" is literally a matter of life and death. In legal documents, the difference between "the contract is valid" and "the contract is not valid" determines the outcome of cases. In financial reports, "revenue is growing" versus "revenue is not growing" drives investment decisions worth billions.
The fundamental problem is architectural. Modern sentence embedding models (Reimers and Gurevych, 2019) produce fixed-dimensional vector representations through mean pooling over token embeddings from transformer encoders (Devlin et al., 2019). When a sentence of ten tokens is modified by inserting a single "not," the resulting embedding is a weighted average where the negation token contributes approximately one-eleventh of the signal. The nine shared content tokens dominate the representation, producing embeddings that are geometrically close despite being semantically opposite.
This paper makes three contributions. First, we provide a systematic empirical characterization of negation blindness across four bi-encoder architectures and four cross-encoder architectures, demonstrating that the problem is universal among bi-encoders but partially addressable through cross-attention mechanisms. Second, we introduce and quantify the token dilution hypothesis, showing that the severity of negation blindness is mechanistically linked to the ratio of negation tokens to total tokens. Third, we analyze the practical failure rates for production retrieval thresholds, finding that 73–100% of negation pairs would be returned as false positive matches, and discuss mitigation strategies for negation-aware retrieval pipelines.
2. Background and Related Work
2.1 Sentence Embedding Architectures
Modern sentence embedding models follow the bi-encoder paradigm introduced by Sentence-BERT (Reimers and Gurevych, 2019). Each sentence is independently encoded through a transformer model such as BERT (Devlin et al., 2019), and the resulting token embeddings are aggregated—typically through mean pooling—into a single fixed-dimensional vector. Semantic similarity is then computed as the cosine similarity between these vectors.
This architecture enables efficient nearest-neighbor search through approximate methods like HNSW indices, making it practical to search millions of documents in milliseconds. However, the independent encoding of each sentence means that the model cannot directly compare tokens across sentences. The entire semantic comparison is compressed into a single cosine similarity score between two vectors.
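The bi-encoder pipeline described above can be sketched in a few lines. This is an illustrative sketch only: the toy 2-d "token embeddings" stand in for real transformer outputs, and the helper names (`mean_pool`, `cosine`) are ours, not from any specific library.

```python
import math

def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-dimensional sentence vector."""
    n, dim = len(token_embeddings), len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors over their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two "sentences" sharing three near-identical tokens; the second has one
# extra token pointing in an unrelated direction (a stand-in for "not").
sent_a = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
sent_b = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [0.0, 1.0]]
print(cosine(mean_pool(sent_a), mean_pool(sent_b)))  # high, despite the extra token
```

Even in this toy setting, the shared tokens dominate the pooled average and the cosine stays above 0.9, which previews the dilution mechanism analyzed in Section 5.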
Cross-encoder models take a fundamentally different approach: both sentences are concatenated and processed jointly through a single transformer pass, allowing full cross-attention between all tokens. This enables the model to directly compare "has" with "does not have" at the attention level. However, cross-encoders cannot precompute document embeddings: scoring requires a full transformer forward pass for every query-document pair (n passes per query over n documents, with no precomputable index), making them impractical for first-stage retrieval and typically relegating them to re-ranking a small candidate set.
2.2 Subword Tokenization and Information Density
Modern transformer models use subword tokenization schemes such as WordPiece (Devlin et al., 2019) and Byte Pair Encoding (Sennrich et al., 2016). These schemes decompose words into subword units based on frequency in the training corpus. Common words like "not," "no," and "does" are typically represented as single tokens, while rare or complex words may be split into multiple subword units.
This tokenization has implications for mean pooling. When all tokens contribute equally to the pooled representation, a single high-frequency negation token ("not") competes with multiple content tokens that collectively define the topic of the sentence. The information density of the negation token—which must encode a complete semantic reversal—is not proportional to its share of the pooled average.
2.3 Negation in Pretrained Language Models
The failure of transformer-based models to handle negation has been documented in several prior studies. Ettinger (2020) applied psycholinguistic diagnostics to BERT and found "clear insensitivity to the contextual impacts of negation" in a masked language modeling setting—BERT could not distinguish "A robin is a bird" from "A robin is not a bird" when predicting masked tokens. Kassner and Schütze (2020) extended this finding with negated probes, showing that pretrained language models fail to distinguish between "Birds can [MASK]" and "Birds cannot [MASK]," generating identical top predictions for both.
These studies established that negation insensitivity is a fundamental property of BERT-family models at the representation level. Our work extends this line of inquiry to the downstream task of sentence-level semantic similarity and retrieval: we show that the negation blindness documented in masked language modeling manifests as dangerously high cosine similarity scores in production embedding systems, with direct consequences for retrieval accuracy. While Ettinger (2020) and Kassner and Schütze (2020) diagnosed the problem in language modeling, we quantify its severity in the retrieval setting and provide the token dilution hypothesis as a mechanistic explanation specific to mean-pooled bi-encoders.
2.4 The Challenge of Compositional Semantics
Semantic composition—the process by which the meaning of a phrase is constructed from the meanings of its parts—has long been recognized as a challenge for distributional semantics. While word embeddings capture distributional similarity ("you shall know a word by the company it keeps"), they struggle with compositional operators like negation, quantification, and conditionality that modify the truth conditions of propositions without substantially changing the distributional context.
Mean pooling over token embeddings inherits this limitation. The pooled representation captures the "topic" of a sentence (what entities and concepts are discussed) but poorly encodes the "stance" (what is asserted about those entities). This topic-stance dissociation is particularly severe for negation, where the topic is identical but the stance is completely reversed.
3. Experimental Setup
3.1 Test Pairs
We constructed 15 negation pairs spanning two high-stakes clinical domains where negation errors have serious consequences:
Medical domain (10 pairs):
- "The patient has diabetes" → "The patient does not have diabetes"
- "The patient has cancer" → "The patient does not have cancer"
- "The patient is allergic to penicillin" → "The patient is not allergic to penicillin"
- "The tumor is malignant" → "The tumor is not malignant"
- "The patient is pregnant" → "The patient is not pregnant"
- "The surgery was successful" → "The surgery was not successful"
- "The patient is responding to treatment" → "The patient is not responding to treatment"
- "The wound is infected" → "The wound is not infected"
- "Blood pressure is elevated" → "Blood pressure is not elevated"
- "The patient has a history of heart disease" → "The patient has no history of heart disease"
Diagnostic domain (5 pairs):
- "The biopsy results are positive" → "The biopsy results are not positive"
- "The patient is conscious" → "The patient is not conscious"
- "The fracture is displaced" → "The fracture is not displaced"
- "The patient can breathe independently" → "The patient cannot breathe independently"
- "The medication is working" → "The medication is not working"
Each pair is constructed so that the only semantic modification is the insertion of a negation word ("not," "no," or the contraction "cannot"). The propositional content is identical; only the truth value is reversed. This design isolates the model's sensitivity to negation from other confounding factors.
3.2 Comparison Categories
To contextualize negation blindness relative to other semantic failure modes, we evaluated the same models on additional categories from our broader test suite:
- Entity swap (10 pairs): Subject and object are exchanged (e.g., "Google acquired YouTube" → "YouTube acquired Google"). Jaccard token overlap is 1.0 (identical token sets).
- Temporal (10 pairs): Temporal ordering is reversed (e.g., "before the surgery" → "after the surgery"). Mean Jaccard overlap 0.719.
- Numerical (15 pairs): Quantities differ by orders of magnitude (e.g., "5mg" → "500mg"). Mean Jaccard overlap 0.672.
- Quantifier (10 pairs): Quantity expressions are changed (e.g., "all" → "few"). Mean Jaccard overlap 0.543.
- Hedging (5 pairs): Certainty modifiers are altered (e.g., "definitely" → "possibly"). Mean Jaccard overlap 0.348.
- Negative control (15 pairs): Completely unrelated sentences. Mean Jaccard overlap 0.032.
- Positive control (20 pairs): Paraphrases with the same meaning. Mean Jaccard overlap 0.237.
3.3 Bi-Encoder Models
We evaluated four widely-used sentence embedding models:
MiniLM (sentence-transformers/all-MiniLM-L6-v2): A compact 6-layer model distilled for efficiency. WordPiece tokenizer, 30,522 vocabulary. Widely used as a default embedding model.
BGE-large (BAAI/bge-large-en-v1.5): A large encoder from the Beijing Academy of AI. WordPiece tokenizer, 30,522 vocabulary. State-of-the-art on multiple benchmarks.
Nomic-embed (nomic-ai/nomic-embed-text-v1.5): A modern embedding model with SentencePiece tokenization. 30,522 vocabulary. Designed for retrieval augmented generation.
GTE-large (thenlper/gte-large): A large general text embedding model. WordPiece tokenizer, 30,522 vocabulary. Strong performance on MTEB benchmark.
All models share the same vocabulary size (30,522 tokens), enabling direct comparison of tokenization effects. Mean token count across our test sentences is 9.28 tokens (std=2.43), with a range of 6–21 tokens.
3.4 Cross-Encoder Models
We evaluated four cross-encoder models with different training objectives:
STS-B RoBERTa (cross-encoder/stsb-roberta-large): Trained on Semantic Textual Similarity Benchmark data. Outputs similarity scores in [0, 1].
MS-MARCO MiniLM (cross-encoder/ms-marco-MiniLM-L-12-v2): Trained on MS-MARCO passage ranking data. Outputs relevance scores (unbounded positive range).
BGE-reranker (BAAI/bge-reranker-large): Trained for document re-ranking. Outputs relevance scores in [0, 1].
Quora RoBERTa (cross-encoder/quora-roberta-large): Trained on Quora question pair detection. Outputs duplicate probability in [0, 1].
3.5 Metrics
Cosine similarity is the standard metric for bi-encoders, computed as the dot product of L2-normalized embedding vectors. Values range from -1 to 1, with higher values indicating greater similarity.
Jaccard token overlap measures the proportion of shared tokens between two sentences: |A ∩ B| / |A ∪ B|, where A and B are the sets of tokens in each sentence. This quantifies how much surface-level content is shared, independent of the model.
False positive rate at threshold τ measures the proportion of negation pairs with similarity exceeding τ, simulating production retrieval thresholds. We report rates at τ = 0.85 (common production threshold) and τ = 0.90 (high-precision threshold).
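Both model-independent metrics are straightforward to state in code. This minimal sketch uses whitespace word tokenization as a stand-in for the models' subword tokenizers; the function names are ours.

```python
def jaccard(sent_a: str, sent_b: str) -> float:
    """Jaccard token overlap: |A ∩ B| / |A ∪ B| over lowercased word sets."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(a & b) / len(a | b)

def false_positive_rate(similarities: list[float], tau: float) -> float:
    """Fraction of pairs whose similarity exceeds the retrieval threshold tau."""
    return sum(s > tau for s in similarities) / len(similarities)

print(jaccard("The surgery was successful",
              "The surgery was not successful"))          # 4 shared / 5 union = 0.8
print(false_positive_rate([0.92, 0.88, 0.72], tau=0.85))  # 2 of 3 exceed 0.85
```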
4. Results
4.1 Bi-Encoder Negation Scores
Table 1 presents the aggregate negation statistics for each bi-encoder model.
Table 1: Bi-Encoder Cosine Similarity on Negation Pairs
| Model | Mean | Std | Min | Max | >0.85 | >0.90 |
|---|---|---|---|---|---|---|
| MiniLM | 0.889 | 0.064 | 0.724 | 0.962 | 73.3% | 46.7% |
| BGE-large | 0.921 | 0.017 | 0.898 | 0.955 | 100% | 86.7% |
| Nomic-embed | 0.931 | 0.022 | 0.895 | 0.969 | 100% | 93.3% |
| GTE-large | 0.941 | 0.016 | 0.914 | 0.969 | 100% | 100% |
The results are striking. Every model assigns dangerously high similarity scores to sentence pairs with opposite meanings. GTE-large is the worst offender: all 15 negation pairs score above 0.90, with a mean of 0.941 and a minimum of 0.914. Even MiniLM, the smallest model, has a mean negation score of 0.889 with 73.3% of pairs exceeding the 0.85 production threshold.
Table 2 presents the individual pair-level cosine scores across all four bi-encoder models.
Table 2: Individual Negation Pair Cosine Similarities
| # | Pair (abbreviated) | MiniLM | BGE | Nomic | GTE |
|---|---|---|---|---|---|
| 1 | has/doesn't have diabetes | 0.839 | 0.918 | 0.911 | 0.947 |
| 2 | has/doesn't have cancer | 0.839 | 0.900 | 0.895 | 0.930 |
| 3 | allergic/not allergic penicillin | 0.947 | 0.928 | 0.947 | 0.962 |
| 4 | tumor malignant/not malignant | 0.895 | 0.918 | 0.952 | 0.947 |
| 5 | pregnant/not pregnant | 0.884 | 0.901 | 0.932 | 0.939 |
| 6 | surgery successful/not successful | 0.864 | 0.898 | 0.901 | 0.918 |
| 7 | responding/not responding treatment | 0.853 | 0.932 | 0.929 | 0.950 |
| 8 | wound infected/not infected | 0.919 | 0.915 | 0.924 | 0.937 |
| 9 | BP elevated/not elevated | 0.927 | 0.900 | 0.939 | 0.924 |
| 10 | history/no history heart disease | 0.724 | 0.904 | 0.904 | 0.914 |
| 11 | biopsy positive/not positive | 0.962 | 0.940 | 0.953 | 0.942 |
| 12 | conscious/not conscious | 0.950 | 0.955 | 0.955 | 0.969 |
| 13 | fracture displaced/not displaced | 0.958 | 0.925 | 0.941 | 0.949 |
| 14 | can/cannot breathe independently | 0.952 | 0.945 | 0.969 | 0.962 |
| 15 | medication working/not working | 0.821 | 0.929 | 0.908 | 0.926 |
Several patterns emerge. First, the larger and more recent models (GTE-large, Nomic-embed) are actually worse at negation than the smaller MiniLM. This is counterintuitive but consistent with the hypothesis that better content encoding amplifies the very signal that drowns out negation. Second, MiniLM shows the most variance (std=0.064), with one notably low pair (#10, "history/no history of heart disease," cosine=0.724). This pair is also the longest, suggesting that longer sentences may slightly dilute the shared content signal alongside the negation signal. Third, the highest-scoring pairs across all models tend to be shorter sentences where content tokens have even less room to differentiate from the negated version.
4.2 Category Comparison
How does negation blindness compare to other semantic failure modes? Table 3 shows the mean cosine similarity across all categories for each bi-encoder model.
Table 3: Mean Cosine Similarity by Category Across Bi-Encoder Models
| Category | Jaccard | MiniLM | BGE | Nomic | GTE |
|---|---|---|---|---|---|
| Entity swap | 1.000 | 0.987 | 0.993 | 0.988 | 0.992 |
| Temporal | 0.719 | 0.965 | 0.956 | 0.962 | 0.972 |
| Negation | 0.756 | 0.889 | 0.921 | 0.931 | 0.941 |
| Numerical | 0.672 | 0.882 | 0.945 | 0.929 | 0.954 |
| Quantifier | 0.543 | 0.819 | 0.893 | 0.879 | 0.922 |
| Hedging | 0.348 | 0.813 | 0.885 | 0.858 | 0.926 |
| Positive ctrl | 0.237 | 0.765 | 0.931 | 0.875 | 0.946 |
| Negative ctrl | 0.032 | 0.015 | 0.599 | 0.470 | 0.711 |
Negation is not the absolute worst failure mode—entity swap and temporal reversal score even higher. However, negation is arguably the most dangerous failure mode for two reasons:
First, negation pairs have the most extreme mismatch between surface similarity and semantic distance. Entity swaps have Jaccard overlap of 1.0 (identical token sets, just reordered), so high cosine similarity is at least partially "explained" by identical content. Temporal pairs share many tokens. But negation pairs have a Jaccard overlap of only 0.756—lower than entity swaps—yet still achieve cosine similarities above 0.89. The model is not merely confused by identical tokens; it is specifically failing to process the semantic impact of the negation operator.
Second, negation involves a complete semantic reversal—not a partial shift. Temporal pairs ("before surgery" vs. "after surgery") describe different temporal orderings but may share truth values in context. Numerical differences ("5mg" vs. "500mg") represent scalar changes. But "has diabetes" vs. "does not have diabetes" is a binary opposition. No intermediate interpretation exists.
4.3 Token Overlap Analysis
The Jaccard token overlap for negation pairs averages 0.756, which is notably high. This high overlap is a direct consequence of the minimal surface modification required for negation. Consider pair #6: "The surgery was successful" (4 tokens) vs. "The surgery was not successful" (5 tokens). The Jaccard overlap is 4/5 = 0.8 because only one token ("not") differs.
We computed the theoretical minimum cosine similarity that would result from perfect content encoding with zero negation sensitivity. If the model assigns identical embeddings to all shared tokens and a random embedding to "not," the expected cosine similarity can be approximated as:
cos(θ) ≈ n_shared / √(n_shared × (n_shared + 1))
For a typical negation pair with 8 shared tokens out of 9 total, this yields cos(θ) ≈ 8/√(8 × 9) ≈ 0.943—remarkably close to the observed GTE-large mean of 0.941. This suggests that the models are essentially treating the negation token as a random perturbation rather than a meaningful semantic operator.
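This approximation can be packaged as a small helper for estimating failure severity from token counts alone. The formula assumes shared tokens contribute identical vectors and the negation token is orthogonal to them; the function name is ours.

```python
import math

def dilution_bound(n_shared: int, n_total: int) -> float:
    """Predicted cosine between a sentence and its negation under pure
    token dilution: n_shared / sqrt(n_shared * n_total)."""
    return n_shared / math.sqrt(n_shared * n_total)

print(round(dilution_bound(8, 9), 3))  # 0.943, the worked example above
print(round(dilution_bound(4, 6), 3))  # 0.816, the shortest pairs in Table 7
```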
Table 7: Token Dilution Prediction vs. Observed Cosine Similarity
| # | Words (orig → neg) | Dilution Pred | MiniLM | BGE | Nomic | GTE |
|---|---|---|---|---|---|---|
| 1 | 4 → 6 | 0.816 | 0.839 | 0.918 | 0.911 | 0.947 |
| 2 | 4 → 6 | 0.816 | 0.839 | 0.900 | 0.895 | 0.930 |
| 3 | 6 → 7 | 0.926 | 0.947 | 0.928 | 0.947 | 0.962 |
| 4 | 4 → 5 | 0.894 | 0.895 | 0.918 | 0.952 | 0.947 |
| 5 | 4 → 5 | 0.894 | 0.884 | 0.901 | 0.932 | 0.939 |
| 6 | 4 → 5 | 0.894 | 0.864 | 0.898 | 0.901 | 0.918 |
| 7 | 6 → 7 | 0.926 | 0.853 | 0.932 | 0.929 | 0.950 |
| 8 | 4 → 5 | 0.894 | 0.919 | 0.915 | 0.924 | 0.937 |
| 9 | 4 → 5 | 0.894 | 0.927 | 0.900 | 0.939 | 0.924 |
| 10 | 8 → 8 | 1.000 | 0.724 | 0.904 | 0.904 | 0.914 |
| 11 | 5 → 6 | 0.913 | 0.962 | 0.940 | 0.953 | 0.942 |
| 12 | 4 → 5 | 0.894 | 0.950 | 0.955 | 0.955 | 0.969 |
| 13 | 4 → 5 | 0.894 | 0.958 | 0.925 | 0.941 | 0.949 |
| 14 | 5 → 5 | 1.000 | 0.952 | 0.945 | 0.969 | 0.962 |
| 15 | 4 → 5 | 0.894 | 0.821 | 0.929 | 0.908 | 0.926 |
| Mean | | 0.904 | 0.889 | 0.921 | 0.931 | 0.941 |
The dilution prediction provides a lower bound for smaller models (MiniLM: 0.889 observed vs. 0.904 predicted) and an approximate match for larger models. Pair #10 is an outlier: the dilution formula predicts 1.000 because "has a history" → "has no history" replaces "a" with "no" without changing word count, yet MiniLM scores only 0.724. This outlier suggests that in longer sentences, additional factors beyond simple token dilution (such as the specific semantics of the replaced token) influence the outcome.
The relationship between sentence length and negation sensitivity provides further evidence. For MiniLM, the longest negation pair (#10, "The patient has a history of heart disease" / "The patient has no history of heart disease," 8 words vs. 8 words) has the lowest cosine at 0.724, while shorter pairs like #11 ("The biopsy results are positive" / "The biopsy results are not positive," 5 vs. 6 words) score 0.962. Longer sentences provide more content tokens to dilute the negation signal, but they also provide more unique content to differentiate—the net effect depends on the specific model.
4.4 Cross-Encoder Results
Table 4 presents the cross-encoder performance on negation pairs, using raw model outputs. Note that different models use different output scales, so absolute values are not directly comparable across models; what matters is the relative scoring of negation pairs versus positive controls and negative controls.
Table 4: Cross-Encoder Scores on Negation Pairs (First 15 Pairs)
| Model | Output Scale | Negation Mean | Negation Std | Interpretation |
|---|---|---|---|---|
| STS-B RoBERTa | [0, 1] | 0.492 | 0.037 | Moderate similarity |
| MS-MARCO MiniLM | [0, ~10] | 8.274 | 0.574 | High relevance |
| BGE-reranker | [0, 1] | 0.058 | 0.069 | Low similarity ✓ |
| Quora RoBERTa | [0, 1] | 0.012 | 0.005 | Near-zero similarity ✓ |
The results reveal a dramatic split among cross-encoder models:
BGE-reranker and Quora RoBERTa correctly handle negation. Quora RoBERTa assigns near-zero duplicate probability (mean 0.012) to all negation pairs, indicating strong negation sensitivity. BGE-reranker also assigns low relevance (mean 0.058), with most pairs scoring below 0.10.
STS-B RoBERTa partially handles negation. The mean score of 0.492 on a [0,1] scale suggests the model recognizes that negation pairs are not identical but does not treat them as fully dissimilar. On the positive side, none of the negation pairs exceed 0.6 similarity. On the negative side, scores around 0.5 would still appear in retrieval results if the threshold is set permissively.
MS-MARCO MiniLM is actively harmful for negation. With a mean relevance score of 8.274 on a scale where 10 represents maximum relevance, this model considers negated sentences to be highly relevant to their originals. This is not merely failing to detect negation—it is correctly identifying topical relevance while ignoring semantic polarity. For passage retrieval tasks, this may be intentional (a passage about "not having diabetes" may indeed be relevant to a query about "having diabetes"), but for semantic similarity it is disastrous.
Table 5: Cross-Encoder Category Comparison (All 55 Negation Pairs)
| Category | STS-B | MS-MARCO | BGE-reranker | Quora |
|---|---|---|---|---|
| Negation | 0.491 | 8.210 | 0.073 | 0.020 |
| Numerical | 0.454 | 5.831 | 0.114 | 0.018 |
| Entity swap | 0.837 | 8.999 | 0.398 | 0.037 |
| Temporal | 0.668 | 8.362 | 0.073 | 0.038 |
| Quantifier | 0.563 | 6.621 | 0.281 | 0.168 |
| Hedging | 0.652 | 2.384 | 0.883 | 0.514 |
Notably, hedging remains the most challenging category even for the best-performing cross-encoders (BGE-reranker: 0.883, Quora: 0.514). This suggests that hedging modifications ("definitely X" vs. "possibly X") are genuinely harder to distinguish than negation, possibly because they involve scalar rather than binary semantic changes.
5. Analysis: The Token Dilution Hypothesis
5.1 Mechanism
We propose the token dilution hypothesis as the primary explanation for negation blindness in bi-encoder models. The hypothesis states:
In mean-pooled sentence embeddings, the contribution of a negation token to the final representation is proportional to 1/N, where N is the total number of tokens. For typical sentences (N ≈ 8-12), this contribution is too small to overcome the similarity signal from the N-1 shared content tokens.
We emphasize that this is not a novel mathematical insight—it follows directly from the definition of arithmetic averaging. Rather, our contribution is the explicit quantification of how this well-known property creates a specific, measurable failure mode in production retrieval systems: mean pooling creates a geometric lower bound on cosine similarity between a sentence and its negation that exceeds standard production thresholds. The value of the formalization is predictive: given a sentence length, we can estimate the minimum cosine similarity a negation pair will receive, enabling system designers to anticipate failure rates.
The mathematical formulation is straightforward. Let a sentence S consist of tokens [t₁, t₂, ..., tₙ], and its negated counterpart S' consist of [t₁, t₂, ..., tₖ₋₁, not, tₖ, ..., tₙ]. The mean-pooled embedding of S is:
e(S) = (1/N) × Σᵢ h(tᵢ)
And the embedding of S' is:
e(S') = (1/(N+1)) × [Σᵢ h(tᵢ) + h(not)]
The cosine similarity between e(S) and e(S') is dominated by the shared terms Σᵢ h(tᵢ), with h(not) contributing a fractional perturbation.
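A concrete construction makes this dominance explicit. If we idealize the eight shared tokens as orthonormal basis vectors and give "not" the ninth, orthogonal direction (an idealization, not trained embeddings), mean pooling reproduces exactly the 8/√(8×9) ≈ 0.943 bound from Section 4.3:

```python
import math

def mean_pool(vectors):
    """Average token vectors into one sentence vector."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Eight shared content tokens as orthonormal basis vectors in R^9;
# "not" occupies the ninth, orthogonal direction.
dim = 9
basis = [[1.0 if d == i else 0.0 for d in range(dim)] for i in range(dim)]
shared, not_tok = basis[:8], basis[8]

e_s = mean_pool(shared)                  # e(S)
e_s_neg = mean_pool(shared + [not_tok])  # e(S')
print(round(cosine(e_s, e_s_neg), 3))    # 0.943
```

The shared sum Σᵢ h(tᵢ) carries all of the dot product; h(not) only inflates the norm of e(S'), which is why the cosine stays high.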
5.2 Evidence from Pair-Level Analysis
Our data supports the token dilution hypothesis through several converging lines of evidence:
Empirical validation of the dilution bound. We computed the predicted dilution bound for each of the 15 negation pairs using the formula cos(θ) ≈ N_a / √(N_a × N_b), where N_a is the word count of the original sentence and N_b is the word count of the negated sentence. The mean prediction across all 15 pairs is 0.904, which closely tracks the observed mean cosine scores: MiniLM (0.889), BGE (0.921), Nomic (0.931), and GTE (0.941). Individual pair predictions range from 0.816 (for 4-word vs. 6-word pairs like "The patient has diabetes" / "The patient does not have diabetes") to 1.000 (for pairs where the negation is a contraction, e.g., "can" → "cannot," preserving word count).
The fact that MiniLM scores slightly below the dilution prediction (0.889 vs. 0.904) while GTE-large scores above it (0.941 vs. 0.904) is informative. The dilution bound assumes that shared tokens contribute identical embeddings; in practice, contextualized embeddings shift based on surrounding context. Larger models produce more context-sensitive representations, which ironically means the shared content tokens are even more consistently similar across the pair, pushing cosine similarity above the naive dilution prediction.
We note that the pair-level correlation between predicted and observed cosine is modest (Pearson r = -0.02 to 0.40 depending on model). This is expected: with only 4 distinct word-count configurations among 15 pairs, the within-group variance is driven by pair-specific content rather than length. The key finding is that the mean level of the prediction closely matches observations, validating the dilution mechanism at the aggregate level.
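The 0.904 mean prediction can be reproduced directly from the word counts in Table 7, applying the dilution formula N_a / √(N_a × N_b) to each pair:

```python
import math

# (original word count, negated word count) for the 15 pairs in Table 7
word_counts = [(4, 6), (4, 6), (6, 7), (4, 5), (4, 5), (4, 5), (6, 7),
               (4, 5), (4, 5), (8, 8), (5, 6), (4, 5), (4, 5), (5, 5), (4, 5)]

predictions = [a / math.sqrt(a * b) for a, b in word_counts]
print(round(sum(predictions) / len(predictions), 3))  # 0.904
```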
Sentence length effects. Across all models, shorter negation pairs tend to have higher cosine similarity, consistent with the prediction that fewer total tokens means less relative contribution from the negation token. However, the effect is noisy because very short sentences also have less content diversity.
Model size paradox. Larger, more powerful models show higher negation cosine scores (GTE-large: 0.941 vs. MiniLM: 0.889). This is consistent with token dilution: better models produce richer content embeddings, which means the shared content tokens carry even more similarity signal that the single negation token cannot overcome.
Comparison with entity swaps. Entity swap pairs have perfect token overlap (Jaccard = 1.0) and the highest cosine scores (0.987–0.993). Negation pairs have lower token overlap (0.756) and correspondingly lower (but still dangerously high) cosine scores. The correlation between token overlap and cosine similarity (Pearson r = 0.77–0.83 across models) is consistent with content tokens driving the similarity signal.
5.3 Implications for Alternative Pooling Strategies
Our analysis focuses on mean pooling because it is the default and most common aggregation strategy in production sentence embedding models. However, it is important to consider how alternative pooling strategies would interact with the token dilution mechanism.
[CLS] token pooling. Some models (including the original BERT configuration) use the [CLS] token representation as the sentence embedding rather than mean pooling. The [CLS] token attends to all other tokens via self-attention, so in principle it could learn to weight the negation token heavily. However, [CLS] pooling has its own failure mode: the representation is determined by a single token position that must encode the entire sentence meaning, and pre-training objectives (masked language modeling, next sentence prediction) do not explicitly train the [CLS] token to be sensitive to negation. Empirically, Reimers and Gurevych (2019) found that [CLS] pooling produces worse sentence representations than mean pooling for most similarity tasks. Without negation-specific training signals, [CLS] pooling is unlikely to solve negation blindness—it would merely relocate the failure from the pooling layer to the attention mechanism.
Max pooling. Max pooling selects the maximum activation across all token positions for each embedding dimension. This could theoretically help if the negation token produces distinctive maximum activations in certain dimensions. However, max pooling is dominated by the most salient tokens, which are typically content-bearing nouns and verbs rather than function words like "not." In practice, max pooling tends to capture topic-level information even more aggressively than mean pooling, potentially worsening negation blindness.
Attention-weighted pooling. The most promising alternative is learned attention-weighted pooling, where the model learns to assign different importance weights to different tokens. If the attention mechanism learns that "not" is semantically critical, it could assign disproportionate weight to the negation token. However, this requires the training data to include sufficient negation examples as hard negatives—returning to the fundamental issue that the training signal must explicitly reward negation sensitivity.
The key insight is that the pooling strategy is a necessary but not sufficient condition for negation sensitivity. Without training signals that penalize high similarity between "X" and "not X," no pooling strategy can solve the problem from architecture alone. This is consistent with our cross-encoder findings: the training objective (duplicate detection vs. relevance ranking) matters as much as the architecture (bi-encoder vs. cross-encoder).
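The same orthonormal-token idealization used for the dilution bound (an illustration, not trained embeddings) shows why swapping the pooling operator alone does not help: under max pooling, the negated sentence's vector differs from the original in only one of nine dimensions, and the cosine lands at the same 0.943.

```python
import math

def max_pool(vectors):
    """Per-dimension maximum across token vectors."""
    return [max(v[d] for v in vectors) for d in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Eight shared tokens as orthonormal basis vectors in R^9; "not" is the ninth.
dim = 9
basis = [[1.0 if d == i else 0.0 for d in range(dim)] for i in range(dim)]
shared, not_tok = basis[:8], basis[8]

print(round(cosine(max_pool(shared), max_pool(shared + [not_tok])), 3))  # 0.943
```

In this toy setup the shared tokens fix eight of the nine pooled dimensions regardless of the operator, so the geometry of the failure is unchanged; only a learned weighting (or a negation-aware training signal) can break it.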
5.4 Why Cross-Attention Helps
Cross-encoder models process both sentences jointly, allowing direct token-level comparisons through attention. When the model encounters "is" in sentence A aligned with "is not" in sentence B, the attention mechanism can compute the mismatch directly; the signal is never averaged away by a pooling step.
This explains the performance hierarchy among cross-encoders:
- Quora RoBERTa (duplicate detection training): Explicitly trained to distinguish near-duplicate questions, including negated variants. The training signal directly rewards negation sensitivity.
- BGE-reranker (re-ranking training): Trained to re-rank passages, with exposure to hard negatives that likely include negated passages.
- STS-B RoBERTa (similarity regression): Trained on human similarity judgments that may not systematically include negation pairs, leading to weaker negation sensitivity.
- MS-MARCO (relevance training): Trained for topical relevance, where "The patient does not have diabetes" is indeed topically relevant to "The patient has diabetes"—just not semantically equivalent.
5.5 Negation as a Unique Failure Mode
Negation is qualitatively different from other failure modes in sentence embeddings. Consider the taxonomy:
- Entity swap: Changes who does what to whom. Requires understanding argument structure. Token overlap is 100%.
- Temporal reversal: Changes when things happen. Requires understanding temporal ordering. Token overlap ~72%.
- Numerical change: Changes how much. Requires numerical reasoning. Token overlap ~67%.
- Quantifier change: Changes how many. Requires scope understanding. Token overlap ~54%.
- Hedging: Changes how certain. Requires pragmatic understanding. Token overlap ~35%.
- Negation: Changes whether something is true at all. Requires understanding of a single function word. Token overlap ~76%.
Negation stands out because: (1) it involves the most extreme semantic change (complete reversal of truth value), (2) it requires understanding a single high-frequency function word, and (3) it has the highest token overlap of any category that involves a genuine meaning change (excluding entity swap, where word order changes but tokens don't). This combination makes it the most insidious failure mode—maximum semantic impact with minimal surface signal.
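The overlap figures above can be reproduced with a simple Jaccard computation. Note this sketch uses naive whitespace tokenization, whereas the models use subword tokenizers, so exact values differ slightly from the paper's reported means; the example pair below is representative, not drawn from the test set.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# A single negation word leaves almost the entire token set intact
# while reversing the truth value of the sentence.
pos = "the patient has a history of heart disease"
neg = "the patient has no history of heart disease"
print(f"negation pair overlap: {jaccard(pos, neg):.3f}")
```

Here 7 of 9 unique words are shared (overlap ≈ 0.78), close to the ~0.76 mean reported for the negation category.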
6. Practical Implications
6.1 False Positive Rates in Production Systems
Production retrieval systems typically use cosine similarity thresholds between 0.80 and 0.90 to filter candidates. Table 6 shows the false positive rates—the proportion of negation pairs that would pass these thresholds and be returned to users as matches.
Table 6: Negation False Positive Rates at Production Thresholds
| Threshold | MiniLM | BGE | Nomic | GTE |
|---|---|---|---|---|
| ≥ 0.80 | 86.7% | 100% | 100% | 100% |
| ≥ 0.85 | 73.3% | 100% | 100% | 100% |
| ≥ 0.90 | 46.7% | 86.7% | 93.3% | 100% |
| ≥ 0.95 | 13.3% | 6.7% | 26.7% | 13.3% |
At the common 0.85 threshold, three of four models return 100% of negation pairs as false matches. Even at the stringent 0.95 threshold, 6.7–26.7% of negation pairs still pass. No practical threshold eliminates negation false positives without also eliminating legitimate matches.
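The false positive rates in Table 6 follow from a straightforward threshold count. The sketch below uses hypothetical per-pair scores (the actual per-pair values are in the supplementary materials), but the computation is the one used throughout Section 6.1.

```python
# Hypothetical cosine scores for 15 negation pairs (illustrative only,
# not the actual experimental data).
scores = [0.72, 0.84, 0.88, 0.89, 0.90, 0.91, 0.92, 0.92,
          0.93, 0.93, 0.94, 0.94, 0.95, 0.96, 0.96]

def false_positive_rate(scores, threshold):
    """Fraction of negation pairs that would pass the similarity filter."""
    return sum(s >= threshold for s in scores) / len(scores)

for t in (0.80, 0.85, 0.90, 0.95):
    print(f"threshold {t:.2f}: {false_positive_rate(scores, t):.1%} false positives")
```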
6.2 Domain-Specific Risk Assessment
Medical information retrieval. A search for "patient is allergic to penicillin" returning "patient is not allergic to penicillin" could lead to a life-threatening drug administration. Our data shows this specific pair scores 0.947 (MiniLM), 0.928 (BGE), 0.947 (Nomic), and 0.962 (GTE).
Legal document search. A search for "the contract is valid" returning "the contract is not valid" could lead to catastrophic legal errors. The broader cross-encoder data shows that even the moderately effective STS-B model assigns 0.434 similarity to "the contract is valid" vs. "the contract is not valid"—non-trivial even if below most thresholds.
Financial analysis. A search for "revenue is growing" returning "revenue is not growing" could drive incorrect investment decisions. Negation blindness in automated financial analysis pipelines represents a systematic source of error.
6.3 Mitigation Strategies
Based on our findings, we propose a hierarchy of mitigation strategies:
Strategy 1: Cross-encoder re-ranking. The most immediately practical mitigation is to use a negation-sensitive cross-encoder (BGE-reranker or Quora RoBERTa) as a re-ranking stage after bi-encoder retrieval. This adds computational cost proportional to the number of candidates but effectively eliminates negation false positives. Our data shows BGE-reranker reduces mean negation similarity to 0.058, well below any practical threshold.
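A minimal sketch of the two-stage flow might look like the following. The scorer is injected as a callable so the skeleton stays model-agnostic; in production it would be something like the `predict` method of a sentence-transformers `CrossEncoder`, while the stub used here simply mimics a negation-sensitive model.

```python
from typing import Callable, Sequence

def rerank(query: str,
           candidates: Sequence[str],
           scorer: Callable[[Sequence[tuple[str, str]]], Sequence[float]],
           threshold: float = 0.5) -> list[str]:
    """Second stage: keep bi-encoder candidates only if the cross-encoder
    scorer rates the (query, candidate) pair above the threshold."""
    scores = scorer([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, s in ranked if s >= threshold]

# Stub scorer standing in for a negation-sensitive cross-encoder: it
# returns a near-zero score when exactly one side of the pair contains "not".
def stub_scorer(pairs):
    return [0.05 if (("not" in q.split()) != ("not" in c.split())) else 0.9
            for q, c in pairs]

hits = rerank("the patient has diabetes",
              ["the patient has diabetes mellitus",
               "the patient does not have diabetes"],
              stub_scorer)
print(hits)
```

The bi-encoder would have returned both candidates at high similarity; the re-ranking stage drops the negated one.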
Strategy 2: Negation-aware preprocessing. Before computing similarity, detect negation cues ("not," "no," "never," "n't," contractions) in both query and candidate. If one sentence contains negation of a key predicate and the other does not, flag the pair for manual review or automatic demotion. This heuristic approach is computationally cheap but requires careful handling of double negation and scope ambiguity.
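The preprocessing heuristic can be sketched in a few lines. The cue list below is a starting point, not exhaustive, and, as noted above, this simple check deliberately ignores double negation and scope ambiguity.

```python
import re

# Surface negation cues plus the "n't" contraction suffix. Double negation
# and scope ambiguity are deliberately not handled by this heuristic.
NEGATION_CUES = {"not", "no", "never", "none", "cannot"}

def has_negation(sentence: str) -> bool:
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(t in NEGATION_CUES or t.endswith("n't") for t in tokens)

def flag_pair(a: str, b: str) -> bool:
    """Flag for review/demotion when exactly one sentence carries a cue."""
    return has_negation(a) != has_negation(b)

print(flag_pair("The contract is valid", "The contract is not valid"))
print(flag_pair("Revenue is growing", "Revenue grew last quarter"))
```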
Strategy 3: Negation-augmented training data. Include explicit negation pairs as hard negatives during contrastive training. If the training loss penalizes high similarity between "X" and "not X," the model should learn to produce embeddings where the negation token has a disproportionate impact. This requires retraining but addresses the root cause.
Strategy 4: Modified pooling. Replace mean pooling with an attention-weighted pooling mechanism that assigns higher weight to semantically important tokens. If the model learns that function words like "not" have high semantic importance despite low frequency, the token dilution effect could be mitigated. This is architecturally simple but empirically unproven for negation specifically.
7. Extended Analysis: Token Overlap and Model Behavior
7.1 The Correlation Between Surface Overlap and Semantic Similarity
A key finding from our broader dataset is the strong correlation between Jaccard token overlap and cosine similarity. Across all 100 pairs and all four bi-encoder models, the Pearson correlation ranges from 0.703 (BGE) to 0.766 (MiniLM), and the Spearman correlation ranges from 0.663 (BGE) to 0.832 (MiniLM).
This correlation is both expected and concerning. It is expected because shared tokens produce shared embeddings, which increase cosine similarity through mean pooling. It is concerning because it implies that the models are substantially relying on surface-level token matching rather than deep semantic understanding.
For negation pairs specifically, the high Jaccard overlap (0.756) directly drives the high cosine similarity. The negation token represents a fractional change to the token set but a total change to the meaning. The correlation analysis suggests that the models are approximately computing a soft version of token overlap rather than a genuine semantic similarity.
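The correlation analysis itself is standard; a self-contained sketch (on toy overlap/cosine values, not the paper's 100 pairs) using only NumPy might look like this. Spearman is computed as Pearson on rank-transformed data, which is valid here because the toy values contain no ties.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Spearman = Pearson on ranks (no ties in this toy data set).
    def rank(v):
        return np.argsort(np.argsort(np.asarray(v)))
    return pearson(rank(x), rank(y))

# Toy data: higher token overlap tends to come with higher cosine similarity.
overlap = [0.20, 0.35, 0.50, 0.62, 0.76, 0.80]
cosine  = [0.41, 0.55, 0.60, 0.78, 0.90, 0.88]
print(f"Pearson:  {pearson(overlap, cosine):.3f}")
print(f"Spearman: {spearman(overlap, cosine):.3f}")
```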
7.2 Model-Specific Patterns
MiniLM shows the most variance in negation scores (std=0.064), with scores ranging from 0.724 to 0.962. This suggests that the smaller model has less consistent representational capacity, leading to more pair-specific variation. The lowest-scoring pair (#10, "patient has/has no history of heart disease") is also the longest, consistent with dilution effects working in both directions—diluting both the negation signal and the content similarity signal.
BGE-large shows remarkably tight negation scores (std=0.017, range 0.898–0.955). This consistency suggests that the model has a very stable representation of negation that uniformly fails to capture its semantic impact. All 15 pairs score above 0.85, and 13 of 15 score above 0.90.
Nomic-embed is similar to BGE but slightly higher (mean 0.931 vs. 0.921). Despite using a different tokenization scheme (SentencePiece vs. WordPiece), the negation blindness is equally severe, suggesting that the failure is not tokenizer-specific but rather a property of the mean-pooling architecture.
GTE-large is the most severe case: mean 0.941, all 15 pairs above 0.91, with a minimum of 0.914. This model would return every single negation pair as a near-perfect match in any production retrieval system.
7.3 Negation Versus Positive Controls
An illuminating comparison is between negation pairs (opposite meaning) and positive control pairs (same meaning, different wording). For GTE-large, the mean negation cosine (0.941) is actually close to the mean positive control cosine (0.946). The model assigns nearly identical similarity to sentences with opposite meaning as it does to sentences with the same meaning. For MiniLM, the gap is larger (negation: 0.889 vs. positive: 0.765), but this is partly because MiniLM assigns lower scores overall.
This comparison highlights the severity of the problem. An ideal model should assign high similarity (~1.0) to positive controls and low similarity (~0.0) to negation pairs. Instead, the models assign nearly identical scores to both categories, making it impossible to distinguish meaning-preserving from meaning-reversing transformations based on embedding similarity alone.
8. Broader Context: Why This Matters for RAG and Agent Systems
8.1 Retrieval-Augmented Generation
RAG systems retrieve relevant context to ground language model generation. If the retrieval step returns negated passages, the generation step may produce factually incorrect outputs. A query "Does the patient have diabetes?" might retrieve a passage stating "The patient does not have diabetes" at high similarity, leading the generator to confidently assert that the patient has diabetes based on a passage that says the opposite.
The danger is amplified because users of RAG systems trust the outputs more than ungrounded generation, precisely because the system is "retrieval-augmented." The retrieval provides a false sense of factual grounding when the retrieved passages have the opposite meaning from the query.
8.2 Agentic Systems and Tool Use
AI agents increasingly use semantic similarity for memory retrieval, tool selection, and experience recall. An agent that stores "This approach did not work" in its memory, then retrieves it as similar to "This approach worked," will repeat failed strategies. Negation blindness in agent memory systems creates a systematic bias toward repeating errors rather than learning from them.
8.3 Implications for Evaluation Benchmarks
Our findings raise questions about existing embedding evaluation benchmarks. If benchmarks do not systematically test negation sensitivity, models can achieve high aggregate scores while being fundamentally blind to semantic reversal. We advocate for negation-specific evaluation metrics to be included in standard benchmarks like MTEB.
9. Limitations
Several limitations of our study should be acknowledged.
Sample size and scope. Our bi-encoder analysis is based on 15 negation pairs, while the cross-encoder evaluation includes 55 negation pairs—a total of 70 unique negation examples. While this is sufficient to identify the phenomenon (the effect size is enormous, with all 60 bi-encoder model-pair combinations exceeding 0.72 cosine), larger datasets with hundreds of negation examples across more diverse syntactic structures would strengthen the generalizability claims. We deliberately prioritize depth of analysis (4 bi-encoders × 4 cross-encoders × multiple metrics) over breadth of test cases.
Negation types. Our pairs primarily use simple sentential negation ("not," "no"). We do not test more complex forms such as double negation ("not unlikely"), negative polarity items ("any" vs. "some"), implicit negation ("fail to," "lack of"), or morphological negation ("happy" vs. "unhappy"). These forms likely present different challenges. Morphological negation may be less susceptible to token dilution since the negation is embedded within a content word that already receives substantial weight in the mean pool. Double negation ("not unlikely" ≈ "likely") would test whether models compose multiple negation operators. We hypothesize that morphological negation is better captured because the negated morpheme alters the entire token embedding rather than contributing an additional low-weight token, but this remains to be tested empirically.
Pooling strategy limitations. Our empirical results are limited to models that use mean pooling. While we discuss the theoretical implications for [CLS] pooling, max pooling, and attention-weighted pooling in Section 5.3, we do not provide direct empirical measurements for these alternatives. Future work should evaluate models with different pooling strategies on the same negation test set to determine whether the failure is pooling-specific or more fundamental.
Relationship to prior findings. Ettinger (2020) and Kassner and Schütze (2020) established that BERT-family models are insensitive to negation in masked language modeling tasks. Our work situates this finding in the production retrieval context, showing that the same underlying insensitivity produces cosine similarity scores of 0.89–0.94 between sentences and their negations—scores that exceed standard retrieval thresholds and would cause false positive matches in high-stakes domains. The token dilution formalization provides a quantitative bridge between the general phenomenon (transformer negation insensitivity) and its specific manifestation in mean-pooled retrieval systems.
Token dilution hypothesis. The mathematical formulation of the dilution hypothesis makes a simplifying assumption that the negation token embedding is approximately orthogonal to the aggregate content embedding. In practice, contextualized token embeddings occupy complex subspaces where orthogonality is an approximation. Our empirical validation (Section 5.2) shows that the aggregate-level predictions of the dilution model closely match observations (predicted mean 0.904 vs. observed 0.889–0.941), but pair-level predictions are noisy. The hypothesis should be understood as a first-order explanation rather than a precise predictive model.
Domain coverage. Our pairs are concentrated in medical, legal, and financial domains. Other domains may show different patterns, particularly domains where negation is less common or where the training data included more negation examples.
Model coverage. We tested four bi-encoder and four cross-encoder models. Other architectures—such as models with [CLS] token pooling, max pooling, or learned pooling mechanisms—may show different negation sensitivity, as discussed in Section 5.3.
Causality. While the token dilution hypothesis is consistent with our observations and validated at the aggregate level, we have not performed mechanistic interpretability analyses (e.g., probing individual attention heads or analyzing gradient attributions) to confirm that the failure is specifically due to mean pooling rather than other factors such as training data composition or learned token representations.
10. Conclusion
We have presented a systematic analysis of negation blindness in sentence embedding models, demonstrating that four widely-used bi-encoder models assign cosine similarities of 0.889–0.941 to sentence pairs with completely opposite meanings. At common production thresholds, 73–100% of negation pairs would be returned as false positive matches. The problem worsens with larger, more powerful models: GTE-large assigns every negation pair a score above 0.91, making it impossible to distinguish "The patient has diabetes" from "The patient does not have diabetes" based on embedding similarity.
We introduced the token dilution hypothesis—the mathematical consequence of mean pooling a single negation token among N content tokens—and showed that the observed cosine similarities are quantitatively consistent with models treating negation as a negligible perturbation rather than a semantic operator.
Cross-encoder models offer a partial solution: BGE-reranker and Quora RoBERTa correctly assign low scores to negation pairs, demonstrating that cross-attention mechanisms can detect negation when given the architectural capacity to do so. However, the MS-MARCO cross-encoder, trained for topical relevance, actually rates negated sentences as highly relevant (8.27/10), highlighting that the training objective is as important as the architecture.
The practical implications are immediate and serious. Any production system using bi-encoder similarity for high-stakes domains—medicine, law, finance—is vulnerable to returning results with the exact opposite meaning of the query. We recommend cross-encoder re-ranking as the most practical immediate mitigation, with negation-augmented training data as a longer-term solution.
Negation is the simplest test of whether a model understands meaning versus merely matching words. Current sentence embedding models decisively fail this test.
References
Ettinger, A. (2020). What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. Transactions of the Association for Computational Linguistics, 8:34–48.
Kassner, N. and Schütze, H. (2020). Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Online. Association for Computational Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Appendix A: Full Cross-Encoder Negation Pair Scores
The following table shows the raw cross-encoder scores for the first 15 negation pairs (matching the bi-encoder evaluation set):
| # | Pair | STS-B | MS-MARCO | BGE-reranker | Quora |
|---|---|---|---|---|---|
| 1 | diabetes | 0.462 | 7.413 | 0.018 | 0.012 |
| 2 | cancer | 0.452 | 7.669 | 0.008 | 0.009 |
| 3 | penicillin allergy | 0.535 | 9.136 | 0.061 | 0.014 |
| 4 | malignant tumor | 0.554 | 8.251 | 0.279 | 0.011 |
| 5 | pregnancy | 0.479 | 8.427 | 0.100 | 0.013 |
| 6 | surgery success | 0.460 | 8.214 | 0.001 | 0.009 |
| 7 | treatment response | 0.469 | 8.499 | 0.009 | 0.009 |
| 8 | wound infection | 0.509 | 8.031 | 0.090 | 0.011 |
| 9 | BP elevation | 0.554 | 7.584 | 0.017 | 0.013 |
| 10 | heart disease history | 0.517 | 8.515 | 0.003 | 0.012 |
| 11 | biopsy results | 0.517 | 8.875 | 0.030 | 0.008 |
| 12 | consciousness | 0.465 | 8.740 | 0.114 | 0.012 |
| 13 | fracture displacement | 0.512 | 8.178 | 0.017 | 0.011 |
| 14 | independent breathing | 0.440 | 9.242 | 0.045 | 0.028 |
| 15 | medication efficacy | 0.455 | 7.334 | 0.074 | 0.011 |
Appendix B: Reproducibility
All experiments were conducted using:
- Python 3.12
- PyTorch 2.4.0 (CPU)
- sentence-transformers 3.0.1
- NumPy 2.4.4
Models were loaded from HuggingFace Hub with default configurations. No fine-tuning was performed. All cosine similarities were computed using the sentence-transformers library's built-in cosine similarity function. Cross-encoder scores were computed using the CrossEncoder class from sentence-transformers.
The complete test pair definitions and evaluation scripts are included in the supplementary materials.
SKILL.md
Overview
Evaluates negation blindness in sentence embedding models. Tests 15 negation pairs across 4 bi-encoder and 4 cross-encoder models to quantify how well neural models distinguish "X" from "not X."
Environment Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch==2.4.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install sentence-transformers==3.0.1
pip install scipy numpy
```
Reproducing the Bi-Encoder Experiment
```bash
cd /home/ubuntu/clawd/tmp/claw4s/tokenizer_effects
python run_experiment.py  # Produces experiment_results.json
```
Models tested: all-MiniLM-L6-v2, bge-large-en-v1.5, nomic-embed-text-v1.5, gte-large
Reproducing the Cross-Encoder Experiment
```bash
cd /home/ubuntu/clawd/tmp/claw4s/crossencoder
python run_crossencoder_experiment.py  # Produces results per model
python generate_csv.py                 # Generates all_pair_results.csv
```
Models tested: stsb-roberta-large, ms-marco-MiniLM-L-12-v2, bge-reranker-large, quora-roberta-large
Key Results
- Bi-encoder negation cosine: 0.889–0.941 (should be near 0)
- 73–100% of negation pairs score above the 0.85 threshold
- BGE-reranker and Quora cross-encoders fix negation (scores <0.06)
- MS-MARCO cross-encoder makes negation WORSE (scores ~8.3/10)
Expected Runtime
- Bi-encoder: ~15–20 min (CPU)
- Cross-encoder: ~12–15 min (CPU, cached models)
Output Files
- `tokenizer_effects/experiment_results.json` — All bi-encoder results
- `crossencoder/all_crossencoder_results.json` — All cross-encoder results
- `negation_paper/paper.md` — Full paper text