When Cosine Similarity Lies: Systematic Failure Modes and Mechanisms in Production Embedding Models
Abstract
Embedding models underpin modern retrieval-augmented generation (RAG), semantic search, and recommendation systems. We present a systematic evaluation of six failure modes across five widely-deployed bi-encoder embedding models — all-MiniLM-L6-v2, BGE-large-en-v1.5, Nomic-embed-text-v1.5, mxbai-embed-large-v1, and GTE-large — using 251 manually-crafted adversarial sentence pairs and 85 control pairs (336 pairs total). We demonstrate that all tested models exhibit catastrophic failures in distinguishing semantically opposite or critically different sentences. The most severe failure mode is entity/role swapping (cross-model mean cosine similarity 0.987 ± 0.003, higher than true paraphrases at 0.878 ± 0.072), followed by temporal inversion (0.953 ± 0.020) and negation blindness (0.896 ± 0.042). At the standard retrieval threshold of 0.7, 100% of entity-swapped and temporally-inverted pairs would be retrieved as identical across all models; at a strict threshold of 0.9, entity swaps still show a 100% failure rate across every model. Beyond failure characterization, we provide a mechanistic analysis using token-level embedding decomposition: individual token embeddings at swapped positions show low similarity (0.596–0.861), proving the transformer attention layers do encode positional and role information. However, mean pooling — the standard aggregation strategy — averages these discriminative signals with the majority of identical-position tokens, collapsing role-reversal information into a similarity of 0.925+. We further evaluate four cross-encoder models on the same test pairs to assess whether cross-attention architectures mitigate these failures. We find that well-chosen cross-encoders (Quora-RoBERTa, BGE-reranker) dramatically resolve negation, numerical, and temporal failures, reducing failure rates from 100% to 0%.
However, cross-encoders trained on topical relevance objectives (MS-MARCO) fail even worse than bi-encoders, and hedging/certainty remains challenging for all architectures. We provide multi-threshold failure rate analysis, per-pair similarity distributions with confidence intervals, a power analysis confirming statistical adequacy (Cohen's d > 2.0 for entity swaps, power > 0.99), and concrete risk scenarios in medical, legal, and financial retrieval. Our complete test suite of 336 sentence pairs and reproduction code are publicly available.
1. Introduction
Text embedding models have become ubiquitous in modern NLP infrastructure. When a user queries a RAG system, when a search engine ranks results, or when a recommendation engine suggests content, embedding models and cosine similarity are typically the core mechanism. Models like all-MiniLM-L6-v2 (over 100 million downloads on HuggingFace) and BGE-large-en-v1.5 (a popular choice for production RAG pipelines) are deployed at enormous scale, processing millions of queries daily across healthcare, legal, financial, and consumer applications.
The fundamental assumption underlying these deployments is that cosine similarity between embeddings faithfully represents semantic similarity — that sentences with similar meanings will have similar embeddings, and sentences with different meanings will have different embeddings. This assumption is rarely tested systematically, and when it is, the evaluation typically focuses on whether paraphrases score highly, not on whether semantically opposite sentences score distinctly.
In this work, we challenge this assumption by testing five production embedding models across six carefully-designed failure modes: negation blindness, numerical insensitivity, entity/role swaps, temporal inversion, scope/quantifier sensitivity, and hedging/certainty confusion. Beyond documenting failures, we provide a mechanistic analysis using token-level embedding decomposition that explains why these failures occur and points toward architectural solutions.
Our key contributions are:
- A systematic benchmark of 251 adversarial pairs spanning six failure modes, with 85 control pairs (positive, negative, and near-miss), all manually crafted — 336 pairs total.
- Evidence that all five tested models catastrophically fail at distinguishing semantically opposite sentences, with entity/role swaps producing similarity scores higher than actual paraphrases across all five models.
- Multi-threshold failure rate analysis at thresholds from 0.5 to 0.9, with per-pair similarity distributions including mean, standard deviation, IQR, and min/max values, demonstrating that these are robust distributional findings, not cherry-picked results.
- Mechanistic explanation via token-level analysis: We decompose sentence embeddings into per-position token embeddings and show that transformer attention layers do encode positional and role information (swapped-entity tokens show only 0.596–0.725 similarity), but mean pooling erases this information by averaging with the majority of position-invariant tokens.
- Five-model evaluation spanning four architectures: MiniLM (22M params), BGE-large (335M params, CLS pooling), Nomic-embed (137M params, rotary embeddings), mxbai-embed-large (335M params, CLS pooling), and GTE-large (335M params, CLS pooling).
- Statistical power analysis confirming that sample sizes of 25–56 pairs per category provide power > 0.99 for the observed effect sizes (Cohen's d > 2.0 for entity swaps).
- Cross-encoder mitigation analysis evaluating four cross-encoder architectures on the same 336 test pairs, demonstrating that cross-attention resolves most bi-encoder failures but reveals new failure modes tied to training objective mismatch.
These findings have immediate practical implications: any RAG system using these models without additional safeguards is susceptible to returning semantically opposite information — with potentially life-threatening consequences in healthcare applications.
2. Related Work
2.1 Negation in NLP
The challenge of negation for neural language models has been recognized across multiple NLP tasks. Ettinger (2020) tested BERT's sensitivity to negation using psycholinguistic diagnostics and found systematic failures in distinguishing affirmative from negated statements at the representation level. This line of work, however, focused on the underlying language models (BERT, RoBERTa) rather than the downstream sentence embedding models that practitioners deploy in production.
Kassner and Schütze (2020) showed that BERT assigns high probability to contradictory statements when negation is present, suggesting the issue originates in pre-training. Our work extends these findings to the sentence embedding domain, where the practical impact is arguably greater due to the scale of deployment.
2.2 Sentence Embedding Architectures
Modern sentence embedding models overwhelmingly use mean pooling over transformer hidden states as their aggregation strategy. This choice, popularized by Reimers and Gurevych (2019) in Sentence-BERT, averages all token representations in the final layer to produce a fixed-length sentence vector. While effective for capturing overall semantic content, mean pooling is itself permutation-invariant: the pooled vector is sensitive to word order only to the extent that individual token representations encode positional information — and as we show in Section 5, this positional signal is severely diluted by the averaging operation.
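The permutation-invariance of the pooling operation itself can be seen directly: shuffling the final-layer token vectors leaves the pooled vector unchanged, so any word-order signal must already be carried inside the individual contextual token representations. A minimal numpy sketch (dimensions chosen to match MiniLM-L6):

```python
import numpy as np

# Mean pooling averages the final-layer token vectors into one sentence vector.
# The averaging step cannot see token order: permuting the rows of the
# hidden-state matrix produces exactly the same pooled vector.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(7, 384))  # 7 tokens, 384-dim states (MiniLM-sized)

pooled = hidden.mean(axis=0)
pooled_shuffled = hidden[rng.permutation(7)].mean(axis=0)

assert np.allclose(pooled, pooled_shuffled)  # identical: pooling discards order
```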
Alternative pooling strategies include CLS pooling (using the [CLS] token representation), max pooling (element-wise maximum across tokens), and attention-weighted pooling. We examine the implications of pooling choice for semantic discrimination in our mechanistic analysis, aided by the fact that our model set includes both mean-pooling (MiniLM-L6, Nomic-v1.5) and CLS-pooling (BGE-large, mxbai-large, GTE-large) architectures.
2.3 BGE Model Family
Xiao et al. (2023) introduced the BGE (BAAI General Embedding) family of models, trained with a multi-stage pipeline including contrastive pre-training and fine-tuning with hard negatives. BGE-large-en-v1.5, one of the models in our evaluation, uses CLS pooling rather than mean pooling. Despite its strong performance on standard benchmarks, our evaluation reveals that it shares the same fundamental limitations as mean-pooling models for the failure modes we test.
2.4 Cross-Encoders as Semantic Discriminators
Cross-encoder models (Nogueira and Cho, 2020) process both sentences simultaneously through a single transformer, enabling full cross-attention between all tokens. This architectural difference means cross-encoders do not suffer from the information bottleneck of fixed-size independent embeddings. However, cross-encoders are computationally expensive (O(n²) in combined sequence length) and cannot be used for efficient retrieval from large corpora. They are typically deployed as rerankers after an initial bi-encoder retrieval stage. The extent to which cross-encoders mitigate the specific failure modes we identify in bi-encoders has not been systematically evaluated on controlled adversarial benchmarks — a gap our work addresses.
3. Experimental Setup
3.1 Models Tested
We evaluated five widely-deployed embedding models, selected based on download counts, production usage, and representation of different model families and pooling strategies:
| Model | Short Name | Parameters | Embedding Dim | Pooling |
|---|---|---|---|---|
| sentence-transformers/all-MiniLM-L6-v2 | MiniLM-L6 | 22M | 384 | Mean |
| BAAI/bge-large-en-v1.5 | BGE-large | 335M | 1024 | CLS |
| nomic-ai/nomic-embed-text-v1.5 | Nomic-v1.5 | 137M | 768 | Mean |
| mixedbread-ai/mxbai-embed-large-v1 | mxbai-large | 335M | 1024 | CLS |
| thenlper/gte-large | GTE-large | 335M | 1024 | CLS |
All experiments were conducted using PyTorch 2.4.0 (CPU) and sentence-transformers 3.0.1. Models were evaluated on CPU using deterministic computation (results are identical on GPU). For Nomic-v1.5, we prepended the recommended "search_query: " prefix to all input sentences as specified in the model documentation.
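Concretely, the per-pair evaluation reduces to encoding both sentences and computing cosine similarity. The sketch below is our minimal reconstruction of that loop, not the authors' exact harness; the model name and example pair come from the paper, and the sentence-transformers import is deferred so the cosine helper stands alone:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

if __name__ == "__main__":
    # Deferred import: the heavy dependency is only needed to run the demo.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    pair = ("The patient has diabetes", "The patient does not have diabetes")
    emb = model.encode(list(pair))
    print(f"cosine = {cosine(emb[0], emb[1]):.3f}")
```

For Nomic-v1.5 the same loop would prepend "search_query: " to each sentence before encoding, per the model documentation cited above.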
3.2 Test Pair Construction
We manually crafted 251 adversarial sentence pairs spanning six failure modes, plus 85 control pairs, for a total of 336 sentence pairs. All pairs were written by the authors — none were generated by a language model. This ensures each pair represents a genuinely meaningful semantic distinction and avoids systematic biases from generation. The failure modes are:
Negation Blindness (55 pairs): Sentence pairs where one contains an explicit negation of the other. Example: "The patient has diabetes" vs. "The patient does not have diabetes." Domains: medical (15), legal (10), financial (10), product/technology (10), general/safety (10). These pairs have opposite truth values and should produce low similarity.
Numerical Insensitivity (56 pairs): Pairs where a numerical value changes by an order of magnitude or more. Example: "Take 5mg of aspirin daily" vs. "Take 500mg of aspirin daily." Domains: medical dosage (12), financial figures (10), time/distance (10), age/demographics (5), quantities (13), precision (6).
Entity/Role Swaps (45 pairs): Pairs where two entities swap syntactic roles, reversing the direction of an action. Example: "Google acquired YouTube" vs. "YouTube acquired Google." Vocabulary and topic are identical; only semantic roles change. Categories: acquisitions/business (8), interpersonal (10), comparisons (8), cause/direction (9), attribution (10).
Temporal Inversion (35 pairs): Pairs where temporal ordering is reversed. Example: "The building was evacuated before the explosion" vs. "The building was evacuated after the explosion." Categories: medical sequences (8), business/financial (7), causal/procedural (8), historical/narrative (12).
Scope/Quantifier Sensitivity (35 pairs): Pairs where quantifiers change (all→some→none). Example: "All patients responded to treatment" vs. "No patients responded to treatment." Categories: medical (8), technology (7), legal (5), general (15).
Hedging/Certainty (25 pairs): Pairs where a definitive claim is replaced with a hedged version. Example: "The drug cures cancer" vs. "The drug may help with some cancer symptoms." Categories: medical (7), financial (6), general (12).
3.3 Controls
Positive Controls (35 pairs): True paraphrases — same meaning, different words. These establish the baseline for "similar meaning" and should produce high cosine similarity.
Negative Controls (35 pairs): Completely unrelated sentence pairs. These establish the baseline for "different meaning" and should produce low cosine similarity.
Near-Miss Controls (15 pairs): Sentences differing in a minor but meaningful detail. Example: "The patient has type 1 diabetes" vs. "The patient has type 2 diabetes." These establish the boundary between legitimate similar sentences and adversarial pairs.
3.4 Metrics
For each model × category combination, we compute:
- Mean cosine similarity ± standard deviation (SD)
- Median, interquartile range (IQR), minimum, and maximum — to characterize the full distribution
- Multi-threshold failure rates: percentage of pairs with cosine similarity exceeding thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9
- Cohen's d effect size: standardized mean difference between failure-mode similarities and positive control similarities
- Severity ratio: mean failure-mode similarity / mean positive-control similarity (values ≥ 1.0 indicate the model treats failures as more similar than actual paraphrases)
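The metrics above can be computed as sketched below; this is a numpy-based reconstruction with our own function names, not the authors' released code:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference between two samples (pooled SD)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))

def failure_rates(sims, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Fraction of pairs whose similarity exceeds each threshold."""
    sims = np.asarray(sims, dtype=float)
    return {t: float((sims > t).mean()) for t in thresholds}

def severity_ratio(failure_sims, positive_control_sims):
    """Values >= 1.0 mean failures score higher than true paraphrases."""
    return float(np.mean(failure_sims) / np.mean(positive_control_sims))
```

For example, `failure_rates([0.95, 0.85, 0.65])[0.7]` returns 2/3: two of the three pairs exceed the 0.7 threshold.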
3.5 Statistical Power
Given the observed effect sizes (Cohen's d magnitudes up to 3.40 across models and failure categories), we performed a post-hoc power analysis. For the entity/role swap category (d > 2.0 across all models), even a sample of n = 10 would achieve power > 0.99 with α = 0.05. For the smallest effect size we treat as practically meaningful (d = 0.91, Nomic numerical), with our sample size of n = 56, power exceeds 0.99. Our sample sizes of 25–56 pairs per category are therefore substantially more than adequate for detecting the effects observed.
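The quoted power figures can be reproduced to good approximation with a normal approximation to the two-sample t-test. This stdlib-only sketch is our reconstruction, not necessarily the authors' exact procedure:

```python
import math
from statistics import NormalDist

def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample t-test.

    Normal approximation: power ~= P(Z > z_{1-alpha/2} - d * sqrt(n/2)),
    neglecting the (negligible) opposite rejection tail.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    ncp = d * math.sqrt(n_per_group / 2.0)  # noncentrality parameter
    return nd.cdf(ncp - z_crit)

print(round(power_two_sample(2.0, 10), 4))   # entity swaps, n = 10
print(round(power_two_sample(0.91, 56), 4))  # Nomic numerical, n = 56
```

Both calls return values above 0.99, consistent with the claims above.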
4. Results
4.1 Overview: Entity/Role Swaps Are the Most Severe Failure
The most striking finding is that entity/role swaps produce higher cosine similarity than actual paraphrases across all five models. "Google acquired YouTube" and "YouTube acquired Google" are considered more semantically similar by every model tested than "The cat sat on the mat" and "A feline rested on the rug."
Table 1: Mean Cosine Similarity (± SD) by Model and Category
| Category | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| Positive Control | 0.755 ± 0.103 | 0.906 ± 0.044 | 0.874 ± 0.056 | 0.910 ± 0.046 | 0.948 ± 0.023 |
| Negative Control | 0.009 ± 0.055 | 0.391 ± 0.069 | 0.471 ± 0.048 | 0.300 ± 0.057 | 0.713 ± 0.020 |
| Near-Miss Control | 0.795 ± 0.107 | 0.872 ± 0.055 | 0.876 ± 0.052 | 0.851 ± 0.064 | 0.934 ± 0.020 |
| Entity/Role Swap | 0.989 ± 0.011 | 0.985 ± 0.008 | 0.991 ± 0.005 | 0.982 ± 0.011 | 0.989 ± 0.014 |
| Temporal Inversion | 0.971 ± 0.015 | 0.926 ± 0.029 | 0.962 ± 0.021 | 0.931 ± 0.030 | 0.973 ± 0.013 |
| Negation | 0.898 ± 0.053 | 0.867 ± 0.038 | 0.938 ± 0.021 | 0.837 ± 0.043 | 0.939 ± 0.016 |
| Numerical | 0.858 ± 0.069 | 0.885 ± 0.049 | 0.918 ± 0.043 | 0.872 ± 0.060 | 0.947 ± 0.024 |
| Quantifier | 0.853 ± 0.074 | 0.802 ± 0.067 | 0.891 ± 0.053 | 0.799 ± 0.079 | 0.929 ± 0.028 |
| Hedging | 0.764 ± 0.101 | 0.822 ± 0.080 | 0.836 ± 0.068 | 0.825 ± 0.085 | 0.910 ± 0.035 |
All failure mode categories produce similarities dramatically higher than negative controls, and in most cases higher than or comparable to positive controls. Entity/role swaps consistently exceed positive control similarity by a wide margin in every model.
4.2 Distribution Characteristics
To address concerns about whether high failure rates could result from a degenerate distribution, we report full distributional statistics for the most severe failure mode (entity/role swaps) and the failure mode with the widest distribution (negation):
Table 2: Entity/Role Swap Distribution Statistics
| Statistic | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| Mean | 0.989 | 0.985 | 0.991 | 0.982 | 0.989 |
| Median | 0.992 | 0.987 | 0.993 | 0.984 | 0.994 |
| SD | 0.011 | 0.008 | 0.005 | 0.011 | 0.014 |
| Min | 0.925 | 0.958 | 0.973 | 0.937 | 0.934 |
| Max | 0.999 | 0.997 | 0.998 | 0.998 | 0.998 |
| Q25 | 0.985 | 0.981 | 0.989 | 0.976 | 0.988 |
| Q75 | 0.996 | 0.991 | 0.995 | 0.991 | 0.997 |
| IQR | 0.011 | 0.010 | 0.006 | 0.015 | 0.009 |
| N | 45 | 45 | 45 | 45 | 45 |
Even the minimum similarity for entity swaps across all models (0.925 for MiniLM-L6) vastly exceeds the typical retrieval threshold of 0.7. The tight IQR (0.006–0.015) confirms this is a consistent phenomenon, not driven by outliers.
Table 3: Negation Distribution Statistics
| Statistic | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| Mean | 0.898 | 0.867 | 0.938 | 0.837 | 0.939 |
| Median | 0.907 | 0.868 | 0.942 | 0.838 | 0.940 |
| SD | 0.053 | 0.038 | 0.021 | 0.043 | 0.016 |
| Min | 0.724 | 0.771 | 0.882 | 0.733 | 0.892 |
| Max | 0.979 | 0.958 | 0.975 | 0.950 | 0.969 |
| Q25 | 0.863 | 0.839 | 0.922 | 0.808 | 0.929 |
| Q75 | 0.937 | 0.898 | 0.955 | 0.865 | 0.950 |
| IQR | 0.074 | 0.059 | 0.033 | 0.057 | 0.021 |
| N | 55 | 55 | 55 | 55 | 55 |
Negation shows wider variance than entity swaps (SD 0.016–0.053 vs. 0.005–0.014), which is expected: some negated pairs involve more surface-level change ("The patient has diabetes" → "The patient does not have diabetes") than others. The lowest negation similarity observed (0.724 for MiniLM-L6) is the closest any failure pair comes to the 0.7 threshold — and even this scores above it.
4.3 Multi-Threshold Failure Rates
We report failure rates at five thresholds to characterize how the failure degrades across operating points:
Table 4: Failure Rates (%) at Multiple Similarity Thresholds — Entity/Role Swap
| Threshold | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| > 0.5 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.6 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.7 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.8 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.9 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Entity/role swaps achieve 100% failure rate even at the strictest threshold of 0.9 across all five models. No choice of threshold can mitigate this failure.
Table 5: Failure Rates (%) at Multiple Similarity Thresholds — Negation
| Threshold | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| > 0.5 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.6 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.7 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| > 0.8 | 98.2 | 96.4 | 100.0 | 83.6 | 100.0 |
| > 0.9 | 49.1 | 23.6 | 96.4 | 10.9 | 98.2 |
Negation shows some threshold sensitivity: mxbai-large drops to 10.9% failure at threshold 0.9, while Nomic-v1.5 and GTE-large remain above 96%. However, at any practical retrieval threshold (0.7), all models fail on 100% of negation pairs.
Table 6: Failure Rates (%) at Threshold 0.7 — All Categories
| Failure Mode | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| Entity/Role Swap | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Temporal | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Negation | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Numerical | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Quantifier | 100.0 | 97.1 | 100.0 | 88.6 | 100.0 |
| Hedging | 76.0 | 96.0 | 100.0 | 96.0 | 100.0 |
At the standard retrieval threshold of 0.7, all five models show 100% failure rates for entity swaps, temporal inversion, negation, and numerical changes. Even quantifiers and hedging show near-total failure.
Table 7: Failure Rates (%) at Threshold 0.9 — All Categories
| Failure Mode | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| Entity/Role Swap | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Temporal | 100.0 | 88.6 | 100.0 | 88.6 | 100.0 |
| Negation | 49.1 | 23.6 | 96.4 | 10.9 | 98.2 |
| Numerical | 32.1 | 42.9 | 66.1 | 39.3 | 94.6 |
| Quantifier | 31.4 | 5.7 | 48.6 | 11.4 | 74.3 |
| Hedging | 8.0 | 24.0 | 24.0 | 32.0 | 52.0 |
At the strictest 0.9 threshold, entity/role swaps remain at 100% failure across all five models. Temporal inversion remains near-total (88.6–100%). Even negation — the "easiest" to detect among the top three failures — still shows majority failure rates for two of five models.
4.4 Severity Analysis
We compute the severity ratio — the ratio of mean failure-mode similarity to mean positive-control similarity. A ratio ≥ 1.0 means the model treats failure pairs as more similar than actual paraphrases.
Table 8: Severity Ratios by Model and Category
| Category | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large | Cross-Model Avg |
|---|---|---|---|---|---|---|
| Entity Swap | 1.310 | 1.087 | 1.134 | 1.079 | 1.043 | 1.131 |
| Temporal | 1.286 | 1.022 | 1.101 | 1.023 | 1.026 | 1.092 |
| Negation | 1.190 | 0.957 | 1.074 | 0.920 | 0.990 | 1.026 |
| Numerical | 1.136 | 0.977 | 1.050 | 0.959 | 0.999 | 1.024 |
| Quantifier | 1.130 | 0.885 | 1.020 | 0.878 | 0.980 | 0.979 |
| Hedging | 1.012 | 0.908 | 0.957 | 0.907 | 0.961 | 0.949 |
Entity/role swaps are treated as 4.3–31.0% more similar than paraphrases depending on the model. Temporal inversions are treated as 2.2–28.6% more similar. Even negation — which involves adding a word — shows severity ratios above 1.0 for two of five models.
4.5 Effect Sizes
Table 9: Cohen's d (Failure Mode vs. Positive Controls)
| Failure Mode | MiniLM-L6 | BGE-large | Nomic-v1.5 | mxbai-large | GTE-large |
|---|---|---|---|---|---|
| Entity/Role Swap | 3.40 | 2.69 | 3.18 | 2.30 | 2.19 |
| Temporal | 2.93 | 0.54 | 2.09 | 0.55 | 1.31 |
| Negation | 1.87 | −0.96 | 1.68 | −1.64 | −0.50 |
| Numerical | 1.23 | −0.44 | 0.91 | −0.68 | −0.06 |
| Quantifier | 1.09 | −1.85 | 0.32 | −1.72 | −0.77 |
| Hedging | 0.09 | −1.37 | −0.62 | −1.31 | −1.32 |
Positive Cohen's d values here indicate the failure-mode pairs are scored higher than positive controls. Entity/role swaps show enormous positive effect sizes (d = 2.19 to 3.40) across all models — these are very large effects by any standard. For reference, a Cohen's d of 2.0 is considered "very large" in the social sciences; our entity-swap effects exceed this in every model.
4.6 Cross-Model Comparison
Comparing models across the three most severe failure modes:
Table 10: Model Vulnerability Ranking
| Rank | Model | Pooling | Entity Swap | Negation | Temporal | Pattern |
|---|---|---|---|---|---|---|
| 1 (most vulnerable) | Nomic-v1.5 | Mean | 0.991 | 0.938 | 0.962 | Highest or near-highest everywhere |
| 2 | GTE-large | CLS | 0.989 | 0.939 | 0.973 | Highest temporal, tied highest negation |
| 3 | MiniLM-L6 | Mean | 0.989 | 0.898 | 0.971 | High entity/temporal, moderate negation |
| 4 | BGE-large | CLS | 0.985 | 0.867 | 0.926 | Lowest negation, lowest temporal |
| 5 (most robust) | mxbai-large | CLS | 0.982 | 0.837 | 0.931 | Lowest across most categories |
mxbai-large shows the best (though still catastrophically inadequate) discrimination, particularly for negation (0.837) and quantifiers (0.799). However, the differences between models are small relative to the gap between failure modes and the ideal (low similarity). All five models fail catastrophically for entity swaps and temporal inversion regardless of architecture, size, or pooling strategy.
Interestingly, BGE-large and mxbai-large (CLS pooling) together with MiniLM-L6 (mean pooling) show somewhat better negation sensitivity than Nomic-v1.5 and GTE-large, but this advantage does not extend to entity swaps, where all models perform essentially identically (0.982–0.991).
4.7 Selected Worst-Case Examples
To illustrate practical danger, we present specific pairs with their similarity scores across models:
Medical — Negation (Life-safety relevant):
- "The patient is allergic to penicillin" / "The patient is not allergic to penicillin"
- MiniLM: 0.870 | BGE: 0.839 | Nomic: 0.929 | mxbai: 0.812 | GTE: 0.929
Medical — Numerical (Dosage errors):
- "Take 5mg of aspirin daily" / "Take 500mg of aspirin daily"
- MiniLM: 0.894 | BGE: 0.930 | Nomic: 0.919 | mxbai: 0.913 | GTE: 0.964
Entity Swap (Business intelligence):
- "Google acquired YouTube" / "YouTube acquired Google"
- MiniLM: 0.986 | BGE: 0.984 | Nomic: 0.990 | mxbai: 0.980 | GTE: 0.987
Entity Swap (Financial):
- "Alice sent money to Bob" / "Bob sent money to Alice"
- MiniLM: 0.981 | BGE: 0.984 | Nomic: 0.988 | mxbai: 0.981 | GTE: 0.988
Temporal (Medical safety):
- "The building was evacuated before the explosion" / "The building was evacuated after the explosion"
- MiniLM: 0.978 | BGE: 0.949 | Nomic: 0.980 | mxbai: 0.952 | GTE: 0.981
Even the lowest-scoring entity swap pair across all models ("Team A beat Team B in the finals" / "Team B beat Team A in the finals" at 0.925 on MiniLM) vastly exceeds any reasonable retrieval threshold.
5. Mechanistic Analysis
5.1 The Mean Pooling Hypothesis
The severity ranking of failure modes provides a critical clue about the underlying mechanism. Failures correlate inversely with the degree of surface-level change:
- Entity swaps (cross-model mean 0.987): Zero vocabulary change, only position changes
- Temporal inversion (0.953): One word changes (before↔after)
- Negation (0.896): One word added ("not")
- Numerical (0.896): One token changes (a number)
- Quantifiers (0.855): One word changes (all↔none)
- Hedging (0.831): Multiple words change
This pattern is consistent with a bag-of-words hypothesis: the models primarily encode which tokens are present, not how they are arranged. To test this mechanistically, we performed token-level embedding analysis.
5.2 Token-Level Embedding Decomposition
For entity-swap pairs, we extracted the hidden-state representation of each token position from the final transformer layer (before pooling) of MiniLM-L6 and computed cosine similarity between corresponding positions. This reveals whether the transformer's attention mechanism encodes positional/role information that is subsequently lost during pooling.
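This decomposition can be sketched as follows; the helper is our own, and the `__main__` block assumes the transformers package and a download of MiniLM-L6:

```python
import numpy as np

def per_position_similarity(h_a: np.ndarray, h_b: np.ndarray) -> np.ndarray:
    """Cosine similarity between corresponding token positions.

    h_a, h_b: (num_tokens, hidden_dim) final-layer hidden states for two
    sentences of equal token length (true for swap pairs, which share
    vocabulary).
    """
    a = h_a / np.linalg.norm(h_a, axis=1, keepdims=True)
    b = h_b / np.linalg.norm(h_b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

if __name__ == "__main__":
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "sentence-transformers/all-MiniLM-L6-v2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    batch = tok(["Google acquired YouTube", "YouTube acquired Google"],
                return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state.numpy()
    print(per_position_similarity(hidden[0], hidden[1]).round(3))
```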
Experiment 1: "Google acquired YouTube" vs. "YouTube acquired Google"
| Position | Token A | Token B | Cosine Sim |
|---|---|---|---|
| 0 | [CLS] | [CLS] | 0.990 |
| 1 | google | youtube | 0.636 |
| 2 | acquired | acquired | 0.986 |
| 3 | youtube | google | 0.596 |
| 4 | [SEP] | [SEP] | 0.984 |
The swapped entity tokens show dramatically lower per-position similarity (0.596 and 0.636) compared to unchanged positions (0.984–0.990). The transformer attention layers do encode positional and role information — "google" in subject position has a substantially different representation from "google" in object position.
However, mean pooling averages all positions equally. Despite the discriminative signal at positions 1 and 3, the mean-pooled sentence similarity is 0.985. The non-discriminative dimensions dominate.
Experiment 2: "The cat chased the dog" vs. "The dog chased the cat"
| Position | Token A | Token B | Cosine Sim |
|---|---|---|---|
| 0 | [CLS] | [CLS] | 0.995 |
| 1 | the | the | 0.978 |
| 2 | cat | dog | 0.672 |
| 3 | chased | chased | 0.991 |
| 4 | the | the | 0.985 |
| 5 | dog | cat | 0.669 |
| 6 | [SEP] | [SEP] | 0.984 |
Again, the swapped entity positions show low similarity (~0.67) while all other positions exceed 0.978. Mean-pooled similarity: 0.991. Dilution ratio: 2.5 same-position tokens per swapped token.
Experiment 3: "Alice sent money to Bob" vs. "Bob sent money to Alice"
| Position | Token A | Token B | Cosine Sim |
|---|---|---|---|
| 0 | [CLS] | [CLS] | 0.994 |
| 1 | alice | bob | 0.649 |
| 2 | sent | sent | 0.994 |
| 3 | money | money | 0.996 |
| 4 | to | to | 0.979 |
| 5 | bob | alice | 0.725 |
| 6 | [SEP] | [SEP] | 0.973 |
The pattern is consistent: swapped tokens show 0.649–0.725 similarity while unchanged positions remain above 0.973. Mean-pooled similarity: 0.981.
Experiment 4: "The teacher praised the student" vs. "The student praised the teacher"
| Position | Token A | Token B | Cosine Sim |
|---|---|---|---|
| 0 | [CLS] | [CLS] | 0.995 |
| 1 | the | the | 0.975 |
| 2 | teacher | student | 0.700 |
| 3 | praised | praised | 0.996 |
| 4 | the | the | 0.976 |
| 5 | student | teacher | 0.696 |
| 6 | [SEP] | [SEP] | 0.990 |
Mean-pooled similarity: 0.992.
Experiment 5: "The doctor diagnosed the patient" vs. "The patient diagnosed the doctor"
| Position | Token A | Token B | Cosine Sim |
|---|---|---|---|
| 0 | [CLS] | [CLS] | 0.982 |
| 1 | the | the | 0.954 |
| 2 | doctor | patient | 0.703 |
| 3 | diagnosed | diagnosed | 0.987 |
| 4 | the | the | 0.951 |
| 5 | patient | doctor | 0.667 |
| 6 | [SEP] | [SEP] | 0.982 |
Mean-pooled similarity: 0.983.
Summary of Token-Level Analysis:
| Pair | Same-Pos Mean | Diff-Pos Mean | Pooled Sim | Dilution Ratio |
|---|---|---|---|---|
| Google/YouTube | 0.987 | 0.616 | 0.985 | 1.5:1 |
| cat/dog | 0.987 | 0.671 | 0.991 | 2.5:1 |
| Alice/Bob | 0.987 | 0.687 | 0.981 | 2.5:1 |
| teacher/student | 0.986 | 0.698 | 0.992 | 2.5:1 |
| Microsoft/LinkedIn | 0.984 | 0.385 | 0.984 | 0.5:1 |
| predator/prey | 0.978 | 0.846 | 0.987 | 2.5:1 |
| Team A/B | 0.907 | 0.543 | 0.925 | 4.0:1 |
| doctor/patient | 0.971 | 0.685 | 0.983 | 2.5:1 |
Across all eight pairs, the same-position (non-swapped) token mean ranges from 0.907 to 0.987, while the swapped-position token mean ranges from 0.385 to 0.846. The transformer clearly encodes different representations for the same word in different syntactic roles. But the pooled similarity (0.925–0.992) consistently washes out this distinction.
5.3 The Dilution Mechanism
The key finding is that in a sentence of N tokens, only 2 positions carry discriminative information for entity swaps (the two swapped entities). The remaining N−2 positions have near-identical representations. When mean pooling averages across all N positions, the discriminative signal from 2 positions is diluted by the N−2 non-discriminative positions.
For a 5-token sentence ("Google acquired YouTube"), the discriminative fraction is 2/5 = 40%. For a 7-token sentence ("The cat chased the dog"), it drops to 2/7 = 29%. For longer, more realistic sentences, the fraction drops further. A sentence with 20 tokens where 2 entities swap would have a discriminative fraction of only 10%.
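The arithmetic above generalizes to a one-line dilution formula; the helper name below is ours:

```python
def discriminative_fraction(n_tokens: int, n_swapped: int = 2) -> float:
    """Fraction of token positions carrying role-discriminating signal when
    n_swapped entities exchange places in a sentence of n_tokens tokens."""
    return n_swapped / n_tokens

# Worked examples from the text: 5-, 7-, and 20-token sentences.
for n in (5, 7, 20):
    print(n, discriminative_fraction(n))
```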
The Microsoft/LinkedIn pair is instructive: because the tokenizer splits these into subword tokens at different positions, 4 of 6 positions differ, yet the pooled similarity is still 0.984. This confirms that the dilution mechanism operates even when many positions differ, because the high-dimensional embedding space allows substantial overlap between subword representations.
5.4 Negation: An Even Worse Dilution
For negation, the dilution is even more severe because only one token carries the discriminative signal. The "not" token constitutes:
- 1 of 8 tokens (12.5%) in "The patient does not have diabetes"
- 1 of 7 tokens (14.3%) in "The drug is not effective"
- 1 of 9 tokens (11.1%) in "The water is not safe to drink"
In longer sentences typical of real documents (20–50 tokens), the negation signal would be diluted to 2–5% of the representation — well below any practical detection threshold.
5.5 CLS Pooling vs. Mean Pooling
An important architectural consideration is the choice of pooling strategy. CLS pooling uses only the [CLS] token representation, which theoretically has access to the full sequence through self-attention and can encode compositional semantics including word order and roles.
Our results provide a natural comparison. For entity swaps:
- Mean pooling models (MiniLM, Nomic): 0.989–0.991
- CLS pooling models (BGE, mxbai, GTE): 0.982–0.989
The difference is marginal and not practically significant. CLS pooling offers no meaningful advantage for entity-swap detection. For negation, there is a slightly larger difference:
- Mean pooling models: 0.898–0.938
- CLS pooling models: 0.837–0.939
mxbai-large (CLS) achieves 0.837, the lowest negation score, but GTE-large (also CLS) achieves 0.939, the highest. This suggests the advantage comes from specific training procedures rather than pooling strategy per se.
Our token-level analysis confirms this: the [CLS] token similarity for entity-swap pairs (0.982–0.995) is comparable to or higher than the mean-pooled similarity. The [CLS] token, despite having access to the full attention context, does not encode strong role-distinguishing information in any of the models tested.
5.6 Formal Analysis of the Token-to-Pooled Similarity Gap
A natural question is: if the average per-position token similarity is ~0.84, why is the pooled sentence similarity ~0.98? We provide a geometric explanation.
Let h_i^A and h_i^B be the hidden states at position i for sentences A and B, each in ℝ^d (where d = 384 for MiniLM-L6). The mean-pooled sentence vectors are:
s^A = (1/N) Σ_i h_i^A, s^B = (1/N) Σ_i h_i^B
Expanding the dot product:
s^A · s^B = (1/N²) Σ_i Σ_j (h_i^A · h_j^B)
This is a sum over all pairs (i, j), not just corresponding positions (i, i). The cross-terms (h_i^A · h_j^B for i ≠ j) are not zero — tokens at different positions share substantial overlap in embedding space because they are contextually related. For entity swaps specifically, "google" at position 1 in sentence A has high dot product with "google" at position 3 in sentence B (same token, different position). The cross-position similarity captures the vocabulary overlap that mean pooling preserves, while the within-position similarity at swapped positions captures the role difference that mean pooling averages away.
This is why the pooled similarity (0.98+) is higher than the average of per-position similarities (~0.84): the cross-terms contribute positively and dominate the sum. Mean pooling computes a global average over all pairwise token interactions, and since the sentences share all vocabulary, most interactions are between identical or highly related tokens.
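The decomposition above is an exact algebraic identity and can be checked numerically. The sketch below uses synthetic random hidden states (the dimensions match MiniLM-L6, but the vectors are not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 384
hA = rng.normal(size=(N, d))  # hidden states of sentence A
hB = rng.normal(size=(N, d))  # hidden states of sentence B

# Mean-pooled sentence vectors
sA, sB = hA.mean(axis=0), hB.mean(axis=0)

# The pooled dot product equals the average over ALL (i, j) token
# pairs, not just the matching positions (i, i):
pooled = sA @ sB
all_pairs = (hA @ hB.T).sum() / N**2            # (1/N^2) sum_{i,j} h_i^A . h_j^B
diag_only = np.einsum('id,id->', hA, hB) / N**2  # the (i, i) terms alone

assert np.isclose(pooled, all_pairs)  # identity holds exactly
# diag_only omits the N(N-1) cross-terms that dominate the pooled score
```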
5.7 Implications for Model Architecture
Our mechanistic analysis points to three levels at which failures originate:
Pooling layer: Mean pooling is provably lossy for compositional semantics. Any information that is position-dependent and affects only a minority of tokens will be diluted below the discrimination threshold. CLS pooling fares no better in practice.
Training objective: Contrastive learning with paraphrase pairs does not create explicit pressure to separate negated, role-reversed, or numerically-altered sentences. The loss function is agnostic to these distinctions. Including hard negatives (negated pairs, role-reversed pairs) during training would likely improve discrimination.
Attention mechanism: While attention layers do encode position-dependent information (as shown by our token analysis), this information is not sufficiently strong to survive pooling. Hard negatives during training would likely strengthen these attention patterns to produce larger differences at swapped positions.
6. Cross-Encoder Mitigation Analysis
A natural question arising from our bi-encoder results is whether cross-encoder architectures — which perform full cross-attention between query and document tokens rather than encoding them independently — can mitigate the identified failure modes. We further evaluated four cross-encoder models on the same 371 test pairs to directly address this question.
6.1 Cross-Encoder Models Tested
We evaluated four cross-encoder models spanning different training objectives:
| Model | Short Name | Training Objective | Output Range |
|---|---|---|---|
| cross-encoder/stsb-roberta-large | STS-B | Semantic textual similarity | 0–1 (continuous) |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | MS-MARCO | Passage relevance ranking | Unbounded logits |
| BAAI/bge-reranker-large | BGE-reranker | Relevance reranking | 0–1 (sigmoid) |
| cross-encoder/quora-roberta-large | Quora-RoBERTa | Duplicate question detection | 0–1 (sigmoid) |
These models were selected to represent the major cross-encoder paradigms: semantic similarity (STS-B), information retrieval relevance (MS-MARCO, BGE-reranker), and paraphrase/duplicate detection (Quora-RoBERTa). All experiments used the same sentence pairs and were run on the same hardware (CPU, PyTorch 2.4.0).
6.2 Results: Cross-Encoders Dramatically Resolve Most Failures
Table 11: Cross-Encoder Raw Scores by Category (Mean ± SD)
| Category | STS-B | MS-MARCO | BGE-reranker | Quora-RoBERTa |
|---|---|---|---|---|
| Positive Control | 0.889 ± 0.100 | 4.051 ± 3.122 | 0.996 ± 0.010 | 0.894 ± 0.188 |
| Negative Control | 0.010 ± 0.001 | −11.142 ± 0.123 | 0.000 ± 0.000 | 0.005 ± 0.000 |
| Entity/Role Swap | 0.837 ± 0.189 | 8.999 ± 0.674 | 0.398 ± 0.298 | 0.037 ± 0.053 |
| Temporal Inversion | 0.668 ± 0.104 | 8.362 ± 0.582 | 0.073 ± 0.142 | 0.038 ± 0.047 |
| Negation | 0.491 ± 0.041 | 8.210 ± 0.690 | 0.073 ± 0.082 | 0.020 ± 0.029 |
| Numerical | 0.454 ± 0.068 | 5.831 ± 1.962 | 0.114 ± 0.222 | 0.018 ± 0.042 |
| Quantifier | 0.563 ± 0.130 | 6.621 ± 1.517 | 0.281 ± 0.415 | 0.168 ± 0.213 |
| Hedging | 0.652 ± 0.175 | 2.384 ± 4.931 | 0.883 ± 0.225 | 0.514 ± 0.429 |
Note: MS-MARCO outputs unbounded relevance logits (not 0–1 probabilities), so raw scores are not directly comparable to other models. Higher scores indicate greater predicted relevance.
The results reveal a striking divergence across cross-encoder architectures:
Quora-RoBERTa achieves near-perfect discrimination. Negation pairs score 0.020 (vs. 0.894 positive controls), entity swaps score 0.037 (vs. 0.894), temporal inversions score 0.038, and numerical differences score 0.018. At any threshold above 0.5, the failure rate for negation, numerical, entity swap, and temporal categories drops to 0% — a complete resolution of the bi-encoder failures. The model correctly identifies that "Google acquired YouTube" and "YouTube acquired Google" are not paraphrases (score: 0.008), that "The patient has diabetes" and "The patient does not have diabetes" are not duplicates (score: 0.012), and that "Take 5mg of aspirin daily" and "Take 500mg of aspirin daily" are meaningfully different (score: 0.013).
BGE-reranker shows strong but selective improvement. It achieves near-zero scores for negation (0.073) and temporal inversion (0.073), but shows moderate residual failure for entity swaps (0.398 mean, with 24.4% of pairs scoring above 0.7). Critically, BGE-reranker fails completely on hedging (0.883), scoring hedged pairs nearly as high as true paraphrases (0.996). This suggests the reranker treats "The drug cures cancer" and "The drug may help with some cancer symptoms" as equally relevant for retrieval purposes — which may be correct from a topical relevance perspective but fails to capture the semantic distinction.
Table 12: Cross-Encoder Failure Rates at Threshold 0.5 (Raw Scores)
| Failure Mode | STS-B | MS-MARCO | BGE-reranker | Quora-RoBERTa |
|---|---|---|---|---|
| Entity Swap | 93.3% | 100.0% | 33.3% | 0.0% |
| Temporal | 100.0% | 100.0% | 2.9% | 0.0% |
| Negation | 43.6% | 100.0% | 0.0% | 0.0% |
| Numerical | 26.8% | 98.2% | 5.4% | 0.0% |
| Quantifier | 57.1% | 100.0% | 28.6% | 5.7% |
| Hedging | 72.0% | 64.0% | 92.0% | 52.0% |
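The failure rates in Tables 11 and 12 follow the standard definition: the fraction of adversarial pairs scoring at or above the threshold, i.e. pairs a retrieval system would wrongly accept as matches. A minimal helper (illustrative; not the paper's exact evaluation script):

```python
def failure_rate(scores, threshold=0.5):
    """Percentage of adversarial pairs scored at or above the threshold,
    i.e. pairs that would be wrongly treated as semantic matches."""
    hits = sum(s >= threshold for s in scores)
    return 100.0 * hits / len(scores)

# Hypothetical scores for four adversarial pairs:
print(failure_rate([0.9, 0.2, 0.6, 0.4], threshold=0.5))  # 50.0
```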
6.3 The Training Objective Determines Success
The most striking finding is that training objective — not architecture — determines whether a cross-encoder resolves bi-encoder failures:
MS-MARCO (passage relevance) fails catastrophically. Despite having full cross-attention, MS-MARCO produces higher relevance scores for adversarial pairs than for true paraphrases. Entity swaps score 8.999 vs. 4.051 for positive controls; negation pairs score 8.210 vs. 4.051. This occurs because MS-MARCO is trained to predict topical relevance: "Google acquired YouTube" and "YouTube acquired Google" discuss the same topic (Google, YouTube, acquisitions) and are thus both highly relevant to queries about either company. The MS-MARCO cross-encoder has learned to be an excellent topic classifier but not a semantic equivalence detector. This is a topical relevance trap: the model accurately identifies that both sentences are relevant to the same query, while being blind to the critical semantic distinction between them.
STS-B (similarity regression) partially fails. Entity swaps score 0.837 (vs. 0.889 positive controls), showing that even an STS-trained cross-encoder struggles with role reversal — presumably because STS training data rarely includes entity-swap hard negatives. However, STS-B shows moderate negation sensitivity (0.491), better than any bi-encoder.
Quora-RoBERTa (duplicate detection) succeeds dramatically. This model was trained specifically to determine whether two questions are semantically equivalent. This objective inherently requires sensitivity to negation, entity roles, and numerical values — exactly the distinctions our test pairs probe. The duplicate detection framing creates natural hard negatives: "How do I get from A to B?" is not a duplicate of "How do I get from B to A?"
BGE-reranker (relevance reranking) partially succeeds. Trained on relevance reranking with hard negatives, BGE-reranker correctly identifies that negated and temporally-inverted sentences are not relevant to each other. However, it treats hedged variants as equally relevant (correctly, from a retrieval perspective) and shows residual entity-swap confusion.
6.4 Hedging Remains Unsolved
Hedging/certainty is the one failure mode where no architecture provides satisfactory results. Even Quora-RoBERTa — which achieves 0% failure rates across all other categories — fails on 52% of hedging pairs at threshold 0.5, with a mean score of 0.514. BGE-reranker scores hedging pairs at 0.883, essentially treating them as paraphrases.
This makes mechanistic sense: "The drug cures cancer" and "The drug may help with some cancer symptoms" genuinely discuss the same topic with substantial semantic overlap. The distinction is one of epistemic certainty and scope — a subtle pragmatic difference that neither bi-encoders nor cross-encoders reliably capture. This suggests that hedging detection may require specialized models or post-processing that explicitly tracks certainty markers.
6.5 Implications for Retrieval Pipeline Design
These results suggest a clear architectural recommendation for safety-critical retrieval:
- Use bi-encoders for initial retrieval (fast, scalable candidate generation)
- Rerank with a Quora-RoBERTa-style cross-encoder trained on duplicate/paraphrase detection rather than topical relevance
- Do NOT rely on MS-MARCO-style relevance rerankers for semantic equivalence — they will amplify rather than mitigate bi-encoder failures
- Implement dedicated hedging detection as a post-processing step, as no embedding architecture reliably captures certainty distinctions
The dramatic difference between MS-MARCO and Quora-RoBERTa cross-encoders — same architecture, opposite outcomes — underscores that the training objective is the decisive factor, not the presence of cross-attention per se.
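The recommended pipeline shape can be sketched in a few lines. This is a structural sketch only: `toy_scorer` is a stand-in for a real duplicate-detection cross-encoder (e.g. loaded via sentence-transformers' `CrossEncoder`), and the threshold of 0.5 mirrors the analysis above.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    candidates: List[str],
    cross_score: Callable[[str, str], float],
    accept_threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score bi-encoder candidates with a cross-encoder and keep only
    those it judges semantically equivalent to the query."""
    scored = [(doc, cross_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc, s) for doc, s in scored if s >= accept_threshold]

def toy_scorer(query: str, doc: str) -> float:
    # Stand-in for a duplicate-detection cross-encoder: penalizes a
    # negation mismatch, as Quora-RoBERTa does empirically.
    return 0.1 if ("not" in doc) != ("not" in query) else 0.9

print(retrieve_then_rerank(
    "The patient has diabetes",
    ["The patient has diabetes", "The patient does not have diabetes"],
    toy_scorer,
))
```

In production, `cross_score` would wrap a model such as cross-encoder/quora-roberta-large; the candidate list would come from the bi-encoder's approximate nearest-neighbor search.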
7. Practical Implications
7.1 Medical RAG Systems
The failures documented here have direct safety implications for medical RAG deployments:
Negation blindness in allergy records: A system querying patient records with "Is the patient allergic to penicillin?" would retrieve documents stating both "The patient is allergic to penicillin" and "The patient is not allergic to penicillin" with nearly identical relevance scores (0.81–0.93). If the downstream LLM does not independently resolve the negation, the result could be a life-threatening drug interaction.
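A keyword-level negation guard of the kind recommended below can be a few lines of code. The marker list here is illustrative, not exhaustive; clinical deployments would use a vetted lexicon (e.g. NegEx-style triggers):

```python
import re

# Illustrative negation markers; a real system needs a curated lexicon.
NEGATION_MARKERS = re.compile(
    r"\b(not|no|never|without|denies|negative for)\b", re.IGNORECASE
)

def negation_mismatch(query: str, passage: str) -> bool:
    """Flag retrievals where query and passage disagree on negation.
    A crude post-retrieval guard, not a substitute for full NLU."""
    return bool(NEGATION_MARKERS.search(query)) != bool(NEGATION_MARKERS.search(passage))
```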
Dosage confusion: "Take 5mg of aspirin daily" and "Take 500mg of aspirin daily" score 0.89–0.96 similarity across models. A 100× dosage error retrieved with high confidence could lead to serious adverse events.
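Dosage mismatches of this kind are easy to catch with structured numerical extraction, as recommended below. A minimal regex sketch (the unit list is illustrative and far from complete):

```python
import re

# Illustrative dose pattern; real systems need a fuller unit inventory.
DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b", re.IGNORECASE)

def extract_doses(text):
    """Return (value, unit) pairs found in a sentence."""
    return [(float(v), u.lower()) for v, u in DOSE_RE.findall(text)]

def doses_agree(a: str, b: str) -> bool:
    """True only when both sentences state the same doses."""
    return extract_doses(a) == extract_doses(b)

print(doses_agree("Take 5mg of aspirin daily",
                  "Take 500mg of aspirin daily"))  # False
```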
Temporal ordering in treatment: "Apply the bandage before cleaning the wound" vs. "Apply the bandage after cleaning the wound" — these describe opposite clinical procedures but score 0.93–0.98 similarity.
Recommendation: Medical RAG systems should implement: (a) cross-encoder reranking using a model trained on semantic equivalence (e.g., Quora-RoBERTa-style duplicate detection) rather than topical relevance (e.g., MS-MARCO) — our experiments show the former reduces negation failure rates from 100% to 0% while the latter amplifies failures, (b) keyword-level negation detection as a post-retrieval filter, (c) numerical extraction and comparison for dosage-related queries, and (d) human-in-the-loop verification for all medication-related retrievals.
7.2 Legal Search
Precedent confusion: A search for "cases where the defendant was found guilty" would retrieve cases where "the defendant was found not guilty" with equal relevance.
Entity role confusion: "The plaintiff sued the defendant" vs. "The defendant sued the plaintiff" describe different legal postures with different implications. All five embedding models score these above 0.98.
Recommendation: Legal retrieval should supplement embedding search with structured metadata (case outcome tags, party roles) and boolean keyword filters.
7.3 Financial Analysis
Corporate action confusion: "Company A acquired Company B" and "Company B acquired Company A" describe completely different corporate events with different implications for stock prices, market capitalization, and regulatory compliance. These score 0.98+ similarity across all models.
Temporal ordering: "The stock price rose before the earnings announcement" vs. "The stock price rose after the earnings announcement" have fundamentally different implications for insider trading analysis, yet score 0.93+ similarity.
Recommendation: Financial RAG systems should implement entity role extraction (who acquired whom) as a structured post-processing step, and temporal relation extraction for event-sequencing queries.
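The entity-role extraction recommended above can be prototyped with a naive pattern for a single verb. This is a sketch only; a production system would use NER plus dependency parsing to recover agent and patient roles across verbs and voices:

```python
import re

# Naive single-verb pattern; illustrative, not production-grade.
ACQ_RE = re.compile(r"([\w\s]+?)\s+acquired\s+([\w\s]+)", re.IGNORECASE)

def acquisition_roles(text):
    """Return (acquirer, target) for sentences of the form
    'X acquired Y', or None when the pattern is absent."""
    m = ACQ_RE.search(text)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

# Role extraction distinguishes the swapped pair that embeddings conflate:
print(acquisition_roles("Google acquired YouTube"))   # ('Google', 'YouTube')
print(acquisition_roles("YouTube acquired Google"))   # ('YouTube', 'Google')
```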
8. Limitations
English only: All test pairs are in English. Failure modes may differ in languages with different negation structures or word order patterns. Languages with free word order (e.g., Turkish, Japanese) may show different entity-swap sensitivity, while languages with morphological negation (e.g., German un-) may exhibit different negation failure profiles.
Sentence-level: We test individual sentences. Document-level embeddings with more context may behave differently, though our token dilution analysis suggests longer texts would exacerbate rather than alleviate the problem — the discriminative token fraction shrinks as context grows.
Five models: While we selected widely-used models spanning different architectures and pooling strategies, the embedding landscape is vast. Instruction-tuned models (e.g., E5-mistral) and domain-specific models may perform differently.
CPU inference: All models were run on CPU. This does not affect numerical results (computation is deterministic) but limited throughput.
Manual pair construction: While hand-crafted pairs avoid generation bias, the selection of domains and phrasings reflects the authors' priorities. The test suite should be seen as a targeted probe, not a comprehensive benchmark.
No instruction-tuned models: Models with explicit instruction-following capability may show different behavior. However, in standard retrieval pipelines, models are not typically given such instructions.
Sample sizes: Our per-category sample sizes (25–56 pairs) are sufficient for statistical power (Section 3.5) but smaller than large-scale benchmarks. We prioritized pair quality and manual construction over automated generation at scale. Future work could expand the test suite through carefully supervised generation to increase coverage across domains and linguistic constructions.
GTE-large calibration: GTE-large exhibits notably high negative control similarity (0.713) compared to other models (0.009–0.471), suggesting a compressed dynamic range with an elevated similarity floor. This is likely attributable to its training data distribution and normalization strategy, which produce higher baseline cosine similarity between arbitrary sentence pairs. While this does not affect relative comparisons within GTE-large (failure modes still rank identically), it means absolute threshold values are not portable across models, and practitioners deploying GTE-large should calibrate thresholds to its specific similarity distribution.
Alternative pooling strategies not tested: While our mechanistic analysis identifies mean pooling as the primary cause of signal dilution, we did not experimentally evaluate alternative pooling strategies. Max pooling — which takes the element-wise maximum across token representations — would amplify the most distinctive feature at each embedding dimension, potentially preserving the discriminative signal from negation tokens or swapped entities that mean pooling averages away. For entity swaps where swapped tokens show per-position similarity of only 0.596–0.725, max pooling could retain the distinctive representation of the most informative token position rather than averaging it with N−2 non-discriminative positions. Attention-weighted pooling — where a learned attention mechanism assigns weights to each token before averaging — could learn to upweight semantically critical tokens such as negation markers, quantifiers, and entity names. During contrastive training with hard negatives (negated pairs, entity-swapped pairs), the attention weights would receive gradient signal to attend to the tokens that distinguish the hard negatives from the anchors. Learnable pooling approaches such as parameterized [CLS]-style aggregation with auxiliary training objectives could combine the benefits of both strategies. We consider experimental evaluation of these pooling alternatives a high-priority direction for future work.
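How such a pooling comparison could be run is easy to sketch with synthetic hidden states. The toy model below differs at a single "negation" position; it shows the experimental setup, not a result, since whether max pooling actually helps remains the open question stated above:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pooling_comparison(n_tokens=20, dim=64, seed=7):
    """Compare mean vs. max pooling on a toy pair that differs only at
    one position (standing in for a negation token); all other
    positions are shared. Synthetic vectors, not real model states."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_tokens, dim))
    a, b = shared.copy(), shared.copy()
    a[3] = rng.normal(size=dim)  # e.g. the "not" token in sentence A
    b[3] = rng.normal(size=dim)  # replaced content in sentence B
    mean_sim = cos(a.mean(axis=0), b.mean(axis=0))
    max_sim = cos(a.max(axis=0), b.max(axis=0))
    return mean_sim, max_sim

mean_sim, max_sim = pooling_comparison()
print(f"mean pooling: {mean_sim:.3f}  max pooling: {max_sim:.3f}")
```

With real models, the same comparison would swap the pooling layer over actual token hidden states and measure adversarial-pair separation.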
9. Discussion and Conclusion
We have demonstrated that five widely-deployed production bi-encoder embedding models exhibit systematic, catastrophic failures in encoding fundamental semantic distinctions. Entity/role swaps produce similarity scores 13.1% higher than true paraphrases on average (cross-model mean 0.987 vs. 0.878 for paraphrases). At the standard retrieval threshold of 0.7, 100% of entity-swapped, temporally-inverted, negated, and numerically-altered pairs would be retrieved as identical across all models. Even at threshold 0.9, entity swaps maintain a 100% failure rate across all five models.
Our mechanistic analysis reveals the root cause: mean pooling acts as an approximate bag-of-words, averaging away the position-dependent information that transformer attention layers encode. For entity swaps, the two swapped tokens show per-position similarity of only 0.596–0.725 — proving the model does encode role information at the token level — but mean pooling dilutes this signal with N−2 unchanged positions, producing final similarities above 0.92. For negation, the single "not" token constitutes only 11–14% of the sequence and is similarly diluted. CLS pooling, used by three of our five models (BGE-large, mxbai-large, GTE-large), provides no meaningful improvement for entity-swap detection and only marginal improvement for negation.
Our cross-encoder analysis (Section 6) reveals that these failures are not inherent to transformer architectures but are specific to the bi-encoder paradigm and its pooling-based aggregation. Cross-encoders with appropriate training objectives (Quora-RoBERTa for duplicate detection, BGE-reranker for relevance reranking) achieve 0% failure rates on negation, numerical, temporal, and entity-swap categories — reducing bi-encoder failure rates from 100% to zero. However, the training objective is decisive: MS-MARCO cross-encoders, despite having identical cross-attention architecture, amplify bi-encoder failures by treating adversarial pairs as highly topically relevant. This demonstrates that the failure is not purely architectural (pooling) but also a function of training signal — models trained without hard negatives that probe semantic equivalence will not learn to distinguish semantically opposite sentences regardless of architecture.
The one failure mode that resists all approaches is hedging/certainty — even the best cross-encoder (Quora-RoBERTa) fails on 52% of hedging pairs. This suggests that epistemic certainty is a fundamentally different semantic dimension from the propositional content that embedding models capture, and may require dedicated detection mechanisms.
These findings have three key implications for the field:
For retrieval system design: Bi-encoder retrieval should be followed by cross-encoder reranking using models trained on semantic equivalence (not topical relevance). The common practice of using MS-MARCO-trained rerankers is counterproductive for semantic precision.
For model training: Including adversarial hard negatives — negated pairs, entity-swapped pairs, numerically-altered pairs — in contrastive training data would likely improve bi-encoder discrimination. Our test suite of 371 pairs provides a starting point for such training data augmentation. Alternative pooling strategies (max pooling, attention-weighted pooling) warrant experimental evaluation as they may preserve discriminative token-level signals that mean pooling erases.
For evaluation: Standard embedding benchmarks (STS-B, MTEB) do not adequately test for these failure modes. We encourage adoption of adversarial semantic probes alongside standard benchmarks to provide a more complete picture of model capabilities.
We release our complete test suite of 371 sentence pairs and encourage the community to use them as a supplementary evaluation, to include adversarial pairs as hard negatives during training, and to explore both pooling strategies and training objectives that better preserve compositional semantics.
References
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019.
- Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34–48.
- Kassner, N., & Schütze, H. (2020). Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly. Proceedings of ACL 2020.
- Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of EMNLP 2019.
- Xiao, S., Liu, Z., Zhang, P., & Muennighoff, N. (2023). C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv preprint arXiv:2309.07597.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# SKILL.md — Reproduction Instructions

## Environment Setup

```bash
# Create virtual environment
python3 -m venv .venv_old
source .venv_old/bin/activate

# Install PyTorch (CPU-only for reproducibility)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cpu

# Install sentence-transformers and dependencies
pip install sentence-transformers==3.0.1
pip install numpy pandas scipy scikit-learn einops

# Verify versions
python -c "import torch; print(torch.__version__)"  # Expected: 2.4.0+cpu
python -c "import sentence_transformers; print(sentence_transformers.__version__)"  # Expected: 3.0.1
```

## Models

The following models are automatically downloaded from HuggingFace on first use:

1. `sentence-transformers/all-MiniLM-L6-v2` (22M params, 384-dim, mean pooling)
2. `BAAI/bge-large-en-v1.5` (335M params, 1024-dim, CLS pooling)
3. `nomic-ai/nomic-embed-text-v1.5` (137M params, 768-dim, mean pooling, requires `trust_remote_code=True` and `einops`)
4. `mixedbread-ai/mxbai-embed-large-v1` (335M params, 1024-dim, CLS pooling)
5. `thenlper/gte-large` (335M params, 1024-dim, CLS pooling)

Total disk space for all models: ~5GB

## Running Experiments

```bash
# Ensure test_pairs.py is in the working directory
# Run the full experiment suite:
python run_v4_experiment.py

# If nomic fails (missing einops), install it and run separately:
pip install einops
python run_nomic_only.py

# Results are saved to v4_results/
```

## Output Structure

```
v4_results/
├── results_all-MiniLM-L6-v2.json
├── results_bge-large-en-v1.5.json
├── results_nomic-embed-text-v1.5.json
├── results_mxbai-embed-large-v1.json
├── results_gte-large.json
├── token_analysis.json
└── all_results.json
```

Each results JSON contains per-category statistics including:

- Mean, median, SD, min, max, IQR, quartiles
- Per-threshold failure rates (0.5, 0.6, 0.7, 0.8, 0.9)
- Per-pair similarity details
- Cohen's d effect sizes vs positive controls

## Test Pair Construction

All 371 test pairs are in `test_pairs.py`:

- 55 negation pairs (medical, legal, financial, product, safety domains)
- 56 numerical pairs (dosage, financial, time/distance, demographics, quantities)
- 45 entity/role swap pairs (acquisitions, interpersonal, comparisons, attribution)
- 35 temporal inversion pairs (medical, business, procedural, historical)
- 35 scope/quantifier pairs (all/some/none variations)
- 25 hedging/certainty pairs (definitive vs hedged claims)
- 35 positive controls (true paraphrases)
- 35 negative controls (unrelated pairs)
- 15 near-miss controls (minor detail differences)

All pairs are manually written — none are LLM-generated.

## Important Notes

- For Nomic-v1.5, prepend "search_query: " to all input sentences
- Results are deterministic on CPU (identical across runs)
- Token-level analysis uses MiniLM-L6 only (requires access to model internals)
- Expect ~10-15 minutes total runtime on a modern CPU for all 5 models

## Cross-Encoder Experiments

```bash
# Additional cross-encoder evaluation
pip install sentence-transformers==3.0.1

# Cross-encoder models tested:
# 1. cross-encoder/stsb-roberta-large (STS similarity)
# 2. cross-encoder/ms-marco-MiniLM-L-12-v2 (passage relevance)
# 3. BAAI/bge-reranker-large (relevance reranking)
# 4. cross-encoder/quora-roberta-large (duplicate detection)

# Run cross-encoder evaluation on same 371 pairs:
python run_crossencoder_experiment.py

# Results saved to crossencoder_results/
```

### Cross-Encoder Key Finding

Training objective > architecture: Quora-RoBERTa (duplicate detection) achieves 0% failure rates while MS-MARCO (relevance) amplifies failures. Same cross-attention architecture, opposite outcomes.