Do Cross-Encoders Fix What Cosine Similarity Breaks? A Systematic Evaluation of Cross-Encoder Robustness to Compositional Semantic Failures
1. Introduction
The dominant paradigm for computing semantic similarity in modern NLP systems relies on bi-encoder architectures: two sentences are independently encoded into fixed-dimensional vectors, and their similarity is estimated via cosine similarity (Reimers and Gurevych, 2019). This approach powers semantic search engines, retrieval-augmented generation pipelines, duplicate detection systems, and other production applications that depend on similarity judgments.
Recent work on embedding failure modes (clawRxiv #979, "When Cosine Similarity Lies: Systematic Failures of Embedding Models on Compositional Semantics") exposed systematic failures in this paradigm. When applied to sentence pairs involving negation, entity role swaps, or numerical changes, bi-encoder models consistently assign high cosine similarity scores — treating semantically opposite or critically different sentences as near-identical. These failures stem from the architectural constraint of independent encoding: mean pooling over token embeddings creates a representation that preserves lexical overlap but erases word order, dilutes negation tokens, and ignores compositional structure.
The standard remedy proposed in the literature is the cross-encoder architecture: instead of encoding sentences independently, both sentences are concatenated with a separator token and fed through a single transformer, allowing full cross-attention between all tokens (Devlin et al., 2019; Nogueira and Cho, 2019). This architectural difference theoretically enables cross-encoders to detect word-order changes, attend to negation tokens in context, and reason about compositional relationships.
However, a critical gap exists: no systematic evaluation has tested cross-encoders against the specific failure modes identified in bi-encoder systems. We address this with a controlled experiment comparing four cross-encoder models against five bi-encoder models on 336 hand-crafted sentence pairs spanning nine semantic categories. We report all raw model outputs alongside normalized scores and provide pair-level results for full reproducibility.
2. Background
2.1 Bi-Encoder Architecture
Bi-encoder models encode each sentence independently through a shared transformer backbone, producing a fixed-dimensional embedding via mean pooling over token representations. Semantic similarity is computed as cosine similarity between embeddings. This architecture enables efficient retrieval through precomputed embeddings and approximate nearest neighbor search, but cannot perform cross-sentence reasoning during encoding.
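The order-invariance of mean pooling can be demonstrated directly: permuting the token embeddings leaves the pooled sentence vector unchanged, so word order cannot affect cosine similarity. A minimal sketch with toy vectors (the embeddings are illustrative placeholders, not outputs of a real model):

```python
import numpy as np

# Toy token embeddings for a 3-token sentence (4 dimensions each).
tokens = np.array([
    [0.9, 0.1, 0.0, 0.3],   # e.g., "Google"
    [0.2, 0.8, 0.1, 0.0],   # e.g., "acquired"
    [0.1, 0.2, 0.9, 0.4],   # e.g., "YouTube"
])

def mean_pool(token_embs):
    """Sentence embedding = mean over token embeddings."""
    return token_embs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_ab = mean_pool(tokens)             # "Google acquired YouTube"
emb_ba = mean_pool(tokens[[2, 1, 0]])  # "YouTube acquired Google"

# Identical token set, reversed order: the pooled embeddings are identical,
# so cosine similarity is 1.0 (up to floating-point precision).
print(cosine(emb_ab, emb_ba))
```

This is exactly the mechanism behind the entity-swap results in Section 4: the pooled representation cannot distinguish subject from object.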
2.2 Cross-Encoder Architecture
Cross-encoders process both sentences jointly:
score(A, B) = f(Transformer([CLS] A [SEP] B [SEP]))
where f is typically a linear classification head applied to the [CLS] token representation. Every token in sentence A attends to every token in sentence B through the transformer's self-attention mechanism, enabling direct compositional comparison. The computational cost is substantially higher — O(n) forward passes for n candidates rather than O(1) with precomputed embeddings — making cross-encoders impractical for first-stage retrieval but standard for re-ranking (Nogueira and Cho, 2019).
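The cross-attention claim can be illustrated with a toy single-head attention over the concatenated token sequence. This sketch uses random vectors in place of real token representations and identity projections instead of trained weight matrices; it only shows the structural point that, after concatenation, every token of A places attention mass on every token of B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token representations: 3 tokens from sentence A, 3 from sentence B,
# concatenated as in [CLS] A [SEP] B [SEP] (special tokens omitted for brevity).
d = 8
tokens_a = rng.normal(size=(3, d))
tokens_b = rng.normal(size=(3, d))
x = np.vstack([tokens_a, tokens_b])  # shape (6, d)

def attention_weights(x):
    """Single attention head with identity projections (Q = K = V = x)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)  # row-wise softmax

w = attention_weights(x)

# Rows 0-2 are A tokens, columns 3-5 are B tokens: every A token attends
# to every B token with strictly positive weight under joint encoding.
a_to_b = w[:3, 3:]
print(a_to_b.min() > 0)
```

In a bi-encoder, the corresponding cross-sentence block of the attention matrix simply does not exist, which is the architectural root of the failures documented below.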
2.3 Training Objectives
Cross-encoder models differ fundamentally in training objective:
Semantic Textual Similarity (STS): Models trained on STS-B learn to predict human similarity ratings. The STS-B dataset uses a 0-5 annotation scale, but fine-tuned models may not use this full range on novel inputs.
Information Retrieval: Models trained on MS-MARCO learn query-document relevance via logit outputs. Relevance is not similarity: a document contradicting a query may still be highly relevant.
Duplicate Detection: Models trained on Quora Question Pairs learn to classify duplicate questions, outputting probability estimates in [0, 1].
Reranking: Production rerankers (e.g., BGE) are trained to score candidate passages, outputting relevance probabilities.
This diversity raises a key question: does the theoretical advantage of cross-attention manifest equally across training regimes?
3. Experimental Setup
3.1 Models
Five bi-encoder models:
- all-MiniLM-L6-v2 — compact 6-layer model trained via knowledge distillation
- BGE-large-en-v1.5 — large model from BAAI trained with negative mining
- nomic-embed-text-v1.5 — model from Nomic AI emphasizing long-context capability
- mxbai-embed-large-v1 — large model from Mixedbread AI
- GTE-large — general text embedding model from Thenlper/Alibaba
Four cross-encoder models:
- cross-encoder/stsb-roberta-large — RoBERTa-large fine-tuned on STS-B (raw output range observed: 0.01-0.97 on our data, despite the 0-5 training scale)
- cross-encoder/ms-marco-MiniLM-L-12-v2 — MiniLM fine-tuned on MS-MARCO (raw output: unbounded logits, observed range -11.3 to +10.6)
- BAAI/bge-reranker-large — production reranker (raw output: probabilities in [0, 1])
- cross-encoder/quora-roberta-large — RoBERTa-large fine-tuned on Quora duplicates (raw output: probabilities in [0, 1])
A fifth cross-encoder (nli-roberta-large) was planned but inaccessible due to repository authentication requirements.
3.2 Test Dataset
We use 336 hand-crafted sentence pairs from clawRxiv #979, organized into nine categories:
Adversarial categories (251 pairs):
- Negation (55 pairs): Sentence pairs differing only by negation
- Numerical (56 pairs): Order-of-magnitude numerical changes
- Entity swap (45 pairs): Subject-object role reversals
- Temporal (35 pairs): Before/after inversions
- Quantifier (35 pairs): Universal vs. existential scope changes
- Hedging (25 pairs): Certainty level modifications
Control categories (85 pairs):
- Positive control (35 pairs): True paraphrases
- Negative control (35 pairs): Completely unrelated pairs
- Near-miss control (15 pairs): Small factual differences
3.3 Scoring
For bi-encoders, we compute cosine similarity between independently encoded embeddings.
For cross-encoders, we report raw model outputs alongside normalized scores:
- STS-B: Raw output in practice ranged 0.01-0.97 on our sentence pairs. We normalize by dividing by 5 (the training scale) but emphasize that the raw range is compressed, likely because our short, simple sentences do not elicit the extreme ends of the STS-B scale.
- MS-MARCO: Raw logits ranged from -11.3 to +10.6. We apply sigmoid for [0,1] normalization, but note that values above 5 saturate near 1.0 (e.g., a logit of 8.0 maps to 0.9997).
- BGE-reranker and Quora: Output probabilities in [0,1]; used directly without transformation.
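The per-model normalization described above reduces to one small function. The family labels below are our own shorthand for the four models, not library identifiers; the scale choices mirror this section:

```python
import math

def normalize_score(raw, model_family):
    """Map a raw cross-encoder output to [0, 1], per model family."""
    if model_family == "stsb":            # trained on a 0-5 similarity scale
        return raw / 5.0
    if model_family == "msmarco":         # unbounded relevance logits
        return 1.0 / (1.0 + math.exp(-raw))  # sigmoid
    if model_family in ("bge", "quora"):  # already probabilities in [0, 1]
        return raw
    raise ValueError(f"unknown model family: {model_family}")

# Sigmoid saturation: logits above ~5 become nearly indistinguishable.
print(normalize_score(8.0, "msmarco"))   # ~0.9997
# STS-B compression: the max raw score we observed maps well below 0.5.
print(normalize_score(0.97, "stsb"))     # 0.194
```

The two printed values foreshadow the threshold analysis in Section 4.4: sigmoid saturation compresses MS-MARCO variance near 1.0, while division by 5 pins all STS-B scores below any reasonable 0.5 threshold.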
3.4 Software
All experiments used PyTorch 2.4.0 (CPU) and sentence-transformers 3.0.1.
4. Results
4.1 Bi-Encoder Failure Confirmation
All five bi-encoder models assign high similarity to adversarial pairs. The minimum cosine similarity observed across all bi-encoder models and all adversarial categories was 0.602 (all-MiniLM-L6-v2 on a hedging pair), confirming that every adversarial pair scored above 0.5 across every model.
Bi-encoder mean cosine similarity by category (averaged across all 5 models; per-category pair counts in parentheses):
| Category | Mean | SD | Min | Max |
|---|---|---|---|---|
| Negation (55) | 0.896 | 0.054 | 0.724 | 0.979 |
| Numerical (56) | 0.896 | 0.060 | 0.719 | 0.994 |
| Entity Swap (45) | 0.987 | 0.011 | 0.925 | 0.999 |
| Temporal (35) | 0.953 | 0.030 | 0.849 | 0.995 |
| Quantifier (35) | 0.855 | 0.080 | 0.662 | 0.976 |
| Hedging (25) | 0.831 | 0.089 | 0.602 | 0.973 |
| Positive Control (35) | 0.879 | 0.089 | 0.509 | 0.993 |
| Negative Control (35) | 0.377 | 0.236 | -0.149 | 0.754 |
Entity swap pairs (0.987) score higher than true paraphrases (0.879), illustrating the severity of the bag-of-words problem: identical tokens in different order produce near-identical embeddings.
Individual bi-encoder failure rates at threshold 0.8:
| Model | Negation | Numerical | Entity Swap | Temporal | Quantifier | Hedging |
|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 98% | 77% | 100% | 100% | 80% | 32% |
| BGE-large-en-v1.5 | 96% | 95% | 100% | 100% | 49% | 56% |
| nomic-embed-text-v1.5 | 100% | 99% | 100% | 100% | 94% | 76% |
| mxbai-embed-large-v1 | 84% | 86% | 100% | 100% | 46% | 56% |
| GTE-large | 100% | 100% | 100% | 100% | 100% | 100% |
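The failure rates above follow from a simple threshold count over per-pair scores; a minimal sketch (the scores below are illustrative placeholders, not our released data):

```python
def failure_rate(scores, threshold=0.8):
    """Fraction of adversarial pairs scoring at or above the threshold.

    For bi-encoders, a 'failure' is a high similarity score on a pair whose
    meaning differs critically, so the rate is the share of scores >= threshold.
    """
    if not scores:
        raise ValueError("no scores given")
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative negation-pair cosine similarities:
toy_scores = [0.92, 0.88, 0.79, 0.95, 0.85]
print(f"{failure_rate(toy_scores):.0%}")  # 80%
```

The same function, applied with threshold 0.5 to normalized cross-encoder scores, produces the comparison in Section 4.4.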
We note that BGE-large-en-v1.5, despite being trained with hard-negative mining, still assigns minimum cosine similarity of 0.771 to negation pairs and 0.669 to quantifier pairs — all well above meaningful detection thresholds.
4.2 Cross-Encoder Results: Raw Scores
We present raw (untransformed) model outputs to avoid normalization artifacts.
Quora-RoBERTa-Large (Duplicate Detection, raw probabilities 0-1):
| Category | Mean | SD | Min | Max |
|---|---|---|---|---|
| Negation | 0.020 | 0.029 | 0.007 | 0.169 |
| Numerical | 0.018 | 0.043 | 0.006 | 0.276 |
| Entity Swap | 0.037 | 0.053 | 0.007 | 0.190 |
| Temporal | 0.038 | 0.048 | 0.011 | 0.212 |
| Quantifier | 0.168 | 0.213 | 0.009 | 0.961 |
| Hedging | 0.514 | 0.429 | 0.007 | 0.961 |
| Positive Control | 0.894 | 0.188 | 0.010 | 0.962 |
| Negative Control | 0.005 | 0.00003 | 0.005 | 0.005 |
The negative control standard deviation (0.00003) appears negligibly small because the model assigns a near-identical "floor" probability to unrelated pairs. The 35 negative control pairs have raw scores ranging from 0.00523 to 0.00537 — genuinely clustered but not identical. The model's sigmoid output layer saturates near zero for clearly unrelated inputs, producing this tight clustering. This is not an artifact; it reflects the model's confident classification of unrelated pairs.
For adversarial categories, the model correctly assigns very low duplicate probabilities (0.02-0.04) to negation, entity swap, numerical, and temporal pairs. The separation from positive controls (0.894) is enormous. Quantifier pairs show more variance (0.009-0.961), suggesting that some quantifier changes (e.g., "all" vs. "some") are harder to detect than others (e.g., "all" vs. "none").
BGE-Reranker-Large (Production Reranker, raw probabilities 0-1):
| Category | Mean | SD | Min | Max |
|---|---|---|---|---|
| Negation | 0.073 | 0.082 | 0.001 | 0.326 |
| Numerical | 0.114 | 0.222 | 0.001 | 0.999 |
| Entity Swap | 0.398 | 0.298 | 0.014 | 0.999 |
| Temporal | 0.073 | 0.142 | 0.009 | 0.808 |
| Quantifier | 0.281 | 0.415 | 0.001 | 0.999 |
| Hedging | 0.883 | 0.225 | 0.173 | 1.000 |
| Positive Control | 0.996 | 0.010 | 0.945 | 1.000 |
| Negative Control | 0.0001 | 0.00001 | 0.0001 | 0.0001 |
The BGE reranker shows strong but not universal improvement. Negation (0.073) and temporal (0.073) pairs are well-separated from positive controls (0.996). Entity swaps (0.398) show substantial variance, with some pairs scoring up to 0.999 — suggesting that certain role reversals ("The teacher praised the student" / "The student praised the teacher") are seen as relevant pairs even by the reranker. Hedging (0.883) is barely distinguishable from positive controls.
The negative control scores of ~0.0001 with negligible variance again reflect sigmoid saturation: the model's internal logits for unrelated pairs are extremely negative, producing output probabilities clustered near the floating-point floor.
STS-B-RoBERTa-Large (Semantic Similarity, raw 0-5 scale):
| Category | Raw Mean | Raw SD | Raw Min | Raw Max |
|---|---|---|---|---|
| Negation | 0.491 | 0.041 | 0.415 | 0.566 |
| Numerical | 0.454 | 0.068 | 0.310 | 0.628 |
| Entity Swap | 0.837 | 0.189 | 0.343 | 0.972 |
| Temporal | 0.668 | 0.104 | 0.545 | 0.967 |
| Quantifier | 0.563 | 0.130 | 0.339 | 0.893 |
| Hedging | 0.652 | 0.175 | 0.336 | 0.951 |
| Positive Control | 0.889 | 0.100 | 0.611 | 0.971 |
| Negative Control | 0.010 | 0.001 | 0.009 | 0.013 |
An important observation: this model's raw outputs on our data range from 0.01 to 0.97, despite being trained on a 0-5 scale. The positive control mean of 0.889 (raw) is notably lower than the 5.0 that might be expected for true paraphrases. We hypothesize this reflects the nature of our test pairs: they are simple, short sentences (10-15 words) rather than the complex, longer text pairs in STS-B, and the model may calibrate differently for this input distribution. Additionally, the STS-B annotation guidelines define "5" as exact semantic equivalence, which many of our paraphrases do not achieve (e.g., "The cat sat on the mat" / "A feline rested on the rug" changes specific content words).
The key finding is that despite compressed scores, the model does separate adversarial from positive pairs: negation (0.491) and numerical (0.454) are substantially below positive controls (0.889). However, entity swaps (0.837) are close to positive controls, indicating weaker detection of role reversals.
MS-MARCO-MiniLM-L-12-v2 (Retrieval Relevance, raw logits):
| Category | Raw Logit Mean | Raw Logit SD | Raw Min | Raw Max |
|---|---|---|---|---|
| Negation | 8.210 | 0.690 | 6.223 | 9.242 |
| Numerical | 5.831 | 1.962 | -0.427 | 9.504 |
| Entity Swap | 8.999 | 0.674 | 7.440 | 10.567 |
| Temporal | 8.362 | 0.582 | 7.090 | 9.497 |
| Quantifier | 6.621 | 1.517 | 3.233 | 9.460 |
| Hedging | 2.384 | 4.931 | -6.906 | 9.032 |
| Positive Control | 4.051 | 3.122 | -5.091 | 8.336 |
| Negative Control | -11.142 | 0.123 | -11.303 | -10.767 |
This model reveals a critical finding: adversarial pairs receive higher relevance logits than positive controls. Negation pairs (logit 8.21) and entity swaps (logit 9.00) score far above true paraphrases (logit 4.05). After sigmoid, negation pairs map to 0.9994-0.9999 while paraphrases map to 0.006-0.9998 (mean 0.871). The standard deviation of 0.0003 after sigmoid for negation reflects the sigmoid's saturation: logits of 6.2-9.2 all map to probabilities above 0.998, compressing variance.
This is expected behavior for a retrieval model: "The patient does NOT have diabetes" is maximally relevant to a query about "The patient has diabetes" — it directly addresses the topic. The model correctly identifies topical relevance but cannot distinguish agreement from contradiction.
4.3 Aggregate Comparison
We compare the three task-appropriate cross-encoders (STS-B, BGE-reranker, Quora — excluding MS-MARCO retrieval) against bi-encoders. For aggregation, we normalize all scores to [0,1]: STS-B raw divided by 5, BGE and Quora used directly.
| Category | BI Mean (5 models) | CE Mean (3 models) | Mean Difference | Cohen's d | Mann-Whitney p |
|---|---|---|---|---|---|
| Negation | 0.896 | 0.064 | 0.832 | -14.8 | < 10^-69 |
| Numerical | 0.896 | 0.074 | 0.822 | -8.6 | < 10^-67 |
| Entity Swap | 0.987 | 0.201 | 0.786 | -5.6 | < 10^-55 |
| Temporal | 0.953 | 0.081 | 0.871 | -13.8 | < 10^-45 |
| Quantifier | 0.855 | 0.187 | 0.667 | -3.7 | < 10^-31 |
| Hedging | 0.831 | 0.509 | 0.322 | -1.2 | 0.005 |
Regarding the large effect sizes: The Cohen's d values (3.7-14.8) are extreme by social science standards but reflect a genuine phenomenon. Bi-encoder cosine similarities for adversarial pairs cluster tightly around 0.85-0.99 with small standard deviations (0.01-0.09), while cross-encoder scores for the same pairs cluster near 0.02-0.10. These are fundamentally different architectural behaviors on the same inputs — not noisy measurements of the same underlying variable. The distributions have essentially zero overlap for negation and temporal categories, which is precisely what produces extreme Cohen's d values. We provide all individual pair-level scores in our supplementary data for independent verification.
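The effect-size computation is straightforward to reproduce with SciPy. The two synthetic samples below mimic the near-zero-overlap situation described above (tight bi-encoder cluster near 0.9, tight cross-encoder cluster near 0.06); they are illustrative, not our released scores:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)
bi = rng.normal(0.90, 0.05, size=275)  # bi-encoder cosines for one category
ce = rng.normal(0.06, 0.05, size=165)  # cross-encoder scores, same pairs

d = cohens_d(ce, bi)  # negative: cross-encoder scores are far lower
u, p = mannwhitneyu(ce, bi, alternative="two-sided")
print(round(d, 1), p < 1e-30)
```

With tight clusters whose means sit roughly 17 pooled standard deviations apart, |d| well above 10 and vanishing p-values fall out mechanically, which is why the table's values, though extreme, are not anomalous.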
4.4 Failure Rate Comparison
At threshold 0.5, bi-encoders fail on 100% of adversarial pairs — the minimum bi-encoder similarity across all 5 models and all 251 adversarial pairs is 0.602 (an all-MiniLM-L6-v2 hedging pair).
Cross-encoder failure rates at threshold 0.5 (fraction scoring >= 0.5):
| Category | Quora | BGE-reranker | STS-B (norm) | MS-MARCO (sigmoid) |
|---|---|---|---|---|
| Negation | 0% | 0% | 0% | 100% |
| Numerical | 0% | 5% | 0% | 98% |
| Entity Swap | 0% | 33% | 0% | 100% |
| Temporal | 0% | 3% | 0% | 100% |
| Quantifier | 6% | 29% | 0% | 100% |
| Hedging | 52% | 92% | 0% | 72% |
| Positive Ctrl | 83% | 100% | 0% | 77% |
Note the STS-B column: 0% failure rate across ALL categories including positive controls. This does not indicate good performance — it reflects the compressed raw output range (max 0.97/5 = 0.194 normalized). The STS-B model's discrimination exists in the raw score differences (positives at 0.889 vs. negation at 0.491), not in the normalized threshold analysis.
4.5 Per-Category Analysis
Negation (55 pairs): The strongest improvement. Bi-encoder average: 0.896 cosine similarity. Quora cross-encoder: 0.020 duplicate probability. The cross-attention mechanism enables direct comparison of "has" vs. "does not have" across the sentence pair — a comparison impossible when sentences are encoded independently. All 55 pairs in the Quora model scored below 0.17.
Entity Swap (45 pairs): Bi-encoders score entity swaps at 0.987 — higher than paraphrases — because mean pooling is order-invariant: "Google acquired YouTube" and "YouTube acquired Google" contain identical tokens. Cross-encoders reduce this substantially but with model-dependent variance: Quora (0.037), STS-B raw (0.837), BGE (0.398). The Quora model excels because duplicate detection training explicitly penalizes pairs with the same words in different arrangements — exactly the entity-swap pattern.
Numerical (56 pairs): Bi-encoders average 0.896. Cross-encoders average 0.074. All three task-appropriate models consistently detect numerical differences, likely because number tokens receive focused attention when both sentences are processed jointly.
Temporal (35 pairs): Before/after inversions score 0.953 by bi-encoders, reduced to 0.081 by cross-encoders. The temporal markers "before" and "after" occupy the same position in otherwise identical sentences, and cross-attention directly detects this substitution.
Quantifier (35 pairs): Bi-encoders: 0.855. Cross-encoders: 0.187 average, with notable variance. Some quantifier pairs approach 1.0 even for cross-encoders (e.g., BGE scores "All servers are online" / "Some servers are online" at 0.999), suggesting that partial quantifier changes ("all" to "some") are harder to detect than complete reversals ("all" to "none").
Hedging (25 pairs): The persistent failure mode. Bi-encoders: 0.831. Cross-encoders: 0.509 average, with BGE at 0.883. Hedging changes involve distributed, multi-word modifications ("cures" to "may help with some symptoms") that alter certainty rather than polarity. From a reranking or similarity perspective, a hedged version of a claim IS genuinely related to the original claim — these are not contradictions but weakenings. This arguably correct behavior for reranking is wrong for strict duplicate detection.
5. Analysis
5.1 Architecture vs. Training Objective
Our results separate two factors:
- Architectural capacity (independent vs. joint encoding)
- Training objective (what the output score represents)
The MS-MARCO cross-encoder proves that Factor 1 alone is insufficient: despite full cross-attention, it scores adversarial pairs higher than paraphrases because it optimizes for topical relevance. The bi-encoder results prove that Factor 2 alone is insufficient: even models trained on similarity tasks (like STS-B bi-encoders) fail when limited to independent encoding. Success requires both cross-attention AND an objective that rewards distinguishing meaning-altering differences.
5.2 The Hedging Problem
Hedging creates a semantic gradient rather than a binary contrast. "The drug cures cancer" and "The drug may help with some cancer symptoms" are not contradictions — the hedged version is consistent with the definitive version being true. They share entities, domain, and propositional content. The difference is epistemic: certainty level. Current training datasets (STS-B, NLI, Quora) poorly represent these epistemic gradients, and even cross-attention cannot detect a pattern the model was never trained to distinguish.
5.3 Model-Specific Patterns
The Quora model achieves the best overall adversarial robustness. Duplicate detection training creates a conservative classifier: any modification to a sentence (negation, role swap, temporal change) is evidence of non-duplication. This makes it ideal for detecting compositional changes but potentially too conservative for applications where near-duplicates with minor variations should still match.
The BGE reranker occupies a middle ground: excellent on negation and temporal changes where the sentence pairs are clearly about different states of affairs, but more permissive on entity swaps where the same entities are involved (just in different roles) and hedging where the topic is the same.
6. Practical Recommendations
6.1 Hybrid Pipeline Design
The optimal architecture is a two-stage pipeline:
- First stage (bi-encoder): Retrieve candidates using cosine similarity. Fast, scalable, sufficient for topical filtering.
- Second stage (cross-encoder): Re-rank using a task-appropriate cross-encoder to catch compositional errors.
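The two-stage design can be sketched as a small function that takes both scorers as injectable callables, so any real bi-encoder/cross-encoder pair can be plugged in. The toy scorers below stand in for real models: the "bi-encoder" uses word overlap only (and so ignores negation), while the "cross-encoder" penalizes a negation mismatch:

```python
def two_stage_search(query, corpus, embed_score, rerank_score, k=10):
    """Stage 1: keep top-k by bi-encoder score. Stage 2: reorder by cross-encoder."""
    candidates = sorted(corpus, key=lambda doc: embed_score(query, doc),
                        reverse=True)[:k]
    return sorted(candidates, key=lambda doc: rerank_score(query, doc),
                  reverse=True)

def toy_embed_score(q, d):
    """Jaccard word overlap: order- and negation-blind, like mean pooling."""
    qs, ds = set(q.lower().split()), set(d.lower().split())
    return len(qs & ds) / len(qs | ds)

def toy_rerank_score(q, d):
    """Overlap score, heavily down-weighted when only one side is negated."""
    s = toy_embed_score(q, d)
    negated = ("not" in q.lower().split()) != ("not" in d.lower().split())
    return s * 0.05 if negated else s

corpus = [
    "the patient has diabetes",
    "the patient does not have diabetes",
    "quarterly revenue rose sharply",
]
ranked = two_stage_search("the patient has diabetes", corpus,
                          toy_embed_score, toy_rerank_score, k=2)
print(ranked[0])  # "the patient has diabetes"
```

Stage 1 retrieves both diabetes sentences (high lexical overlap); stage 2 demotes the negated one, which is exactly the division of labor the pipeline relies on.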
6.2 Model Selection
- For strict duplicate detection: use Quora-style cross-encoders
- For re-ranking: use BGE-style rerankers (but expect entity swap and hedging leakage)
- Avoid MS-MARCO-style models for similarity assessment
- For hedging sensitivity: no current model is reliable; consider dedicated uncertainty classifiers
6.3 Residual Risks
Even with task-appropriate cross-encoders, hedging changes remain undetected. Applications sensitive to certainty level (medical claims, legal assertions, financial advice) require additional mechanisms beyond current cross-encoder models.
7. Limitations
Model coverage: We tested four cross-encoder models. An NLI-trained cross-encoder, which explicitly models contradiction, could show different hedging performance and should be evaluated in future work (the model was inaccessible during our evaluation).
Test set scope: 336 English-only pairs with relatively simple compositional structures. Performance on longer, more complex sentences with nested negation or multiple entity swaps remains untested.
STS-B score compression: The STS-B cross-encoder produced raw outputs in a compressed range (0.01-0.97 on a 0-5 scale), complicating threshold-based comparisons. This may reflect distributional mismatch between our simple test pairs and the STS-B training distribution.
Normalization: Comparing models with different output scales requires normalization decisions that affect threshold-based metrics. We mitigate this by reporting raw scores throughout.
Scale: 336 pairs are sufficient for detecting large effects but may miss subtle interactions between failure modes.
Computational cost: We did not measure inference latency systematically. Cross-encoders are known to be 10-100x slower than bi-encoders for scoring.
8. Related Work
Reimers and Gurevych (2019) introduced Sentence-BERT, establishing the bi-encoder paradigm for efficient sentence similarity. Devlin et al. (2019) introduced BERT, the transformer architecture underlying all models in this study. Nogueira and Cho (2019) demonstrated cross-encoder effectiveness for passage re-ranking, establishing the two-stage pipeline our recommendations build upon. Humeau et al. (2020) proposed poly-encoders as a middle ground between architectures. Ettinger (2020) systematically evaluated what BERT representations capture, finding that negation and other linguistic phenomena challenge pre-trained transformers — a finding we extend to the downstream similarity task across both architectures.
9. Conclusion
We present the first systematic evaluation of cross-encoder robustness to the compositional semantic failures that plague bi-encoder embedding models. Testing four cross-encoder models against five bi-encoders on 336 hand-crafted pairs:
Task-appropriate cross-encoders dramatically reduce failure rates on negation (100% to 0%), entity swaps (100% to 0-33%), numerical changes (100% to 0-5%), and temporal inversions (100% to 0-3%).
Training objective determines effectiveness. A retrieval-trained cross-encoder (MS-MARCO) assigns higher relevance to adversarial pairs than to paraphrases. Cross-attention enables but does not guarantee compositional reasoning.
Hedging remains unsolved. Even the best cross-encoders show 48-92% failure rates on certainty changes — a persistent blind spot representing a semantic gradient rather than binary contrast.
Model selection matters as much as architecture choice. The question is not "bi-encoder or cross-encoder?" but "which cross-encoder, trained on what objective?"
All pair-level scores and raw model outputs are provided for independent verification and reproducibility.
References
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019.
Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48.
Humeau, S., Shuster, K., Lachaux, M.-A., and Weston, J. (2020). Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In Proceedings of ICLR 2020.
Nogueira, R. and Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of EMNLP-IJCNLP 2019.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# SKILL.md — Cross-Encoder vs Bi-Encoder Failure Mode Evaluation

## Overview

Evaluates whether cross-encoder models fix the compositional semantic failures in bi-encoder embedding models (negation, entity swaps, numerical changes, temporal inversions, quantifier changes, hedging).

## Environment Setup

```bash
python3 -m venv .venv_old
source .venv_old/bin/activate
pip install torch==2.4.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install sentence-transformers==3.0.1
pip install scipy numpy
```

## Verify Versions

```bash
python -c "import torch; print(torch.__version__)"                                  # 2.4.0+cpu
python -c "import sentence_transformers; print(sentence_transformers.__version__)"  # 3.0.1
```

## Reproducing Experiments

### Test Pairs (336 total)

Hand-crafted in test_pairs.py: 55 negation, 56 numerical, 45 entity swap, 35 temporal, 35 quantifier, 25 hedging, 35 positive control, 35 negative control, 15 near-miss.

### Bi-Encoder Experiment

```bash
python run_v4_experiment.py  # Tests 5 models, saves to v4_results/
```

### Cross-Encoder Experiment

```bash
python run_crossencoder_experiment.py  # Tests 4 models, saves to crossencoder/
```

Note: cross-encoder/nli-roberta-large requires HuggingFace authentication.

### Generate CSV

```bash
python generate_csv.py  # Produces all_pair_results.csv
```

## Expected Runtime

- Bi-encoder: ~15-20 min (CPU)
- Cross-encoder: ~12-15 min (CPU)

## Key Findings

- Quora and BGE cross-encoders reduce failure rates from 100% to 0-33% on most categories
- MS-MARCO cross-encoder performs worse than bi-encoders (rates adversarial pairs as relevant)
- Hedging remains challenging for all models (48-92% failure rate)
- Training objective matters more than architecture alone