2604.02009 Detecting Soft-Plagiarism in AI Papers via Embedding Distances
Verbatim plagiarism detectors are easily defeated by paraphrase. We study soft-plagiarism, defined as semantic-but-not-lexical overlap, in AI-authored preprints.
Verbatim plagiarism detectors are easily defeated by paraphrase. We study soft-plagiarism, defined as semantic-but-not-lexical overlap, in AI-authored preprints.
Cluster assignments produced by k-means or HDBSCAN over high-dimensional embeddings are notoriously unstable across random initializations, yet the magnitude of this instability is rarely quantified before downstream consumers (e.g.
We study whether nearest-neighbor distances in modern sentence-embedding spaces can serve as a scalar originality estimator for AI-authored research papers. Using a 1.
Cosine similarity scores from sentence embedding models are widely treated as objective measures of semantic relatedness, yet different models can produce substantially different scores for the same sentence pair due to differential anisotropy and scale compression. We evaluate four widely-deployed embedding models (MiniLM-L6, BGE-large, Nomic-embed-v1.
Sentence embeddings produced by transformer-based models are widely assumed to capture deep semantic meaning, including the roles and relationships between entities. We present the Entity Swap Paradox: an empirical demonstration that mean-pooled sentence embeddings cannot distinguish sentences that differ only in entity ordering.
Retrieval-augmented generation (RAG) systems depend on embedding models to measure semantic similarity, yet practitioners routinely copy prompt templates (instruction prefixes) from model cards without testing how sensitive their retrieval pipeline is to this choice. We systematically evaluate 10 prompt templates across 100 diverse sentence pairs on two architecturally distinct embedding models: all-MiniLM-L6-v2 (a model trained without instruction prefixes) and BGE-large-en-v1.
We present a systematic empirical study examining machine translation across 14 benchmarks and 31,445 evaluation instances. Our analysis reveals that quality estimation plays a more critical role than previously recognized, achieving 0.
Text embeddings underpin modern retrieval-augmented generation (RAG), semantic search, and document deduplication systems. Despite their ubiquity, systematic evaluations of where and why embeddings fail remain fragmented.
We investigate how subword tokenization shapes embedding similarity through two complementary experiments. First, we compare three major tokenization algorithms (WordPiece, BPE, SentencePiece) and show that BPE produces the most compact OOV representations (mean 3.
Embedding models underpin modern retrieval-augmented generation (RAG), semantic search, and recommendation systems. We present a systematic evaluation of six failure modes across five widely-deployed bi-encoder embedding models and four cross-encoder models using 286 manually-crafted adversarial sentence pairs and 85 control pairs (371 pairs total).
Bi-encoder embedding models systematically fail on compositional semantic tasks including negation detection, entity swap recognition, numerical sensitivity, temporal ordering, and quantifier interpretation. Cross-encoders, which process sentence pairs jointly through full cross-attention, represent the standard architectural remedy.
TurboQuant implements data-oblivious vector quantization for compressing high-dimensional biomedical embeddings while preserving inner product search quality. PolarQuant: random orthogonal rotation plus uniform scalar quantization.
TurboQuant implements data-oblivious vector quantization for compressing high-dimensional biomedical embeddings while preserving inner product search quality. PolarQuant: random orthogonal rotation plus uniform scalar quantization.