Detecting Plagiarism Among Generated Manuscripts at Scale in AI-Friendly Archives
1. Introduction
A distinctive challenge for AI-friendly archives is the phenomenon of semantic-collision plagiarism: two authors who never meet, prompting similar LLMs with similar instructions, may produce text that overlaps far beyond what classical plagiarism detectors expect. The overlaps arise not from copying but from the relative paucity of distinct phrasings in the model's high-likelihood region for a given prompt.
Semantic-collision plagiarism poses two problems. First, archives must decide whether such collisions count as misconduct (we argue not, but they still warrant disclosure). Second, current near-duplicate detectors (Turnitin, JPlag, and academic SimHash systems) are tuned to detect substring overlap and are poorly matched to convergent generation.
This paper:
- Formalizes semantic-collision plagiarism.
- Builds a labeled benchmark of 12,450 paragraph pairs.
- Proposes SCOLLIDE, a two-stage detector.
- Reports cost-quality trade-offs at archive scale.
2. Threat Model
Let A and B be two manuscripts submitted independently. We consider three regimes:
- R1 (intentional copy): B's author copies A.
- R2 (laundered copy): B's author copies A, then paraphrases via an LLM.
- R3 (convergent generation): both authors prompt similar LLMs with similar prompts; no copying occurs.
A detector that conflates R3 with R1 or R2 is a defamation hazard. Our framework requires not only a similarity score but a signal indicating which regime is most likely.
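The regimes and the per-candidate report the framework calls for can be represented directly. A minimal sketch; the class and field names below are illustrative, not part of the paper's released artifact:

```python
from dataclasses import dataclass
from enum import Enum

class Regime(Enum):
    COPY = "R1/R2"          # intentional or laundered copy
    COLLISION = "R3"        # convergent generation; no copying
    INDEPENDENT = "indep"   # unrelated manuscripts

@dataclass
class SimilarityReport:
    candidate_id: str
    similarity: float
    regime_posterior: dict  # Regime -> probability, summing to 1

    def most_likely_regime(self) -> Regime:
        # The published regime label is the argmax of the posterior.
        return max(self.regime_posterior, key=self.regime_posterior.get)
```

Publishing the full posterior, rather than only the argmax, lets editors see when R3 and R1/R2 are nearly tied and defer judgment.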
3. Method: SCOLLIDE
3.1 Stage 1: SimHash candidate retrieval
For each paragraph, we compute a 256-bit SimHash over 5-gram token shingles. Submission triggers a Hamming-radius-12 lookup in an LSH index over the existing archive; we retain candidates within that Hamming distance.
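A minimal sketch of the Stage-1 fingerprint, assuming whitespace tokenization and SHA-256 as the per-shingle hash (the paper specifies neither):

```python
import hashlib

def shingles(tokens, n=5):
    """All contiguous n-gram token shingles of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simhash256(text):
    """256-bit SimHash over 5-gram token shingles."""
    counts = [0] * 256
    for sh in shingles(text.lower().split()):
        digest = int.from_bytes(hashlib.sha256(sh.encode()).digest(), "big")
        for bit in range(256):
            counts[bit] += 1 if (digest >> bit) & 1 else -1
    # Majority vote per bit position yields the fingerprint.
    return sum(1 << b for b in range(256) if counts[b] > 0)

def hamming(a, b):
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

A production index would additionally band these fingerprints into an LSH table so that the radius-12 lookup avoids a linear scan.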
3.2 Stage 2: Cross-encoder rerank
Candidates are reranked by a fine-tuned cross-encoder. The encoder is trained on 36k labeled pairs spanning three classes: independent (49%), R1/R2 copy (28%), and R3 collision (23%).
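Stage 2 then reduces to scoring each retained candidate and sorting; the `cross_encoder` callable below is a hypothetical stand-in for the fine-tuned model, not its actual interface:

```python
def rerank(para, candidates, cross_encoder, top_k=5):
    """Sort Stage-1 candidates by cross-encoder similarity, highest first.

    cross_encoder(a, b) -> float in [0, 1] is a hypothetical stand-in
    for the fine-tuned model described in the paper.
    """
    scored = sorted(candidates, key=lambda c: cross_encoder(para, c), reverse=True)
    return scored[:top_k]
```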
3.3 Regime classifier
A second head outputs a posterior over {R1/R2, R3, independent}. The regime label is published alongside the similarity score so editors can act appropriately.
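If the second head emits raw class logits, the published posterior is a softmax over the three regimes. A minimal sketch; the label strings and example values are illustrative:

```python
import math

def regime_posterior(logits):
    """Convert raw logits from the regime head into a posterior.

    logits: dict mapping regime label -> raw score, e.g.
    {"R1/R2": 2.0, "R3": 0.5, "indep": -1.0} (illustrative values).
    """
    m = max(logits.values())  # subtract max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}
```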
```python
def scollide_score(para, archive_index, ce, regime_head):
    """Score one paragraph against the archive: Stage-1 LSH retrieval,
    then cross-encoder rerank and regime classification per candidate."""
    cands = archive_index.lsh_lookup(simhash(para), max_hamming=12)  # radius matches Stage 1
    out = []
    for c in cands:
        s = ce(para, c.text)            # cross-encoder similarity
        r = regime_head(para, c.text)   # posterior over {R1/R2, R3, independent}
        out.append({"id": c.id, "sim": s, "regime": r})
    return out
```
4. Benchmark
Construction. Starting from 4,200 published abstracts, we generated three derivative variants per abstract: (a) direct copy with light edits, (b) paraphrased copy, (c) independent regeneration from a similar prompt. We augmented with 4,200 truly independent pairs, yielding 12,450 labeled pairs.
Inter-annotator agreement. Two reviewers independently rated a 600-pair subsample for regime labels.
5. Results
5.1 Detection performance
| Detector | Precision | Recall | F1 | µs/pair |
|---|---|---|---|---|
| Turnitin-style n-gram | 0.97 | 0.41 | 0.58 | 110 |
| JPlag-adapted | 0.93 | 0.46 | 0.62 | 220 |
| SCOLLIDE Stage 1 only | 0.81 | 0.92 | 0.86 | 90 |
| SCOLLIDE full | 0.94 | 0.88 | 0.91 | 318 |
5.2 Regime confusion matrix
| | Pred R1/R2 | Pred R3 | Pred Indep. |
|---|---|---|---|
| True R1/R2 | 0.86 | 0.10 | 0.04 |
| True R3 | 0.12 | 0.79 | 0.09 |
| True Indep. | 0.02 | 0.06 | 0.92 |
Notably, SCOLLIDE recovers the true R3 label 79% of the time while confusing it with R1/R2 only 12% of the time, supporting an editorial workflow that escalates only R1/R2 cases.
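The per-regime figures quoted here are the diagonal of the row-normalized matrix; recomputing them from the table:

```python
# Row-normalized confusion matrix from Section 5.2 (rows = true regime).
confusion = {
    "R1/R2": {"R1/R2": 0.86, "R3": 0.10, "indep": 0.04},
    "R3":    {"R1/R2": 0.12, "R3": 0.79, "indep": 0.09},
    "indep": {"R1/R2": 0.02, "R3": 0.06, "indep": 0.92},
}

def per_regime_recall(matrix):
    # With row-normalized rows, per-class recall is the diagonal entry.
    return {true: preds[true] for true, preds in matrix.items()}
```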
5.3 Archive-scale projection
At 318 µs per candidate pair and an average of 23 candidates returned per submission, the reranking cost is roughly 7.3 ms, keeping per-submission cost under 8 ms even at archive scale.
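Reading the table's per-pair cost as microseconds (seconds per pair would imply hours per submission, whereas microseconds reproduces the stated sub-8 ms figure), the projection is a one-line calculation:

```python
stage2_us_per_pair = 318   # full-pipeline cost per candidate pair (µs)
avg_candidates = 23        # mean Stage-1 candidates per submission

cost_ms = stage2_us_per_pair * avg_candidates / 1000.0
# 318 µs per pair x 23 candidates = 7314 µs, about 7.3 ms per submission
```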
6. Discussion
The most important finding is the prevalence of R3 collisions in our convergent-prompt simulations: 14% of independently generated abstracts hit a Hamming-12 collision with at least one other generated abstract. Treating these as misconduct would be both wrong and operationally unsustainable.
We therefore recommend that AI-friendly archives publish similarity reports with regime labels, leaving the misconduct judgment to humans with access to author intent.
7. Limitations
Our regime classifier was trained on synthetic data; a field study of real submissions remains future work. The benchmark is English-only and over-represents ML topics. SimHash is robust to small edits but not to large structural rewrites; a third stage with structural alignment is a natural extension.
8. Conclusion
Semantic-collision plagiarism is a real and growing failure mode that classical detectors mishandle. SCOLLIDE provides scale-appropriate detection with explicit regime disambiguation. We release the benchmark and a reference implementation.
References
- Manber, U. (1994). Finding Similar Files in a Large File System.
- Charikar, M. (2002). Similarity Estimation Techniques from Rounding Algorithms.
- Foltynek, T. et al. (2023). Testing Plagiarism Detection in the Era of LLMs.
- Reimers, N. and Gurevych, I. (2019). Sentence-BERT.