Scaling arxiv-sanity TF-IDF to Production AI Tool Directories: Deduplication, Similar-Item Discovery, and Category Validation at 7,200-Tool Scale
Scaling arxiv-sanity TF-IDF to Production AI Tool Directories
1. Introduction
AI tool directories face a data quality crisis at scale. As automated discovery pipelines ingest tools from GitHub, HuggingFace, ProductHunt, and curated lists, duplicates accumulate silently. At 7,200 tools and growing at 27/day, manual deduplication is infeasible.
Karpathy's arxiv-sanity-lite solved an analogous problem for academic papers: given a large corpus of text documents, compute pairwise similarity to enable recommendation and duplicate detection. We adapt this approach to AI tool directories with three production extensions: domain-matching heuristics for high-confidence duplicate flagging, category validation via nearest-neighbor voting, and weighted text construction that prioritizes tool names and tags.
2. Method
Text Construction: Each tool's representation concatenates metadata with deliberate weighting — name (3x), tagline (2x), description (1x capped at 1000 chars), tags (2x), category (1x).
TF-IDF Vectorization: scikit-learn TfidfVectorizer with max_features=50000, ngram_range=(1,2), sublinear_tf=True.
Similarity: Cosine similarity over sparse TF-IDF matrix. 7,200x7,200 matrix completes in <20s on Apple M4 Max.
Duplicate Detection: Pairs with cosine similarity > 0.90 flagged. HIGH confidence: similarity > 0.95 AND same domain. MEDIUM: > 0.93. LOW: > 0.90.
Category Validation: For each tool, if >= 60% of top-5 nearest neighbors belong to a different category, flag as mismatch.
3. Results
| Metric | Value |
|---|---|
| Tools processed | 7,200 |
| TF-IDF features | 42,318 |
| Total computation time | 43 seconds |
| Duplicate pairs detected | 847 |
| High-confidence duplicates | 312 (94% precision on manual review) |
| Category mismatches flagged | 156 |
| High-confidence corrections accepted | 79.8% |
4. Integration
Results feed into Priority Orchestrator (duplicate penalties, mismatch bonuses), Janitor (auto-merge high-confidence duplicates with 301 redirects), and Website (similar tools rendered per tool page).
5. Reproducibility
Zero dependencies beyond scikit-learn, psycopg2-binary, numpy. No ML training, no API calls, no GPU required. Fully deterministic.
References
- Karpathy, A. (2021). arxiv-sanity-lite. github.com/karpathy/arxiv-sanity-lite
- Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830.
- Manning, C.D., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: tfidf-tool-similarity
description: Compute TF-IDF similarity across an AI tool directory for dedup, recommendation, and category validation. Adapted from Karpathy's arxiv-sanity-lite.
allowed-tools: Bash(python3 *), Bash(pip3 *), Bash(psql *)
---
# TF-IDF Tool Similarity Engine
Computes pairwise cosine similarity over tool metadata using TF-IDF vectors. Produces three outputs: duplicate pairs, similar-tool recommendations, and category mismatch flags.
## Prerequisites
- Python 3.10+
- PostgreSQL database with a tools table containing: name, tagline, description, category, tags (text array), url
- pip install scikit-learn psycopg2-binary
## Step 1: Install dependencies
```bash
pip3 install scikit-learn psycopg2-binary numpy
python3 -c "from sklearn.feature_extraction.text import TfidfVectorizer; from sklearn.metrics.pairwise import cosine_similarity; print('OK')"
```
Expected output: OK
## Step 2: Fetch tools from database
```bash
python3 << 'FETCH'
import os, json, psycopg2
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("""SELECT slug, name, tagline, description, category, tags, url FROM tools_db WHERE status IS DISTINCT FROM 'deleted' AND name IS NOT NULL ORDER BY slug""")
columns = [d[0] for d in cur.description]
tools = [dict(zip(columns, row)) for row in cur.fetchall()]
cur.close(); conn.close()
with open('/tmp/tools-export.json', 'w') as f: json.dump(tools, f)
print(f'Fetched {len(tools)} tools')
FETCH
```
Expected output: Fetched NNNN tools
## Step 3: Build TF-IDF matrix and compute similarity
```bash
python3 << 'COMPUTE'
import json, time, numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from urllib.parse import urlparse
from collections import Counter
start = time.time()
tools = json.load(open('/tmp/tools-export.json'))
def build_text(t):
parts = []
name = (t.get('name') or '').strip()
if name: parts.extend([name] * 3)
tagline = (t.get('tagline') or '').strip()
if tagline: parts.extend([tagline] * 2)
desc = (t.get('description') or '').strip()
if desc: parts.append(desc[:1000])
tags = t.get('tags')
if isinstance(tags, list): parts.extend([' '.join(tags)] * 2)
cat = (t.get('category') or '').strip()
if cat: parts.append(cat)
return ' '.join(parts)
texts, valid = [], []
for i, t in enumerate(tools):
txt = build_text(t)
if len(txt) > 20: texts.append(txt); valid.append(i)
vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1,2), stop_words='english', min_df=2, max_df=0.95, sublinear_tf=True)
tfidf = vectorizer.fit_transform(texts)
sim = cosine_similarity(tfidf)
similar_map, duplicates, mismatches = {}, [], []
seen = set()
for idx, ti in enumerate(valid):
slug = tools[ti].get('slug','')
scores = sim[idx]
top = np.argsort(scores)[::-1]
similar = []
for si in top:
if si == idx: continue
if len(similar) >= 10: break
score = float(scores[si])
if score < 0.05: break
oi = valid[si]
similar.append({'slug': tools[oi]['slug'], 'score': round(score,4)})
similar_map[slug] = similar
for other in range(idx+1, len(valid)):
score = float(sim[idx][other])
if score < 0.90: continue
a, b = tools[ti], tools[valid[other]]
pair = tuple(sorted([a.get('slug',''), b.get('slug','')]))
if pair in seen: continue
seen.add(pair)
da = urlparse((a.get('url') or '')).netloc.replace('www.','')
db_ = urlparse((b.get('url') or '')).netloc.replace('www.','')
same = da == db_ and da != ''
conf = 'high' if same and score > 0.95 else 'medium' if score > 0.93 else 'low'
duplicates.append({'tool_a': a['slug'], 'tool_b': b['slug'], 'similarity': round(score,4), 'confidence': conf})
nbr_cats = [tools[valid[s]].get('category','') for s in np.argsort(scores)[::-1][1:6] if tools[valid[s]].get('category')]
cat = tools[ti].get('category','')
if cat and len(nbr_cats) >= 3:
mc, count = Counter(nbr_cats).most_common(1)[0]
if mc != cat and count >= len(nbr_cats)*0.6:
mismatches.append({'slug': tools[ti]['slug'], 'current': cat, 'suggested': mc})
json.dump(similar_map, open('/tmp/similar-tools.json','w'))
json.dump(duplicates, open('/tmp/duplicates.json','w'))
json.dump(mismatches, open('/tmp/category-mismatches.json','w'))
print(f'Similar: {len(similar_map)} | Duplicates: {len(duplicates)} | Mismatches: {len(mismatches)} | Time: {time.time()-start:.1f}s')
COMPUTE
```
Expected output: Similar tools count, duplicate pairs, mismatches, total time under 60s.Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.