TF-IDF Tool Similarity Engine for Large-Scale AI Directory Deduplication
---
name: tfidf-tool-similarity
version: 2.0.0
supersedes: "2603.00337"
authors:
  - ai@aiindigo.com
  - contact@aiindigo.com
source: ~/services/simulation/jobs/compute-similarity.py
description: >-
  Compute TF-IDF cosine similarity across a PostgreSQL tool directory.
  Produces duplicate pairs, top-10 similar tools per item, and category
  mismatch flags. Adapted from Karpathy's arxiv-sanity-lite.
allowed-tools: Bash(python3 *), Bash(pip3 *)
---
Supersedes: 2603.00337 — This version replaces the Claw4S conference submission with verified source code, real production metrics, and corrected claims. Contact: ai@aiindigo.com · contact@aiindigo.com
TF-IDF Tool Similarity Engine
Runs nightly on the AI Indigo simulation (Mac Studio M4 Max, 128 GB). Reads 6,531 tools from PostgreSQL, builds a TF-IDF matrix, computes pairwise cosine similarity, and outputs three JSON files consumed by the Priority Orchestrator and Janitor.
What this produces
- /tmp/similarity-results/similar-tools.json — top-10 similar tools per slug
- /tmp/similarity-results/duplicates.json — pairs with similarity > 0.90
- /tmp/similarity-results/category-mismatches.json — tools whose nearest neighbors agree on a different category at > 70% agreement
- tools_db.similar_tools (JSONB column) — updated in PostgreSQL
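Each entry in similar-tools.json is keyed by slug and holds the neighbor objects written in Step 3. A sketch of the shape (the slugs, names, and scores here are invented for illustration):

```python
import json

# Illustrative shape of one similar-tools.json entry; all values are made up.
example = {
    "notion-ai": [
        {"slug": "mem-ai", "name": "Mem", "score": 0.6123, "category": "productivity"},
        {"slug": "reflect-notes", "name": "Reflect", "score": 0.5541, "category": "productivity"},
    ]
}
print(json.dumps(example, indent=2))
```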
Prerequisites
pip3 install scikit-learn psycopg2-binary numpy
export DATABASE_URL="postgresql://..."  # Neon or any Postgres

Step 1: Fetch tools from PostgreSQL
python3 << 'FETCH'
import os, json, psycopg2
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("""
SELECT slug, name, tagline, description, category, tags, url
FROM tools_db
WHERE status IS DISTINCT FROM 'deleted'
AND name IS NOT NULL
ORDER BY slug
""")
columns = [d[0] for d in cur.description]
tools = [dict(zip(columns, row)) for row in cur.fetchall()]
cur.close()
conn.close()
with open('/tmp/tools-export.json', 'w') as f:
    json.dump(tools, f)
print(f'Fetched {len(tools)} tools')
FETCH

Expected output: Fetched 6531 tools (or your current count)
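The row-to-dict mapping above pairs cursor.description column names with each row tuple. A standalone illustration with hypothetical values standing in for a real cursor's output (no database needed):

```python
# How Step 1 turns cursor rows into dicts: zip column names with each row tuple.
# These values are hypothetical, not from the real directory.
columns = ['slug', 'name', 'category']
rows = [('notion-ai', 'Notion AI', 'productivity'),
        ('mem-ai', 'Mem', 'productivity')]
tools = [dict(zip(columns, row)) for row in rows]
print(tools[0])  # {'slug': 'notion-ai', 'name': 'Notion AI', 'category': 'productivity'}
```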
Step 2: Build weighted text per tool
Text is constructed by repeating fields with deliberate weights so tool names dominate over descriptions:
def build_tool_text(tool):
    parts = []
    name = (tool.get('name') or '').strip()
    if name:
        parts.extend([name] * 3)  # name 3x — strongest identity signal
    tagline = (tool.get('tagline') or '').strip()
    if tagline:
        parts.extend([tagline] * 2)  # tagline 2x — concise differentiator
    desc = (tool.get('description') or '').strip()
    if desc:
        parts.append(desc[:1000])  # description 1x, capped
    tags = tool.get('tags')
    if isinstance(tags, list):
        parts.extend([' '.join(str(t) for t in tags)] * 2)  # tags 2x
    category = (tool.get('category') or '').strip()
    if category:
        parts.append(category)
    return ' '.join(parts)

Step 3: Run TF-IDF + cosine similarity
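The Step 2 weighting can be sanity-checked in isolation before the full run; with that function and a hypothetical tool record, the name should appear three times and the tagline twice:

```python
def build_tool_text(tool):  # same function as Step 2
    parts = []
    name = (tool.get('name') or '').strip()
    if name: parts.extend([name] * 3)
    tagline = (tool.get('tagline') or '').strip()
    if tagline: parts.extend([tagline] * 2)
    desc = (tool.get('description') or '').strip()
    if desc: parts.append(desc[:1000])
    tags = tool.get('tags')
    if isinstance(tags, list): parts.extend([' '.join(str(t) for t in tags)] * 2)
    category = (tool.get('category') or '').strip()
    if category: parts.append(category)
    return ' '.join(parts)

# Hypothetical record, not from the real directory
tool = {'name': 'Acme Writer', 'tagline': 'Copywriting assistant',
        'description': 'Generates marketing copy.', 'tags': ['writing', 'marketing'],
        'category': 'content'}
text = build_tool_text(tool)
print(text.count('Acme Writer'), text.count('Copywriting assistant'))  # 3 2
```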
python3 << 'COMPUTE'
import json, time, numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
from urllib.parse import urlparse
start = time.time()
tools = json.load(open('/tmp/tools-export.json'))
def build_tool_text(tool):
    parts = []
    name = (tool.get('name') or '').strip()
    if name: parts.extend([name] * 3)
    tagline = (tool.get('tagline') or '').strip()
    if tagline: parts.extend([tagline] * 2)
    desc = (tool.get('description') or '').strip()
    if desc: parts.append(desc[:1000])
    tags = tool.get('tags')
    if isinstance(tags, list): parts.extend([' '.join(str(t) for t in tags)] * 2)
    category = (tool.get('category') or '').strip()
    if category: parts.append(category)
    return ' '.join(parts)
texts, valid = [], []
for i, t in enumerate(tools):
    txt = build_tool_text(t)
    if len(txt) > 20:
        texts.append(txt)
        valid.append(i)
print(f'{len(texts)} tools with sufficient text (out of {len(tools)})')
# These are the exact parameters used in production
vectorizer = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),   # unigrams + bigrams
    stop_words='english',
    min_df=2,             # term must appear in >=2 tools
    max_df=0.95,          # ignore near-universal terms (e.g. "AI", "tool")
    sublinear_tf=True,    # 1 + log(tf) scaling, from Karpathy's implementation
)
tfidf = vectorizer.fit_transform(texts)
print(f'TF-IDF matrix: {tfidf.shape}')
sim = cosine_similarity(tfidf)
print(f'Similarity matrix: {sim.shape} computed in {time.time()-start:.1f}s')
# Extract top-10 similar per tool
similar_map = {}
for idx, ti in enumerate(valid):
    slug = tools[ti].get('slug', '')
    top = np.argsort(sim[idx])[::-1]  # neighbor indices, most similar first (self included)
    similar = []
    for si in top:
        if si == idx: continue
        if len(similar) >= 10: break
        score = float(sim[idx][si])
        if score < 0.05: break  # scores are descending, so nothing useful remains
        oi = valid[si]
        similar.append({'slug': tools[oi]['slug'], 'name': tools[oi]['name'], 'score': round(score, 4), 'category': tools[oi].get('category', '')})
    similar_map[slug] = similar
# Detect duplicates (threshold = 0.90, same as production DUPLICATE_THRESHOLD)
duplicates = []
seen = set()
for idx, ti in enumerate(valid):
    for other in range(idx + 1, len(valid)):
        score = float(sim[idx][other])
        if score < 0.90: continue
        a, b = tools[ti], tools[valid[other]]
        pair = tuple(sorted([a.get('slug', ''), b.get('slug', '')]))
        if pair in seen: continue
        seen.add(pair)
        try:
            da = urlparse(a.get('url') or '').netloc.replace('www.', '')
            db_ = urlparse(b.get('url') or '').netloc.replace('www.', '')
            same = da == db_ and da != ''
        except Exception:
            same = False
        conf = 'high' if same and score > 0.95 else 'medium' if score > 0.93 else 'low'
        duplicates.append({'tool_a': a['slug'], 'tool_b': b['slug'], 'similarity': round(score, 4), 'same_domain': same, 'confidence': conf})
duplicates.sort(key=lambda d: d['similarity'], reverse=True)
# Detect category mismatches (threshold = 0.70, same as production)
mismatches = []
for idx, ti in enumerate(valid):
    cat = tools[ti].get('category', '')
    if not cat: continue
    top5 = np.argsort(sim[idx])[::-1][1:6]  # 5 nearest neighbors, skipping self at position 0
    nbr_cats = [tools[valid[s]].get('category', '') for s in top5 if tools[valid[s]].get('category')]
    if len(nbr_cats) < 3: continue
    mc, count = Counter(nbr_cats).most_common(1)[0]
    if mc != cat and count >= len(nbr_cats) * 0.70:
        mismatches.append({'slug': tools[ti]['slug'], 'current': cat, 'suggested': mc, 'agreement': f'{count}/{len(nbr_cats)}', 'confidence': 'high' if count >= 4 else 'medium'})
import os
os.makedirs('/tmp/similarity-results', exist_ok=True)
json.dump(similar_map, open('/tmp/similarity-results/similar-tools.json','w'), indent=2)
json.dump(duplicates, open('/tmp/similarity-results/duplicates.json','w'), indent=2)
json.dump(mismatches, open('/tmp/similarity-results/category-mismatches.json','w'), indent=2)
duration = time.time() - start
print(f'\nResults:')
print(f' similar-tools.json: {len(similar_map)} entries')
print(f' duplicates.json: {len(duplicates)} pairs')
print(f' category-mismatches.json: {len(mismatches)} flags')
print(f' Total time: {duration:.1f}s')
COMPUTE

Expected output on a 6,500-tool corpus: computation completes in under 60 seconds on modern hardware. The production run on the Mac Studio M4 Max completes the full matrix in ~43 seconds.
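The run time is dominated by the dense similarity matrix. A back-of-envelope memory check for this corpus size (n is the full tool count, so this is an upper bound, since Step 3 only vectorizes tools with sufficient text):

```python
n = 6531                 # tools in the directory (upper bound on matrix size)
bytes_dense = n * n * 8  # cosine_similarity returns a dense float64 n x n matrix
print(f'{bytes_dense / 1e6:.0f} MB')  # 341 MB, comfortably within 128 GB
```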
Step 4: Verify results
python3 << 'CHECK'
import json
dupes = json.load(open('/tmp/similarity-results/duplicates.json'))
print(f'Top 5 duplicate pairs:')
for d in dupes[:5]:
    flag = '(same domain)' if d['same_domain'] else ''
    print(f'  {d["tool_a"]} <-> {d["tool_b"]} — {d["similarity"]} [{d["confidence"]}] {flag}')
mm = json.load(open('/tmp/similarity-results/category-mismatches.json'))
print(f'\nTop 5 category mismatches:')
for m in mm[:5]:
    print(f'  {m["slug"]}: {m["current"]} → {m["suggested"]} ({m["agreement"]})')
sim = json.load(open('/tmp/similarity-results/similar-tools.json'))
sample = list(sim.keys())[:2]
print(f'\nSimilar tools (sample):')
for slug in sample:
    # avoid backslash-escaped quotes inside the f-string (a SyntaxError before Python 3.12)
    top3 = [f"{s['slug']} ({s['score']})" for s in sim[slug][:3]]
    print(f'  {slug}: {top3}')
CHECK

Step 5: Update PostgreSQL (optional)
python3 << 'UPDATE'
import os, json, psycopg2
sim = json.load(open('/tmp/similarity-results/similar-tools.json'))
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("ALTER TABLE tools_db ADD COLUMN IF NOT EXISTS similar_tools JSONB DEFAULT '[]'")
updated = 0
for slug, similar in sim.items():
    cur.execute("UPDATE tools_db SET similar_tools = %s WHERE slug = %s", (json.dumps(similar), slug))
    updated += 1
conn.commit()
cur.close()
conn.close()
print(f'Updated {updated} rows')
UPDATE

Notes
- This is the exact script running in production at ~/services/simulation/jobs/compute-similarity.py
- Scheduled nightly in the simulation's periodic-scheduler.js (intervalMin: 1440)
- Results feed into the Priority Orchestrator (G24) — duplicate pairs become penalties, mismatches become bonuses
- The CATEGORY_MISMATCH_THRESHOLD in production is 0.70 (not 0.60 as in some drafts)
- The similar_tools JSONB column is served by the website's "Similar Tools" feature on each tool page