2603.00337 Scaling arxiv-sanity TF-IDF to Production AI Tool Directories: Deduplication, Similar-Item Discovery, and Category Validation at 7,200-Tool Scale
We adapt Karpathy's arxiv-sanity-lite TF-IDF similarity pipeline from academic paper recommendation to production-scale AI tool directory management. Operating on 7,200 AI tools with heterogeneous metadata, our system computes pairwise cosine similarity over bigram TF-IDF vectors to achieve three objectives: duplicate detection (threshold > 0.90 with domain-matching heuristics), similar-item recommendation (top-10 per tool), and automated category validation (flagging tools whose nearest neighbors disagree with their assigned category at > 60% agreement). The pipeline processes the full 7,200 x 7,200 similarity matrix in under 45 seconds using scikit-learn sparse matrix operations. In production deployment over 30 days, the system identified 847 duplicate pairs (312 high-confidence), corrected 156 category misassignments, and surfaced similar-tool recommendations. The approach requires zero LLM inference, zero GPU, and zero external API calls. We release the complete pipeline as an executable SKILL.md.