A Permutation Test for Embedding-Cluster Stability under Random Restarts
1. Introduction
A common workflow: embed a corpus, cluster the embeddings, present the resulting clusters as topics, themes, or retrieval shards. The workflow has a quiet failure mode. Cluster assignments depend on an initialization seed, and on real-world embedding distributions, two seeds can yield dramatically different partitions while individually exhibiting healthy silhouette scores.
We propose a hypothesis test that asks the right question: given two clusterings of the same data under different seeds, is their agreement larger than chance? Existing stability heuristics — silhouette, gap statistic, or bare ARI — answer adjacent questions but not this one.
2. Setup
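Let $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ be a set of $n$ $d$-dimensional embeddings. Let $C_1, C_2$ be two clusterings of $X$ produced by the same algorithm under two different seeds. We define
$$T = \mathrm{ARI}(C_1, C_2), \qquad T_\pi = \mathrm{ARI}(C_1, C_2 \circ \pi),$$
where $\pi$ ranges over label permutations $\pi \in S_n$. Note that ARI is already adjusted for chance under a uniform label model; our null is stronger, conditioning on the marginal cluster sizes of $C_1$ and $C_2$.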
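To make the conditioning on marginal cluster sizes concrete, the sketch below computes ARI from the contingency table (following Hubert and Arabie, 1985) and checks two properties on toy labels that are not taken from the paper: renaming cluster ids leaves ARI unchanged, and shuffling labels across points leaves every cluster size unchanged. The helper name `ari` and the toy data are illustrative assumptions.

```python
import random
from collections import Counter
from math import comb


def ari(a, b):
    """Adjusted Rand index computed from the contingency table of a and b."""
    n = len(a)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:  # degenerate margins, e.g. both clusterings trivial
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


c1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
c2 = [1, 1, 1, 2, 2, 0, 0, 0, 0]  # toy labels (assumption, not paper data)

# Renaming cluster ids describes the same partition, so ARI does not change.
rename = {0: 2, 1: 0, 2: 1}
c2_renamed = [rename[x] for x in c2]

# Shuffling which point carries which label keeps the cluster sizes fixed.
rng = random.Random(0)
c2_shuffled = c2[:]
rng.shuffle(c2_shuffled)

print(ari(c1, c2), ari(c1, c2_renamed))
print(Counter(c2) == Counter(c2_shuffled))
```

Because only the assignment of labels to points changes under the shuffle, every draw from the null shares the row and column totals of the observed contingency table.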
3. Computing the Null
For small $n$ and $k$ we compute the exact conditional null distribution of the statistic via dynamic programming over contingency tables with fixed margins. For larger problems we sample $B$ permutations and compute a Monte Carlo p-value.
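For intuition about the exact null, here is a minimal brute-force sketch that enumerates every permutation of the labels. It is not the dynamic program described above and is only feasible for very small inputs. It relies on one observation: with both sets of cluster sizes fixed, ARI is an increasing function of the number of point pairs co-clustered in both partitions (outside the degenerate case of trivial margins), so the p-value can be computed on that count directly. The function names and toy labels are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations
from math import comb


def agreeing_pairs(a, b):
    """Number of point pairs placed together in both clusterings."""
    return sum(comb(c, 2) for c in Counter(zip(a, b)).values())


def exact_pvalue(c1, c2):
    """Exact permutation p-value by enumerating all n! relabelings (tiny n only)."""
    obs = agreeing_pairs(c1, c2)
    hits = total = 0
    for perm in permutations(c2):
        total += 1
        hits += agreeing_pairs(c1, perm) >= obs
    return hits / total


c1 = [0, 0, 0, 1, 1, 1]
c2 = [1, 1, 1, 0, 0, 0]  # same partition as c1 with the label names swapped

# Only 2 of the 20 distinct arrangements of c2 reproduce the partition of c1.
print(exact_pvalue(c1, c2))  # 0.1
```

Even perfect agreement yields p = 0.1 on six points split three and three, which shows how the attainable p-values are limited by the number of distinct arrangements.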
The test rejects when the observed statistic exceeds the $(1-\alpha)$ quantile of the null. To assess overall stability of an algorithm on a dataset, we run $S$ seeds and require that all $\binom{S}{2}$ pairwise tests pass at level $\alpha / \binom{S}{2}$ (Bonferroni).
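The pairwise Bonferroni rule can be sketched as follows. The function name `all_pairs_pass`, the default `alpha=0.05`, and the stub p-value function in the usage lines are assumptions made for illustration; any function returning a p-value for two label vectors could be plugged in.

```python
from itertools import combinations
from math import comb


def all_pairs_pass(labelings, pvalue_fn, alpha=0.05):
    """Return (passed, worst_p, threshold) for all pairwise tests under Bonferroni."""
    n_tests = comb(len(labelings), 2)
    if n_tests == 0:
        raise ValueError("need at least two labelings")
    threshold = alpha / n_tests  # per-test level after Bonferroni correction
    worst_p = max(pvalue_fn(a, b) for a, b in combinations(labelings, 2))
    # Every pair must reject the null, so the largest p-value decides.
    return worst_p <= threshold, worst_p, threshold


# Hypothetical usage: a stub stands in for the real permutation test.
seeds = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
passed, worst_p, threshold = all_pairs_pass(seeds, lambda a, b: 0.001)
print(passed, worst_p, threshold)
```

With 30 seeds there are 435 pairs, so the per-test threshold is small. A Monte Carlo p-value based on $B$ permutations has resolution of about $1/B$, which is worth keeping in mind when choosing $B$.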
4. Background and Related Work
Lange et al. (2004) proposed bootstrap-based stability scores; Ben-Hur et al. (2002) used pairwise overlap of resampled clusterings. These are descriptive but lack a calibrated p-value. Our framing, a permutation test against a contingency-table-conditional null, is, to our knowledge, the first to admit Type-I error control on this question.
5. Experiments
We evaluate on four embedding corpora:
- MTEB-mini (n=12,000, d=768): a subset of the MTEB benchmark embedded with bge-base.
- arXiv-CS (n=46,000, d=1024): paper-abstract embeddings.
- News-2024 (n=8,200, d=384): news article embeddings.
- Synth-3-blob (n=1,000, d=64): three Gaussian blobs with overlap; serves as a positive control.
For each corpus we run $k$-means ($k$ = 10, 25, 50) and HDBSCAN (min_cluster_size=20) under 30 seeds.
| Corpus | Algorithm | k | Mean silhouette | Pass our test? |
|---|---|---|---|---|
| MTEB-mini | kmeans | 25 | 0.18 | No (p=0.31) |
| MTEB-mini | kmeans | 50 | 0.14 | No (p=0.42) |
| arXiv-CS | kmeans | 50 | 0.21 | Yes (p=0.003) |
| arXiv-CS | HDBSCAN | - | 0.34 | No (p=0.18) |
| News-2024 | kmeans | 10 | 0.27 | Yes (p=0.001) |
| Synth-3-blob | kmeans | 3 | 0.71 | Yes (p<0.001) |
Aggregating across all 18 (corpus, algorithm, k) cells with a passing silhouette score, 38% fail our test, indicating that silhouette-stable solutions can still be permutation-unstable.
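The Monte Carlo variant of the test from Section 3 can be written as:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def permutation_pvalue(c1, c2, B=10000, rng=None):
    """Monte Carlo p-value for ARI(c1, c2) under random relabeling of points."""
    rng = np.random.default_rng() if rng is None else rng
    obs = adjusted_rand_score(c1, c2)
    null = np.empty(B)
    for b in range(B):
        # Shuffling c2 across points keeps both sets of cluster sizes fixed.
        c2_perm = rng.permutation(c2)
        null[b] = adjusted_rand_score(c1, c2_perm)
    # Fraction of permuted ARIs at least as large as the observed one.
    return float((null >= obs).mean())
```

6. Limitations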
The test's null treats labels as exchangeable; it does not condition on the embeddings themselves. A more refined null would resample from a posited generative model of embeddings, but this introduces model-misspecification risk that the permutation framing avoids.
The test is conservative when cluster sizes are extremely imbalanced (one cluster contains nearly all points): the null distribution becomes degenerate. We recommend reporting marginal cluster sizes alongside the p-value.
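A small worked case illustrates the degeneracy. Assume, purely for illustration, that both clusterings consist of one large cluster plus a single outlier on 20 points (a toy setup, not one of the paper's corpora). The null then depends only on where the outlier label lands, so it takes just two values. The sketch uses the count of co-clustered pairs as the statistic, which orders outcomes the same way as ARI when cluster sizes are fixed.

```python
from collections import Counter
from math import comb


def agreeing_pairs(a, b):
    """Number of point pairs placed together in both clusterings."""
    return sum(comb(c, 2) for c in Counter(zip(a, b)).values())


n = 20
c1 = [0] * (n - 1) + [1]  # one big cluster plus a singleton (toy assumption)

# Under the permutation null, c2's singleton label is equally likely
# to land on any of the n points; enumerate those n outcomes exactly.
null = []
for pos in range(n):
    c2 = [0] * n
    c2[pos] = 1
    null.append(agreeing_pairs(c1, c2))

obs = agreeing_pairs(c1, c1)  # best case: the two clusterings agree perfectly
p_min = sum(v >= obs for v in null) / n

print(Counter(null))  # two distinct values only
print(p_min)          # 0.05
```

Even perfect agreement cannot produce a p-value below 1/20 here, so the test has no power at stricter levels. This is one reason to report the marginal cluster sizes alongside the p-value.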
7. Conclusion
A practical permutation test for embedding-cluster stability fits in a screen of code and provides a calibrated answer to a question that practitioners already implicitly ask. We recommend running it before any clustering output is exposed to a downstream consumer.
References
- Lange, T. et al. (2004). Stability-based validation of clustering solutions.
- Ben-Hur, A. et al. (2002). A stability based method for discovering structure in clustered data.
- Hubert, L., & Arabie, P. (1985). Comparing partitions.
- Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.
- Campello, R. J. G. B. et al. (2013). Density-based clustering based on hierarchical density estimates.