Statistical Tests for Watermarked Text Detection at Scale
1. Introduction
LLM watermarks embed a low-entropy statistical signal in generated text that, in expectation, biases the choice of certain tokens (the green list) over others. Detection then reduces to a hypothesis test: did the observed text come from a watermarked sampler, or from a human (or non-watermarked) source?
The widely used Kirchenbauer test [Kirchenbauer et al. 2023] computes

z = (S − γN) / √(Nγ(1 − γ)),

where S is the number of green-list tokens, N is the total number of evaluable tokens, and γ is the green-list fraction. Under the null of i.i.d. token draws this is approximately N(0, 1).
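As a minimal sketch of the statistic (the function name and example counts are illustrative, not from the paper):

```python
import math

def canonical_z(green_count, n_tokens, gamma=0.5):
    # Standardized green-token count under the i.i.d. null,
    # where green_count ~ Binomial(n_tokens, gamma).
    return (green_count - gamma * n_tokens) / math.sqrt(n_tokens * gamma * (1 - gamma))

# Illustrative: 260 green tokens out of 412 at gamma = 0.5
z = canonical_z(260, 412)  # well above typical detection thresholds
```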
The normal approximation assumes independence — but tokens are not independent in natural text, especially in technical or templated domains. We show this drives a substantial inflation of false positives in practice.
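A quick simulation illustrates the failure mode (a sketch with hypothetical parameters, not the paper's experiment): green indicators drawn from a symmetric two-state Markov chain keep the marginal rate at γ = 0.5, but positive persistence inflates the variance of the green count, so the nominal 1% threshold fires far more often.

```python
import numpy as np

Z_CRIT = 2.326  # one-sided 1% critical value under N(0, 1)

def empirical_fpr(n_docs=500, n_tokens=512, rho=0.6, seed=0):
    # Fraction of non-watermarked "documents" flagged at the nominal 1%
    # level when green indicators have lag-1 autocorrelation rho
    # (rho = 0 recovers the i.i.d. null the z-test assumes).
    rng = np.random.default_rng(seed)
    p_stay = (1 + rho) / 2  # symmetric chain: marginal 0.5, autocorrelation rho
    hits = 0
    for _ in range(n_docs):
        x = np.empty(n_tokens, dtype=int)
        x[0] = rng.integers(0, 2)
        flips = rng.random(n_tokens - 1) >= p_stay
        for t in range(1, n_tokens):
            x[t] = 1 - x[t - 1] if flips[t - 1] else x[t - 1]
        z = (x.sum() - 0.5 * n_tokens) / np.sqrt(0.25 * n_tokens)
        hits += z > Z_CRIT
    return hits / n_docs
```

With rho = 0.6 the variance of the green count grows by roughly (1 + rho)/(1 − rho) = 4×, so the realized false-positive rate lands an order of magnitude above the nominal 1%.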
2. Threat Model
A detector receives a span of text and must decide whether it was produced by a watermarked model. We assume:
- The detector knows the hash function and green-list parameters.
- The text may have been mildly edited (paraphrasing, partial human revision).
- The detector is calibrated for a target false-positive rate of 1% per document.
3. Method
3.1 Diagnosing inflation
On the C4-Web corpus we observe an empirical Type-I rate of 6.4% at the nominal 1% threshold (n = 8,192 documents). The inflation is concentrated in (a) code documents and (b) templated boilerplate.
3.2 A block-bootstrap correction
Let x_i ∈ {0, 1} indicate whether token i is on the green list, and let N be the number of evaluable tokens. We replace the N(0, 1) reference with a block bootstrap over text spans of length ℓ:

z*_b = (S*_b − γN) / √(Nγ(1 − γ)),

where S*_b is the green-token count under bootstrap replicate b. The (1 − α) quantile of {z*_b : b = 1, …, B} becomes the critical value.
3.3 Algorithm
```python
import numpy as np

def block_bootstrap_z(green_indicators, gamma=0.5, ell=32, B=2000, seed=0):
    """Bootstrap reference distribution for the green-list z-statistic.

    Resamples non-overlapping blocks of length ell with replacement to
    preserve local token dependence, returning B bootstrap z-values.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(green_indicators, dtype=int)
    n_blocks = len(x) // ell
    N = n_blocks * ell  # tokens actually resampled (trailing remainder dropped)
    blocks = x[:N].reshape(n_blocks, ell)
    zs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n_blocks, size=n_blocks)  # blocks with replacement
        S = blocks[idx].sum()  # green-token count in replicate b
        zs[b] = (S - gamma * N) / np.sqrt(N * gamma * (1 - gamma))
    return zs
```

4. Results
4.1 Type-I control
| Corpus | Canonical z (1% nominal) | Block-bootstrap |
|---|---|---|
| C4-Web | 6.4% | 1.1% |
| arXiv-LaTeX | 4.9% | 1.0% |
| GitHub-Python | 7.8% | 1.3% |
| Reuters-news | 1.6% | 1.0% |
4.2 Detection power
On 1,024 watermarked completions of average length 412 tokens, our test and the canonical z-test detect at statistically indistinguishable rates (McNemar's test).
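For reference, the exact (binomial) form of McNemar's test on paired detection outcomes can be sketched as follows; the discordant counts in the example are hypothetical, not the paper's.

```python
from math import comb

def mcnemar_exact_p(b, c):
    # b: documents detected only by the first test,
    # c: documents detected only by the second.
    # Under H0 (equal power) each discordant pair is a fair coin flip,
    # so the smaller count follows the tail of Binomial(b + c, 1/2).
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided, capped at 1

p = mcnemar_exact_p(7, 11)  # hypothetical discordant counts
```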
4.3 Cost
One additional pass over the token stream and bootstrap aggregations cost a median 18 ms per document on a single CPU thread.
5. Discussion and Limitations
The block bootstrap assumes weakly stationary dependence, which can fail at section boundaries (e.g. a paper mixing prose and code). We did not evaluate adversarial paraphrasing attacks; recent work [Krishna et al. 2024] suggests these can erase the green-list signal entirely, in which case no test based on green-list frequency will recover power. Finally, our bootstrap is only valid when the number of blocks N/ℓ is large.
6. Conclusion
A dependence-aware reference distribution restores valid Type-I control for green-list watermark detectors at negligible cost in detection power. We recommend it as a drop-in replacement for the canonical z-test in production deployments.
References
- Kirchenbauer, J. et al. (2023). A Watermark for Large Language Models.
- Krishna, K. et al. (2024). Paraphrasing Evades Detectors of AI-Generated Text.
- Politis, D. and Romano, J. (1994). The Stationary Bootstrap.
- Aaronson, S. (2023). Watermarks via Pseudorandom Functions.