Bias Diagnostics for LLM-Powered Survey Instruments in Economic Polling
1. Introduction
Large language models (LLMs) have entered the survey-research workflow at three points: (i) instrument drafting and translation, (ii) cognitive-interview simulation, and (iii) increasingly, synthesis of "silicon respondents" that purport to substitute for, or augment, human samples [Argyle et al. 2023]. Each entry point introduces measurement error of a kind that traditional survey methodology has limited tools for diagnosing.
This paper proposes BIASCAN, a battery of automated diagnostics intended to be run before an LLM-mediated instrument is fielded, and after synthetic responses are aggregated. Our contributions are:
- A taxonomy of four bias families specific to LLM-mediated surveys.
- An empirical audit of five canonical economic instruments.
- A statistical decision rule with explicit type-I/type-II trade-offs.
2. Background
Classical survey error decomposes into sampling and non-sampling components [Groves 2004]. LLM mediation creates novel sub-categories of measurement error:
- Ordering bias: LLMs frequently reorder Likert anchors during translation, inflating the prevalence of the first listed category by 4-7 points.
- Framing drift: Paraphrases preserve denotation but shift connotation; "do you expect prices to rise" becomes "are you concerned about price increases."
- Prestige bias: Synthetic respondents disproportionately endorse high-status occupations relative to U.S. CPS marginals.
- Synthetic-respondent collapse: Ensembles of LLM personas exhibit reduced cross-respondent variance, inflating apparent precision.
3. Method
Let $Q$ be the original instrument and $\tilde{Q}$ its LLM-mediated counterpart. For each item $q$, define the response shift

$$\Delta_q = \mathbb{E}_{r \sim \tilde{Q}}[r] - \mathbb{E}_{r \sim Q}[r],$$

estimated by parallel fielding to a matched human sample drawn from the AmeriSpeak panel.
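A minimal estimator for $\Delta_q$, assuming matched numeric response arrays (`r_llm` from the mediated instrument, `r_orig` from the human fielding; names are illustrative), with a normal-approximation standard error:

```python
import numpy as np

def response_shift(r_llm, r_orig):
    """Point estimate of Delta_q = E[r | Q~] - E[r | Q], plus a
    normal-approximation standard error for the difference in means."""
    r_llm = np.asarray(r_llm, dtype=float)
    r_orig = np.asarray(r_orig, dtype=float)
    delta = r_llm.mean() - r_orig.mean()
    se = np.sqrt(r_llm.var(ddof=1) / len(r_llm)
                 + r_orig.var(ddof=1) / len(r_orig))
    return delta, se
```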
3.1 Ordering test
We permute the Likert anchors $B$ times, refield the item under each permutation, and regress the per-permutation item mean on the listed position of a reference anchor,

$$\bar{r}_b = \alpha + \beta \,\mathrm{pos}_b + \varepsilon_b,$$

rejecting the null of no ordering effect when $|\hat{\beta}| > 2\,\mathrm{s.e.}(\hat{\beta})$.
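One way to operationalize the anchor-permutation test is an OLS fit of per-permutation item means on the reference anchor's listed position, flagging slopes beyond two standard errors. The regression specification here is our assumption, not necessarily the paper's exact fit:

```python
import numpy as np

def ordering_test(means, positions):
    """OLS of per-permutation item means on the listed position of a
    reference anchor. `means[b]` is the item mean under permutation b;
    `positions[b]` is where the reference anchor appeared (1-indexed).
    Flags ordering bias when |slope| exceeds two standard errors."""
    x = np.asarray(positions, dtype=float)
    y = np.asarray(means, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # intercept + position
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    se = np.sqrt(cov[1, 1])
    return beta[1], se, abs(beta[1]) > 2 * se
```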
3.2 Framing distance
We embed both wordings with a frozen sentence encoder and compute the cosine similarity $s(q) = \cos\big(e(Q_q), e(\tilde{Q}_q)\big)$. Pairs with $s(q)$ below a review threshold $\tau$ are flagged for human review.
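A minimal sketch of the flagging step, assuming the two embedding vectors have already been produced by the frozen encoder; the default `tau = 0.9` is illustrative, not the paper's calibrated threshold:

```python
import numpy as np

def framing_distance(e_orig, e_llm, tau=0.9):
    """Cosine similarity between frozen-encoder embeddings of the two
    wordings; pairs below the review threshold tau are flagged."""
    e_orig = np.asarray(e_orig, dtype=float)
    e_llm = np.asarray(e_llm, dtype=float)
    sim = e_orig @ e_llm / (np.linalg.norm(e_orig) * np.linalg.norm(e_llm))
    return sim, sim < tau
```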
3.3 Synthetic-respondent variance ratio
For item $q$ and demographic stratum $g$,

$$\mathrm{VR}_q(g) = \frac{\mathrm{Var}_{\text{LLM}}[r_q \mid g]}{\mathrm{Var}_{\text{human}}[r_q \mid g]}.$$

Values substantially below 1 indicate suspicious collapse.
```python
import pandas as pd

def variance_ratio(llm_responses, human_responses, group, value="response"):
    # Per-stratum variance ratio VR_q(g) = Var_LLM / Var_human.
    # Selecting the value column yields a Series, so .rename(str) is valid.
    v_llm = llm_responses.groupby(group)[value].var()
    v_hum = human_responses.groupby(group)[value].var()
    return (v_llm / v_hum).rename("variance_ratio")
```

4. Experiments
We audit five instruments. For each, we collect: (a) the original wording, (b) an LLM Spanish-then-English round-trip, (c) a 1{,}000-respondent silicon ensemble using GPT-class personas conditioned on ACS marginals, and (d) the matched human sample.
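The persona-conditioning step of (c) can be sketched as independent draws from published marginals; the marginal tables below are illustrative stand-ins, not actual ACS figures, and a production pipeline would rake to joint distributions rather than sample marginals independently:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative marginals (stand-ins, not actual ACS figures).
MARGINALS = {
    "age":       (["18-24", "25-34", "35-54", "55+"], [0.12, 0.18, 0.33, 0.37]),
    "education": (["<HS", "HS", "some college", "BA+"], [0.10, 0.27, 0.29, 0.34]),
}

def draw_personas(n):
    """Draw n persona attribute dicts, each attribute sampled
    independently from its marginal distribution."""
    return [
        {attr: rng.choice(levels, p=probs)
         for attr, (levels, probs) in MARGINALS.items()}
        for _ in range(n)
    ]
```

Each dict then seeds one silicon respondent's persona prompt.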
4.1 Headline shifts
| Instrument | Headline (human) | Headline (LLM-rephrased) | Shift |
|---|---|---|---|
| Michigan ICS | 71.4 | 75.9 | +4.5 |
| NY Fed SCE 1y inflation | 4.3% | 3.8% | -0.5 pp |
| Bundesbank | 53.1 | 56.7 | +3.6 |
| BoE/NMG | 58.0 | 62.2 | +4.2 |
| Custom labor battery | 47.8 | 49.6 | +1.8 |
Mean absolute shift: 3.8 index points; standard deviation 1.6.
4.2 Variance compression
Across 28 demographic cells (4 age × 7 education), the synthetic ensemble's variance ratio averaged below 1. The largest collapse occurred in the "college, age 25-34" cell, suggesting persona conditioning is most degenerate where training data is densest.
4.3 Audit performance
Using a held-out set of 60 instruments (30 known-biased, 30 known-clean), BIASCAN achieved sensitivity 0.92 at false-positive rate 0.05 (AUROC 0.97).
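An operating point like this can be read off a simple threshold sweep. The sketch below (hypothetical `scores` from the audit battery, binary `labels` for known-biased instruments) reports the best sensitivity achievable within a false-positive-rate budget:

```python
import numpy as np

def sensitivity_at_fpr(scores, labels, max_fpr=0.05):
    """Sweep decision thresholds over audit scores (higher = more
    suspect) and return the best true-positive rate whose
    false-positive rate stays at or below max_fpr."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        fpr = (pred & ~labels).sum() / max((~labels).sum(), 1)
        tpr = (pred & labels).sum() / max(labels.sum(), 1)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best
```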
5. Discussion and Limitations
Our "ground truth" is itself a fielded human survey subject to its own non-sampling errors. Detected shifts of 3-4 index points are economically meaningful at the level of central-bank communication but lie within the historical revision band of the underlying indices.
We deliberately do not recommend that LLM-mediated surveys be barred; rather, we argue that unaudited deployment of LLM rephrasings or silicon respondents in policy-relevant series is premature. The diagnostics we propose can be added to an instrument's pre-registration with modest effort.
A limitation of the variance-ratio test is its sensitivity to persona-prompt design; future work should isolate prompt-induced from model-induced collapse.
6. Conclusion
LLM-mediated surveys are not a free lunch. Without diagnostic discipline, headline economic indicators can drift several points without any change in underlying sentiment, and synthetic ensembles can underwrite false precision in subgroup estimates. BIASCAN offers an actionable starting point.
References
- Argyle, L. P. et al. (2023). Out of One, Many: Using Language Models to Simulate Human Samples.
- Groves, R. (2004). Survey Errors and Survey Costs.
- Bisbee, J. et al. (2024). Synthetic Replacements for Human Survey Data?
- Tourangeau, R., Rips, L. J., and Rasinski, K. (2000). The Psychology of Survey Response.