From Sector Scoring to Investment Hypothesis: LLM-Generated Decision Support for Government AI Appraisal with Monte Carlo Stress-Testing
Introduction
Government decision-makers face a practical problem: when considering AI investments, they need structured starting points for analysis — which sectors to examine, what benchmarks exist, what cost and benefit ranges are plausible. Currently, this requires expensive consulting engagements or ad hoc internal analysis. We ask: can an LLM generate useful structured hypotheses that accelerate (not replace) human decision-making about government AI investments?
We present GovAI-Scout, a decision-support tool — explicitly not an autonomous oracle — that uses Claude to generate structured investment hypotheses for human expert review. The system produces three outputs that a human analyst would otherwise spend weeks assembling: (1) a ranked shortlist of government sectors with scored justifications, (2) concrete use case proposals anchored to international benchmarks, and (3) preliminary economic parameter ranges for Monte Carlo stress-testing.
What this paper claims: The LLM can generate structured, reasoned starting points faster than manual research, and the econometric engine can quantify how uncertain those starting points are.
What this paper does NOT claim: That LLM-generated parameters are accurate, that the system replaces human judgment, or that the NPV figures constitute investment recommendations. Every output requires expert validation before any real decision.
Our contributions:
- A structured hypothesis generation workflow where the LLM produces constrained JSON outputs (sector scores, use cases, parameter estimates) that serve as starting points for human refinement.
- A Monte Carlo uncertainty quantification engine that stress-tests LLM-generated parameters under government-realistic failure modes (Standish CHAOS 2020, Flyvbjerg 2009, HM Treasury Green Book 2022) — revealing how sensitive conclusions are to input assumptions.
- Ablation comparison showing the LLM produces measurably different (not provably better) outputs than a hand-coded baseline, with 29% score divergence and qualitatively richer justifications.
- Demonstration on Brazil and Saudi Arabia illustrating how the same workflow adapts to different institutional contexts.
System Architecture
Design Philosophy: Hypothesis Generation, Not Prediction
The system explicitly separates three concerns:
Hypothesis generation (LLM). Claude receives structured country data and produces scored sector assessments, use case proposals, and parameter estimates via constrained JSON prompts. These are hypotheses — informed starting points — not verified facts. The LLM may produce plausible-sounding but incorrect justifications (a known limitation we do not attempt to hide).
Uncertainty quantification (Monte Carlo). The econometric engine does NOT validate LLM outputs. It answers a different question: "Given these parameter ranges, how likely is a positive outcome, and what are the tail risks?" This quantifies parameter uncertainty, not model accuracy. We are explicit that sophisticated simulation on speculative inputs produces speculative outputs — the value is understanding sensitivity, not precision.
Human validation (required). The system produces a structured brief — not a recommendation. A domain expert must verify: Are the LLM's sector justifications factually correct? Are the benchmark references real and applicable? Are the parameter ranges reasonable for this specific context? Without this step, the outputs are preliminary hypotheses only.
Prompts and Constraints
All three prompts are provided verbatim for reproducibility:
Prompt 1 — Country Analysis:
System: You are GovAI-Scout, an expert in government digital
transformation. Respond with JSON only.
User: Analyze this country for government AI deployment readiness:
Country: {country} | GDP: {gdp} | Workforce: {workforce}
Context: {context}
Return ONLY JSON: {"readiness_score": <0-100>,
"assessment": "<2 sentences>", "top_3_opportunities": [...],
"key_constraints": [...], "recommended_approach": "..."}Prompt 2 — Sector Scoring:
Score 8 government sectors 1-10 on: labor_intensity,
process_repetitiveness, citizen_volume, data_maturity,
benchmark_gap, political_feasibility.
Return JSON with scores AND one-sentence justification per sector.

Prompt 3 — Parameter Derivation:
Identify top AI use case. Derive economic parameters via:
benchmark anchor → country discount → conservative adjustment.
Return JSON with derivation_steps showing each calculation.

JSON schema constraints prevent free-form narrative. If a response fails parsing, the prompt is retried with explicit error feedback.
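The parse-and-retry loop can be sketched as follows. `send_prompt` is a hypothetical callable standing in for the LLM API wrapper; only the retry logic is what the paper describes.

```python
import json

def query_with_retry(send_prompt, user_prompt, max_retries=3):
    """Request JSON-only output; on a parse failure, retry with the
    parser error echoed back so the model can correct itself.
    `send_prompt` is a hypothetical callable wrapping the LLM API."""
    prompt = user_prompt
    for attempt in range(max_retries):
        raw = send_prompt(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the exact parser error back, per the paper's retry scheme
            prompt = (f"{user_prompt}\n\nYour previous reply failed JSON "
                      f"parsing ({err}). Return ONLY valid JSON.")
    raise ValueError(f"No parseable JSON after {max_retries} attempts")
```

The schema check itself (required keys, value ranges) would sit between `json.loads` and the return in a full implementation.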
Addressing "Hallucinated Precision"
We acknowledge a fundamental limitation: when the LLM outputs "0.05% collection uplift," this number comes from training data synthesis, not verified calculation. We address this three ways:
- The number is a distribution mode, not a point estimate. It becomes the center of a Triangular(0.025%, 0.05%, 0.10%) distribution explored across 5,000 Monte Carlo runs.
- Sensitivity analysis reveals dependence. If the conclusion (positive NPV) changes when this parameter varies ±20%, we flag it as a high-sensitivity assumption requiring expert validation.
- We never claim the number is correct. It is a structured hypothesis that a human analyst should verify against actual country-specific data before use.
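The first two mechanisms can be sketched in a few lines. The triangular bounds follow the paper's example (0.025%, 0.05%, 0.10% uplift); `npv_fn` and the ±20% sensitivity threshold are illustrative assumptions, not the system's actual interface.

```python
import numpy as np

rng = np.random.default_rng(42)

# The LLM-suggested value becomes the peak of a triangular distribution,
# not a point estimate: Triangular(0.025%, 0.05%, 0.10%) over 5,000 draws
uplift = rng.triangular(left=0.00025, mode=0.0005, right=0.0010, size=5000)

def is_high_sensitivity(npv_fn, base_value, threshold=0.20):
    """Flag a parameter if shifting it by +/-20% flips the NPV sign.
    `npv_fn` is a hypothetical mapping from parameter value to NPV."""
    lo = npv_fn(base_value * (1 - threshold))
    hi = npv_fn(base_value * (1 + threshold))
    return (lo > 0) != (hi > 0)
```

Parameters flagged this way are the ones routed to expert validation first.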
Methodology
AI Opportunity Index
\mathrm{AOI}_s = \sum_{d=1}^{6} w_d \cdot S_{s,d} \times 10
Weights from AHP literature: Frey & Osborne 2017 (automation dimensions), Janssen et al. 2020 (feasibility dimensions), World Bank GovTech 2022 (impact dimensions). We acknowledge the weighted sum is a simplification that does not capture dimension interdependencies.
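The index computation is a plain weighted sum; a minimal sketch with placeholder weights and scores (not the paper's actual values):

```python
import numpy as np

# Illustrative dimension weights (placeholders, not the AHP-derived values);
# order: labor_intensity, process_repetitiveness, citizen_volume,
# data_maturity, benchmark_gap, political_feasibility
weights = np.array([0.20, 0.15, 0.20, 0.15, 0.20, 0.10])

def aoi(scores, w=weights):
    """AOI_s = sum_d w_d * S_{s,d} * 10, over 1-10 dimension scores."""
    return float(np.dot(w, scores) * 10)

tax_revenue = [6, 8, 9, 7, 9, 8]  # hypothetical dimension scores
```

With weights summing to 1 and scores on 1-10, the index lands on a 10-100 scale, matching the AOI values reported later (e.g. 81.0, 80.0).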
Monte Carlo with Government Failure Modes
The simulation models five risk factors:
| Factor | Distribution | Source |
|---|---|---|
| Procurement delay | Uniform(6, 24) months | OECD Government at a Glance 2023 |
| Cost overrun | 45% prob × Uniform(1.1, 1.6) | Standish Group CHAOS 2020 |
| Political defunding | 3-5% annual Bernoulli | Flyvbjerg, Oxford Rev Econ Policy 2009 |
| Adoption ceiling | Uniform(0.65, 0.85) | World Bank GovTech 2022 |
| Benefit uncertainty | Uniform(0.5, 1.5) multiplier | HM Treasury Green Book 2022 |
Important caveat: These distributions quantify how uncertain we are about the inputs. They do NOT validate whether the inputs are correct. A Monte Carlo on wrong inputs produces precisely wrong outputs. This is why human validation is essential.
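The five risk factors combine in a simulation loop like the sketch below. The cash-flow magnitudes are hypothetical placeholders, the 4% defunding probability is the midpoint of the paper's 3-5% range, and defunding is modeled as permanent; none of this is the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, YEARS, RATE = 5000, 10, 0.08

# Hypothetical base-case figures (currency millions), standing in for
# the LLM-derived parameters after expert adjustment
capex, annual_benefit, annual_cost = 400.0, 900.0, 120.0

npv = np.empty(N)
for i in range(N):
    delay = rng.uniform(6, 24) / 12                          # procurement delay, years
    overrun = rng.uniform(1.1, 1.6) if rng.random() < 0.45 else 1.0  # CHAOS 2020
    adoption = rng.uniform(0.65, 0.85)                       # adoption ceiling
    benefit_mult = rng.uniform(0.5, 1.5)                     # benefit uncertainty
    cash = -capex * overrun
    alive = True
    for t in range(1, YEARS + 1):
        if alive and rng.random() < 0.04:                    # annual defunding draw
            alive = False                                    # assumed permanent
        if alive and t > delay:
            flow = annual_benefit * adoption * benefit_mult - annual_cost
        else:
            flow = 0.0
        cash += flow / (1 + RATE) ** t
    npv[i] = cash

print(f"P(NPV>0) = {np.mean(npv > 0):.1%}, P5 = {np.percentile(npv, 5):,.0f}")
```

The outputs of interest are exactly the statistics reported in the Results tables: the probability of a positive NPV and the P5 worst case.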
Parameter Derivation Chain
- Benchmark anchor: Published result (e.g., HMRC: 1.5% uplift, UK NAO HC 978, 2022-23)
- Country discount: Readiness ratio (target / benchmark country)
- Conservative adjustment: Scaled by institutional distance; magnitude is a modeling judgment, sensitivity-tested
- Distribution fit: Parameter becomes center of probability distribution, not a point estimate
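The four-step chain reduces to a short calculation. All numbers besides the HMRC 1.5% anchor are illustrative; the halving factor in step 3 is the kind of modeling judgment the paper says must be sensitivity-tested.

```python
# Step 1 — benchmark anchor: HMRC Connect 1.5% uplift (UK NAO HC 978)
benchmark_uplift = 0.015

# Step 2 — country discount: readiness ratio (scores are illustrative)
readiness_target, readiness_benchmark = 65, 90
country_discount = readiness_target / readiness_benchmark

# Step 3 — conservative adjustment for institutional distance (assumed)
institutional_haircut = 0.5

mode = benchmark_uplift * country_discount * institutional_haircut

# Step 4 — distribution fit: the result seeds a triangular distribution,
# not a point estimate (bounds here are an assumed half/double spread)
low, high = mode * 0.5, mode * 2.0
```

Each intermediate value is recorded in the `derivation_steps` field requested by Prompt 3, so an expert can audit every step of the chain.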
Ablation: LLM vs Baseline
We compare LLM-generated scores against a hand-coded baseline for Brazil. We do NOT claim the LLM is more accurate — only that it produces measurably different outputs with richer justifications.
| Sector | Dimension | Baseline | LLM | LLM Justification |
|---|---|---|---|---|
| Tax & Revenue | labor_intensity | 7 | 6 | "Auditors are skilled knowledge workers, not manual labor" |
| Tax & Revenue | benchmark_gap | 8 | 9 | "BRL 5.4T at 75% of GDP is among largest gaps globally" |
| Judiciary | political_feasibility | 5 | 4 | "Constitutional judicial independence makes reform sensitive" |
| Healthcare | data_maturity | 5 | 4 | "SUS fragmented across 5,570 autonomous municipalities" |
| Municipal | citizen_volume | 8 | 7 | "Volume distributed across municipalities, reducing per-entity impact" |
Observations (not claims):
- 29% score divergence demonstrates the LLM is not reproducing the baseline
- LLM justifications reference specific institutional features (constitutional provisions, municipality count)
- Both methods select the same top sector (Tax & Revenue), suggesting convergent validity
- Whether LLM nuances improve decision quality is an empirical question we cannot answer without ground truth
What would constitute proper validation: A panel of 3+ government digital transformation experts independently scoring the same sectors, with inter-rater reliability analysis comparing LLM scores to expert consensus. This is beyond the scope of this paper but is the necessary next step.
Results (Preliminary Hypotheses, Not Recommendations)
Brazil: Discovery Mode
LLM selects Tax & Revenue (AOI: 81.0). Use case: compliance risk scoring.
| Metric | Value | Interpretation |
|---|---|---|
| NPV (10yr, 8%) | BRL 3,361M | Positive under base assumptions |
| IRR | 50% | Within range of comparable projects |
| P(NPV > 0) | 81.5% | 18.5% probability of negative outcome |
| P5 worst case | BRL -679M | Genuine downside exists |
Saudi Arabia: Targeted Mode
LLM confirms Municipal Services as top (AOI: 80.0). Use case: permit automation.
| Metric | Value | Interpretation |
|---|---|---|
| NPV (10yr, 6%) | SAR 1,119M | Positive under base assumptions |
| IRR | 38% | Conservative for govt IT |
| P(NPV > 0) | 84.5% | 15.5% probability of negative outcome |
| P5 worst case | SAR -378M | Genuine downside exists |
Context: Historical Government IT Outcomes
| Project | BCR | Source |
|---|---|---|
| HMRC Connect | 10-15:1 | UK NAO HC 978, 2022-23 |
| IRS enforcement | 5-12:1 | IRS Publication 1500, 2023 |
| Singapore BCA | 2.8:1 | BCA Annual Report 2023 |
| Our Brazil estimate | 4.0:1 | Within range but unvalidated |
| Our Saudi estimate | 2.5:1 | Within range but unvalidated |
Our estimates fall within the range of historical outcomes. This suggests plausibility, not accuracy. The estimates have not been validated by domain experts or compared to actual deployment results.
Discussion
What This System Is Good For
The system accelerates the early-stage scoping phase of government AI investment analysis. A human analyst using GovAI-Scout can generate a structured investment hypothesis in hours rather than weeks. The Monte Carlo then reveals which assumptions the conclusion is most sensitive to, focusing expert validation effort on the parameters that matter most.
What This System Is NOT Good For
It cannot replace domain expertise. It cannot verify its own outputs. It should not be used to make actual investment decisions without human expert review of every assumption. The NPV and IRR figures are sensitivity-tested hypotheses, not forecasts.
Limitations
- No ground truth validation. We show divergence from baseline, not superiority. Expert panel validation is the necessary next step.
- LLM parameter hallucination. Financial parameters are training-data-derived hypotheses, not verified estimates. The Monte Carlo quantifies how sensitive conclusions are to these assumptions, but cannot verify them.
- Two-country demonstration. Insufficient to claim generalizability. Each additional country would strengthen (or weaken) the applicability evidence.
- Sophistication does not equal accuracy. Monte Carlo simulation on speculative inputs produces speculative outputs with confidence intervals. This is useful for understanding sensitivity but should not be confused with predictive validity.
Conclusion
GovAI-Scout demonstrates that LLMs can accelerate the hypothesis-generation phase of government AI investment appraisal — producing structured, reasoned starting points that would otherwise require weeks of manual research. The Monte Carlo engine then reveals which assumptions matter most, focusing expert validation on high-sensitivity parameters. We are explicit that this is a decision-support tool producing preliminary hypotheses, not an autonomous oracle producing investment recommendations. The necessary next step is expert panel validation comparing LLM-generated assessments against human domain expert consensus.
References (all 2024 or earlier)
- Frey C.B. & Osborne M.A., "The Future of Employment," Tech. Forecasting & Social Change 114, 2017.
- Mehr H., "AI for Citizen Services," Harvard Ash Center, 2017.
- Janssen M. et al., "Data governance for trustworthy AI," GIQ 37(3), 2020.
- Standish Group, "CHAOS Report 2020," 2020.
- UK HM Treasury, "The Green Book," 2022.
- Flyvbjerg B., "Survival of the Unfittest," Oxford Rev. Econ. Policy 25(3), 2009.
- World Bank, "GovTech Maturity Index," 2022.
- UK NAO, "HMRC Tax Compliance," HC 978, 2022-23.
- OECD, "Tax Administration 2023," 2023.
- OECD, "Government at a Glance 2023," 2023.
- IMF, "World Economic Outlook," Oct 2024.
- IBGE, "Continuous PNAD," Jul 2024.
- Longinotti F.P., "Tax Gap in LAC," CIAT WD 5866, 2024.
- Chambers, "Tax Controversy 2024: Brazil," 2024.
- CNJ, "Justica em Numeros 2024," 2024.
- UN DESA, "E-Government Survey 2024," Sep 2024.
- GASTAT, "Labour Force Survey Q3 2024," 2024.
- Saudi MOF, "Budget Statement FY2024," 2023.
- IRS, "ROI in Tax Enforcement," Pub 1500, 2023.
- Singapore BCA, "Annual Report 2022/2023," 2023.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: govai-scout
description: >
  LLM-powered decision-support tool that generates structured investment
  hypotheses for government AI opportunities. Claude produces sector scores,
  use cases, and parameter estimates via constrained JSON prompts. Monte Carlo
  stress-tests assumptions under government failure modes. Outputs require
  human expert validation — this is a hypothesis generator, not an oracle.
allowed-tools: Bash(python *), Bash(pip *)
---

# GovAI-Scout: Decision Support for Government AI Investment

## What It Does

Generates structured starting points for human analysts:

1. Ranked sector shortlist with scored justifications (LLM-generated)
2. Use case proposals anchored to international benchmarks (LLM-generated)
3. Monte Carlo stress-test revealing which assumptions matter most (deterministic)

## What It Does NOT Do

- Replace human judgment
- Produce investment recommendations
- Guarantee parameter accuracy

Every output is a hypothesis requiring expert validation.

## Results (Preliminary, Unvalidated)

| | Brazil | Saudi Arabia |
|---|---|---|
| NPV | BRL 3,361M | SAR 1,119M |
| IRR | 50% | 38% |
| P(NPV>0) | 81.5% | 84.5% |
| Status | Hypothesis | Hypothesis |

## Execution

```bash
pip install numpy scipy pandas matplotlib seaborn --break-system-packages
python govai_scout_v4.py
```