An Executable Workflow for Identifying Digital Governance Outperformers: Random Forest on Non-Overlapping EGDI Predictors with Cross-Validation and Feature Ablation
Introduction
We present an executable workflow that explains UN E-Government Development Index (EGDI) scores from four socioeconomic indicators, identifies countries outperforming their development level, and produces publication-ready visualizations. The workflow trains a Random Forest on 2018-2020 EGDI data, validates on held-out 2022 scores, and compares against three baselines — all in a single Python script requiring only NumPy and Matplotlib.
Key result: Using four features that do not overlap with EGDI sub-components (GDP per capita, CPI, urbanization, government expenditure), the model achieves R² = 0.935 on 52 held-out countries. This exceeds the GDP-only model (R² = 0.854) by 0.081, or 8.1 percentage points of explained variance, indicating multivariate explanatory value beyond wealth alone.
Design Decisions
Why exclude internet penetration and schooling? These are direct inputs to EGDI's Telecommunication Infrastructure Index and Human Capital Index respectively. Including them would create circular predictions. We retain only features with zero EGDI sub-component overlap.
Why Random Forest over OLS? The GDP-EGDI relationship is non-linear: a change in GDP per capita near the low end of the range (around 4K) is associated with a much larger EGDI change than a comparable change near the high end (around 80K). Linear regression achieves R² = 0.778; Random Forest captures this non-linearity and reaches R² = 0.935 without manual feature engineering.
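To make the non-linearity argument concrete, here is a minimal sketch of a bagged regression-tree ensemble in NumPy, compared against a straight-line fit on synthetic data with a saturating, GDP-like curve. It is not the paper's script: the function names (`build_tree`, `fit_forest`, `predict_forest`), the hyperparameters, and the synthetic curve are all assumptions made for illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def build_tree(X, y, rng, depth=0, max_depth=4, min_leaf=3):
    """Grow a regression tree; each split looks at a random feature subset."""
    if depth >= max_depth or len(y) < 2 * min_leaf or y.max() == y.min():
        return float(y.mean())  # leaf: predict the mean target
    n_features = X.shape[1]
    n_try = max(1, int(np.sqrt(n_features)))
    best = None  # (sum of squared errors, feature index, threshold)
    for j in rng.choice(n_features, size=n_try, replace=False):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:  # midpoints between values
            left = X[:, j] <= t
            n_left = int(left.sum())
            if n_left < min_leaf or len(y) - n_left < min_leaf:
                continue
            sse = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[~left] - y[~left].mean()) ** 2))
            if best is None or sse < best[0]:
                best = (sse, int(j), float(t))
    if best is None:
        return float(y.mean())
    _, j, t = best
    left = X[:, j] <= t
    return (j, t,
            build_tree(X[left], y[left], rng, depth + 1, max_depth, min_leaf),
            build_tree(X[~left], y[~left], rng, depth + 1, max_depth, min_leaf))

def predict_tree(node, x):
    # Internal nodes are tuples; leaves are plain floats.
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def fit_forest(X, y, n_trees=50, seed=42, max_depth=4, min_leaf=3):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
        forest.append(build_tree(X[idx], y[idx], rng, 0, max_depth, min_leaf))
    return forest

def predict_forest(forest, X):
    return np.array([np.mean([predict_tree(t, x) for t in forest]) for x in X])

# Synthetic, illustrative data: a score that saturates as "GDP" (in thousands) grows.
rng = np.random.default_rng(0)
gdp_train = rng.uniform(1, 80, size=(150, 1))
gdp_test = rng.uniform(1, 80, size=(60, 1))
curve = lambda g: 1.0 - np.exp(-g[:, 0] / 15.0)
y_train = curve(gdp_train) + rng.normal(scale=0.02, size=150)
y_test = curve(gdp_test) + rng.normal(scale=0.02, size=60)

slope, intercept = np.polyfit(gdp_train[:, 0], y_train, deg=1)
r2_linear = r2_score(y_test, slope * gdp_test[:, 0] + intercept)
r2_forest = r2_score(y_test, predict_forest(fit_forest(gdp_train, y_train), gdp_test))
print(f"linear R2 = {r2_linear:.3f}, forest R2 = {r2_forest:.3f}")
```

On a curve of this shape the straight line underfits the steep low-income region, while the piecewise-constant trees follow it. The same mechanism is what the comparison above attributes the R² gap to.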
Why not just use persistence (prior scores)? Persistence (2020→2022) achieves R² = 0.987 for forecasting. But it cannot explain why countries score what they do, or identify which countries outperform their development level. Our model is explanatory, not predictive.
Validation Summary
| Model | Test R² | Test MAE |
|---|---|---|
| Persistence (2020→2022) | 0.987 | 0.013 |
| Random Forest (4 non-overlapping features) | 0.935 | 0.036 |
| GDP-only Random Forest | 0.854 | 0.055 |
| Linear Regression (4 features) | 0.778 | 0.064 |
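The two columns in the table can be computed with a few lines of NumPy. The sketch below defines R² and MAE and applies them to a persistence baseline (the 2020 score reused as the 2022 prediction). The score arrays are made-up illustrative values, not the paper's data, and the helper names `r2_score` and `mae` are assumptions.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values only (NOT the paper's dataset).
egdi_2020 = np.array([0.40, 0.55, 0.70, 0.85, 0.95])
egdi_2022 = np.array([0.43, 0.56, 0.68, 0.88, 0.95])

# Persistence baseline: predict each 2022 score with the unchanged 2020 score.
print("persistence R2 :", round(r2_score(egdi_2022, egdi_2020), 3))
print("persistence MAE:", round(mae(egdi_2022, egdi_2020), 3))
```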
Cross-validation: 5-fold CV on training data yields R² = 0.882 ± 0.028, confirming stable generalization.
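A 5-fold procedure of this kind can be sketched as follows. How the script actually forms its folds (for example, whether both years of a country stay in the same fold) is not stated, so the shuffled split here is an assumption. Ordinary least squares stands in for the Random Forest to keep the example short, and the data are synthetic.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Shuffle row indices once, then cut them into n_splits disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_splits)

def fit_predict(X_train, y_train, X_test):
    # Stand-in model (OLS with intercept); swap in any fit/predict pair.
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def cross_val_r2(X, y, n_splits=5, seed=42):
    folds = kfold_indices(len(y), n_splits, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, score on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(r2_score(y[test_idx], pred))
    return np.array(scores)

# Synthetic data with the same shape as the training set (104 rows, 4 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 4))
y = X @ np.array([0.5, 0.2, 0.05, 0.05]) + rng.normal(scale=0.1, size=104)

scores = cross_val_r2(X, y)
print(f"5-fold CV R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

The reported "mean ± spread" format corresponds to `scores.mean()` and `scores.std()` over the five held-out folds.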
Feature ablation: Dropping GDP reduces test R² from 0.935 to 0.869 (still strong); dropping CPI reduces it to 0.922; dropping urbanization or government expenditure reduces it to 0.922-0.928. The model without GDP still explains 87% of variance, which supports the claim that the explanatory power is multivariate rather than driven by GDP alone.
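The ablation loop itself is simple: retrain with one column removed and rescore on the held-out set. The sketch below shows that pattern on synthetic data, again with OLS as a stand-in model. The feature labels, weights, and helper names are assumptions for illustration, not the script's internals.

```python
import numpy as np

FEATURES = ["gdp_per_capita", "cpi", "urbanization", "gov_expenditure"]

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(X_train, y_train, X_test):
    # Stand-in model (OLS with intercept), not the paper's Random Forest.
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def ablation(X_train, y_train, X_test, y_test, names):
    """Return test R2 for the full model and for each leave-one-feature-out model."""
    results = {"all features": r2_score(y_test, fit_predict(X_train, y_train, X_test))}
    for k, name in enumerate(names):
        keep = [j for j in range(len(names)) if j != k]  # drop column k
        pred = fit_predict(X_train[:, keep], y_train, X_test[:, keep])
        results[f"without {name}"] = r2_score(y_test, pred)
    return results

# Synthetic data: 104 training rows, 52 test rows, first feature dominant.
rng = np.random.default_rng(42)
weights = np.array([0.5, 0.2, 0.05, 0.05])
X_train, X_test = rng.normal(size=(104, 4)), rng.normal(size=(52, 4))
y_train = X_train @ weights + rng.normal(scale=0.05, size=104)
y_test = X_test @ weights + rng.normal(scale=0.05, size=52)

results = ablation(X_train, y_train, X_test, y_test, FEATURES)
for label, score in results.items():
    print(f"{label:<28} R2 = {score:.3f}")
```

The synthetic features here are independent, so removing the dominant one collapses R². With correlated real indicators the drop is smaller, because the remaining features partly substitute for the one removed.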
Feature Importance
| Feature | Importance |
|---|---|
| GDP per capita | 72.2% |
| CPI (corruption perceptions) | 20.6% |
| Urbanization | 3.8% |
| Government expenditure | 3.4% |
GDP and institutional quality (CPI) jointly account for 92.8% of total feature importance. The level of public spending is a weak predictor on its own: what matters is economic capacity and governance quality, not how much government spends.
Policy Outperformers
Countries with large positive residuals (actual EGDI > predicted) achieve digital maturity beyond what their socioeconomic indicators would suggest. These residuals are associated with deliberate digital policy — not proven to be caused by it — and could also reflect unmeasured factors (foreign aid, demographic structure, diaspora effects, measurement methodology).
| Country | Actual | Predicted | Residual |
|---|---|---|---|
| Saudi Arabia | 0.880 | 0.805 | +0.075 |
| Rwanda | 0.430 | 0.370 | +0.060 |
| Bahrain | 0.810 | 0.757 | +0.053 |
| Vietnam | 0.680 | 0.630 | +0.050 |
| South Korea | 0.952 | 0.908 | +0.044 |
Saudi Arabia shows the largest positive residual (+0.075). The UAE (similar GDP, higher CPI) shows near-zero residual (-0.009), suggesting Saudi outperformance is not a generic Gulf wealth effect but is consistent with the specific digital investments of Vision 2030 (Absher, Tawakkalna, SDAIA, Nafath). A causal interpretation would require additional controls.
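The residual column is simply actual minus predicted. A short sketch, using only the five rows from the table above, recomputes the residuals and ranks them. The dictionary layout and variable names are illustrative choices.

```python
# (actual EGDI, predicted EGDI) pairs copied from the outperformer table.
countries = {
    "Saudi Arabia": (0.880, 0.805),
    "Rwanda": (0.430, 0.370),
    "Vietnam": (0.680, 0.630),
    "Bahrain": (0.810, 0.757),
    "South Korea": (0.952, 0.908),
}

# Residual = actual - predicted; positive values mean outperformance.
residuals = {
    name: round(actual - predicted, 3)
    for name, (actual, predicted) in countries.items()
}

# Rank from largest to smallest positive residual.
ranked = sorted(residuals.items(), key=lambda item: item[1], reverse=True)
for name, residual in ranked:
    print(f"{name:<14}{residual:+.3f}")
```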
Workflow Output
The script produces:
- Console output: Train/test metrics, baselines, 5-fold CV, feature ablation, country-level predictions
- Charts: actual-vs-predicted scatter, residual bar chart, feature importance, model comparison
- JSON: Full results file for downstream processing
All outputs are deterministic (random seed 42) and reproduce identically across runs.
Limitations
- 54 of 193 countries: selection bias toward data-complete nations. The model can produce predictions for any country with the four indicators; expanding the dataset is the priority.
- Persistence beats the model for forecasting: this is an explanatory tool, not a forecaster.
- Residuals are associative: multiple confounders could explain positive residuals.
- COVID-era training data: 2020 data reflect pandemic conditions; strong 2022 test performance suggests robustness, but pandemic-driven digitization may inflate 2020 baseline scores.
- 104 training observations: the modest sample limits model complexity. 5-fold CV (R² = 0.882 ± 0.028) provides a conservative generalization estimate.
References
- UN DESA, "E-Government Survey 2018," 2018.
- UN DESA, "E-Government Survey 2020," 2020.
- UN DESA, "E-Government Survey 2022," 2022.
- World Bank, "World Development Indicators," 2024.
- IMF, "World Economic Outlook Database," Oct 2024.
- Transparency International, "Corruption Perceptions Index," 2018-2022.
- Breiman L., "Random Forests," Machine Learning 45(1), 2001.
- Krishnan S. et al., "E-government maturity," Information & Management 50(8), 2013.
- Zhao F. et al., "Digital divide and e-government," IT & People 27(1), 2014.
- Ingrams A. et al., "Transparency and open government," Perspectives on Public Mgmt & Gov 3(4), 2020.
- Singh H. et al., "Building digital government," GIQ 37(3), 2020.
- UN DESA, "E-Government Survey 2024," Sep 2024.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: egdi-predictor
description: >
  Executable workflow that explains government digital maturity (EGDI) from 4
  non-overlapping socioeconomic indicators. Random Forest R²=0.935 on held-out
  2022 data. Outperforms GDP-only by +0.081 R². 5-fold CV confirms
  generalization. Identifies policy outperformers via residuals. Produces 4
  publication-ready charts. Pure NumPy + Matplotlib.
allowed-tools: Bash(python *), Bash(pip *)
---

# EGDI Explanatory Workflow

Explains government digital maturity from GDP, CPI, urbanization, gov spending. Validates on held-out 2022 EGDI scores. Produces charts + JSON.

## Prerequisites

```bash
pip install numpy matplotlib --break-system-packages
```

## Run

```bash
python egdi_predictor.py
```

## Output

- Console: metrics, baselines, 5-fold CV, ablation, country predictions
- `output/charts/`: 4 PNG charts (scatter, residuals, importance, comparison)
- `output/results.json`: structured results