Can Structural Features Predict Benchmark Difficulty for LLMs? An Information-Theoretic Analysis of ARC-Challenge Questions
Introduction
As the AI community develops increasingly sophisticated benchmarks to evaluate LLMs, a natural question arises: what makes a benchmark question hard? If difficulty could be predicted from structural properties of the question text alone, this would have important implications for benchmark design, test-set curation, and understanding model capabilities.
Prior work has applied Item Response Theory (IRT) to estimate per-question difficulty from LLM evaluation results [wang2024easy2hard]. However, these approaches require running many models on each question. We ask the complementary question: can we predict IRT difficulty scores from text features alone, without any model evaluation?
We focus on the ARC-Challenge benchmark [clark2018think], a standard science reasoning dataset, using IRT difficulty scores from the Easy2Hard-Bench dataset [wang2024easy2hard], which estimates difficulty from thousands of LLM evaluations on the Open LLM Leaderboard.
Methods
Data
We use the E2H-ARC split of Easy2Hard-Bench [wang2024easy2hard], containing 1,172 ARC-Challenge questions with normalized IRT difficulty ratings. Difficulty scores were estimated using IRT and Glicko-2 models from per-question accuracy data across thousands of LLMs. For reproducibility, we pin the HuggingFace dataset revision to commit 55bc0d2fb10954151e669d2026b87fa896f2fa26.
Feature Extraction
For each question, we extract 12 structural features without running any LLM:
- **question\_length**: Character count of question text
- **word\_count**: Number of words in the question
- **avg\_word\_length**: Mean word length (vocabulary complexity proxy)
- **answer\_entropy**: Shannon entropy over answer option character lengths
- **num\_choices**: Number of answer options (3, 4, or 5)
- **lexical\_overlap**: Jaccard similarity between question and answer words
- **negation\_count**: Count of negation words (not, never, except, etc.)
- **question\_type**: Encoded question starter (what/which/how/why/etc.)
- **flesch\_kincaid\_grade**: Readability grade level
- **unique\_word\_ratio**: Lexical diversity (unique/total words)
- **max\_option\_length\_ratio**: Ratio of longest to shortest answer option
- **stem\_overlap**: Jaccard similarity between question and correct answer

Analysis
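Two of the information-theoretic features above (answer entropy and lexical overlap) can be sketched in a few lines of Python. This is an illustrative implementation consistent with the feature definitions, not the study's exact code:

```python
import math

def answer_entropy(options):
    """Shannon entropy over the character-length distribution of answer options."""
    lengths = [len(o) for o in options]
    total = sum(lengths)
    probs = [n / total for n in lengths if n > 0]
    return -sum(p * math.log2(p) for p in probs)

def lexical_overlap(question, options):
    """Jaccard similarity between question words and the union of option words."""
    q_words = set(question.lower().split())
    a_words = {w for opt in options for w in opt.lower().split()}
    union = q_words | a_words
    return len(q_words & a_words) / len(union) if union else 0.0
```

With four equal-length options the length distribution is uniform, so `answer_entropy` returns its maximum of 2 bits; options of very unequal length push it lower.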
We compute Spearman rank correlations between each feature and IRT difficulty. We train a Random Forest regressor (100 trees, max depth 10, min samples leaf 3, seed 42) and evaluate using 5-fold cross-validation with three metrics: R², mean absolute error (MAE), and Spearman ρ. To calibrate whether any signal is meaningful, we also evaluate a dummy mean-prediction baseline on the same folds. Finally, we compute the out-of-fold Spearman correlation and run a label-permutation test (200 permutations) to assess whether the observed rank correlation exceeds chance.
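The cross-validation setup above can be sketched with scikit-learn. The hyperparameters come from the text; the feature matrix and labels here are random stand-ins for the real data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Random stand-ins for the real 1,172 x 12 feature matrix and IRT difficulties.
rng = np.random.default_rng(42)
X = rng.normal(size=(1172, 12))
y = rng.normal(size=1172)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "random_forest": RandomForestRegressor(
        n_estimators=100, max_depth=10, min_samples_leaf=3, random_state=42),
    "dummy_mean": DummyRegressor(strategy="mean"),
}

for name, model in models.items():
    # Held-out R^2 and MAE on the same folds for both models.
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=("r2", "neg_mean_absolute_error"))
    print(name,
          "R2:", scores["test_r2"].mean(),
          "MAE:", -scores["test_neg_mean_absolute_error"].mean())
```

Evaluating the dummy regressor on identical folds is what makes the MAE comparison in the results section meaningful.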
Results
Feature Correlations
The table below shows Spearman correlations between structural features and IRT difficulty. Only negation_count achieves statistical significance (p = 0.015), with a weak positive correlation (ρ = +0.071). All other features have p > 0.05. However, applying a Bonferroni correction for 12 simultaneous tests yields an adjusted threshold of α = 0.05/12 ≈ 0.004; no feature survives this correction, strengthening the negative-result interpretation.
*Spearman correlations between structural features and IRT difficulty.*
| Feature | ρ | p-value | Significant? |
|---|---|---|---|
| negation_count | +0.071 | 0.015 | Yes |
| max_option_length_ratio | +0.038 | 0.196 | No |
| answer_entropy | -0.037 | 0.201 | No |
| num_choices | +0.037 | 0.206 | No |
| avg_word_length | +0.035 | 0.227 | No |
| lexical_overlap | +0.031 | 0.297 | No |
| flesch_kincaid_grade | +0.027 | 0.359 | No |
| question_type | -0.015 | 0.605 | No |
| stem_overlap | +0.014 | 0.642 | No |
| question_length | +0.011 | 0.709 | No |
| unique_word_ratio | +0.005 | 0.864 | No |
| word_count | +0.002 | 0.947 | No |
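The Bonferroni check above is simple arithmetic, shown here for concreteness:

```python
alpha = 0.05
n_tests = 12
threshold = alpha / n_tests  # Bonferroni-adjusted per-test threshold

# Even the strongest raw p-value (negation_count, p = 0.015) fails it.
survives = 0.015 < threshold
print(round(threshold, 5), survives)
```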
Prediction Performance
The Random Forest model achieves a high training R² (0.510) but near-zero cross-validated performance (see table below), indicating severe overfitting. Relative to the dummy baseline, the random forest provides only a modest gain (MAE: 0.227 vs. 0.223). The out-of-fold Spearman ρ is 0.129; permutation testing confirms this exceeds chance, indicating the signal is statistically non-zero but still weak in practical terms.
*Difficulty prediction model performance.*
| Metric | Training (RF) | Cross-Validated RF (5-fold) | Cross-Validated Dummy (5-fold) |
|---|---|---|---|
| R² | 0.510 | 0.007 ± 0.010 | -0.001 ± 0.001 |
| MAE | 0.155 | 0.223 ± 0.007 | 0.227 ± 0.007 |
| Spearman ρ (fold mean) | — | 0.127 ± 0.023 | 0.000 ± 0.000 |
| Spearman ρ (out-of-fold) | — | 0.129 | -0.024 |
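The out-of-fold Spearman statistic and its permutation test can be sketched as follows. Random stand-in data replaces the real features and labels, and the permutation count is reduced from the paper's 200 for speed:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 12))  # stand-in feature matrix
y = rng.normal(size=200)        # stand-in difficulty labels

cv = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestRegressor(n_estimators=100, max_depth=10,
                           min_samples_leaf=3, random_state=42)

# Out-of-fold predictions: each question is predicted by a model
# that never saw it during training.
oof = cross_val_predict(rf, X, y, cv=cv)
observed = spearmanr(oof, y)[0]

# Null distribution: shuffle labels and repeat the whole CV procedure.
null = []
for _ in range(20):  # 200 permutations in the paper
    y_perm = rng.permutation(y)
    oof_perm = cross_val_predict(rf, X, y_perm, cv=cv)
    null.append(spearmanr(oof_perm, y_perm)[0])

# Add-one smoothing keeps the estimated p-value strictly positive.
p_value = (1 + sum(r >= observed for r in null)) / (1 + len(null))
```

Permuting labels before the entire cross-validation loop, rather than after, ensures the null distribution reflects the full fitting procedure.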
Feature Importance
The Random Forest feature importances (Mean Decrease in Impurity) rank Flesch-Kincaid grade (0.156), average word length (0.128), and answer entropy (0.121) as the top three features. However, given the near-zero cross-validated R², these importances primarily reflect overfitting patterns rather than generalizable predictive signal.
Discussion
Our key finding is a negative result: structural features of benchmark questions cannot meaningfully predict which questions are difficult for LLMs. The cross-validated R² of 0.007 means that structural features explain less than 1% of the variance in difficulty. Although permutation testing indicates the rank signal is statistically non-zero, the absolute effect size remains small and practically weak. This is consistent with the intuition that LLM difficulty arises from reasoning demands—the depth of inference, domain knowledge, and logical chains required—rather than from surface properties like question length or vocabulary complexity.
The one significant finding—negation count's positive correlation with difficulty—aligns with known challenges LLMs face with negation. Questions containing "NOT," "except," or "never" require negating a default inference, adding a reasoning step that structural features can detect.
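A negation counter of this kind is straightforward; the word list below is a hypothetical one, since the paper specifies only "not, never, except, etc.":

```python
import re

# Hypothetical list; the paper names only "not", "never", and "except" explicitly.
NEGATION_WORDS = {"not", "never", "except", "no", "none", "neither", "nor", "cannot"}

def negation_count(text):
    """Count negation words case-insensitively, on whole-token boundaries."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in NEGATION_WORDS for t in tokens)

negation_count("Which of the following is NOT a mammal?")  # → 1
```

Tokenizing before matching avoids false positives on substrings such as "note" or "nothing".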
Implications
- **Benchmark design:** Surface-level filtering (by length, readability, etc.) cannot substitute for empirical difficulty estimation via model evaluation.
- **IRT validation:** The fact that IRT scores are not predictable from text properties supports the validity of IRT-based difficulty estimation as capturing genuine reasoning difficulty.
- **Feature engineering:** Future work on difficulty prediction should focus on semantic features (e.g., knowledge graph distance, reasoning chain depth) rather than structural properties.

Limitations
- IRT difficulty scores are derived from LLM (not human) performance.
- Only ARC-Challenge was analyzed; results may differ for other benchmarks.
- Our 12 features are a subset of possible structural features; additional text features (e.g., dependency parse depth, named entity count) might improve prediction.
- The Random Forest may not capture all non-linear relationships; other models could perform better.

Conclusion
We provide quantitative evidence that structural features of multiple-choice questions—including question length, answer entropy, lexical overlap, and readability—cannot predict LLM difficulty (cross-validated R² = 0.007, Spearman ρ = 0.13). This negative result underscores that benchmark difficulty is a semantic property, not a structural one, and supports the continued use of model-based difficulty estimation methods like IRT for benchmark curation.
References
[clark2018think] Clark, P., Cowhey, I., Etzioni, O., et al. (2018). Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
[wang2024easy2hard] Wang, J., et al. (2024). Easy2Hard-Bench: Standardized difficulty labels for profiling LLM performance and generalization. NeurIPS 2024, Datasets and Benchmarks Track.
[ethayarajh2022understanding] Ethayarajh, K. & Choi, Y. (2022). Understanding dataset difficulty with 𝒱-usable information. ICML 2022.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: benchmark-difficulty-prediction
description: Predict benchmark question difficulty for LLMs using structural and information-theoretic features alone, without running any LLM. Analyzes ARC-Challenge questions with IRT difficulty scores from Easy2Hard-Bench (NeurIPS 2024), extracts 12 text features, and trains a Random Forest model to test whether surface-level question properties can predict LLM performance.
allowed-tools: Bash(git *), Bash(python *), Bash(python3 *), Bash(pip *), Bash(.venv/*), Bash(cat *), Read, Write
---

# Benchmark Difficulty Prediction from Structural Features

This skill analyzes whether structural features of multiple-choice benchmark questions can predict which questions are hard for LLMs, without running any LLM.

## Prerequisites

- Requires **Python 3.10+** and **internet access** (for downloading the Easy2Hard-Bench dataset from HuggingFace, ~2 MB text data).
- Expected runtime: **< 1 minute** on CPU.
- All commands must be run from the **submission directory** (`submissions/benchmark-difficulty/`).
- No GPU, API keys, or model weights required.

## Step 0: Get the Code

Clone the repository and navigate to the submission directory:

```bash
git clone https://github.com/davidydu/Claw4S.git
cd Claw4S/submissions/benchmark-difficulty/
```

All subsequent commands assume you are in this directory.

## Step 1: Environment Setup

Create a virtual environment and install dependencies:

```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
```

Expected: Step 2 (pytest) will verify all imports.

## Step 2: Run Unit Tests

Verify the analysis modules work correctly:

```bash
.venv/bin/python -m pytest tests/ -v
```

Expected: Pytest exits with all tests passed and exit code 0.

## Step 3: Run the Analysis

Execute the full benchmark difficulty prediction pipeline:

```bash
.venv/bin/python run.py
```

Expected: Script prints progress for 5 stages and exits with code 0. Files are created in `results/`.

Optional reproducibility/debug flags:

```bash
.venv/bin/python run.py --seed 42 --permutations 200 --output-dir results
.venv/bin/python run.py --use-hardcoded
```

This will:

1. Download 1,172 ARC-Challenge questions with IRT difficulty scores from Easy2Hard-Bench (falls back to a hardcoded sample of 98 questions if the download fails)
2. Extract 12 structural features from each question (no LLM needed)
3. Compute Spearman correlations between each feature and IRT difficulty
4. Train a Random Forest regressor and cross-validate with 5 folds
5. Compare against a dummy mean-prediction baseline on the same folds
6. Run a permutation test on out-of-fold Spearman correlation
7. Generate figures and a summary report
8. Save all results to `results/` with dataset provenance metadata

## Step 4: Validate Results

Check that results were produced correctly:

```bash
.venv/bin/python validate.py
```

Expected: Prints feature correlations, model metrics, baseline metrics, permutation significance, provenance metadata, and `Validation passed.`

## Step 5: Review the Report

Read the generated report:

```bash
cat results/report.md
```

The report contains:

- Model performance table (R-squared, MAE, Spearman rho, with cross-validation)
- Baseline comparison table (Random Forest vs dummy mean predictor)
- Permutation significance result for out-of-fold Spearman
- Feature correlations table ranked by absolute Spearman rho
- Feature importance ranking from the Random Forest model
- Key findings and interpretation
- Limitations section
- Reproducibility metadata (dataset/config/split/revision/source)

## Expected Key Findings

- **Negation count** has the strongest Spearman correlation with difficulty (~0.07, p < 0.05)
- **Cross-validated Spearman rho is weak** (~0.13), indicating structural features alone are **insufficient** to predict LLM difficulty
- **Dummy baseline comparison** shows only modest uplift from structural features
- **Permutation testing** can show a statistically non-zero rank signal, but practical predictive power remains weak
- This supports the conclusion that LLM difficulty is primarily determined by semantic reasoning demands, not surface-level question properties

## How to Extend

- **Add features:** Define new feature functions in `src/features.py` and add the name to `FEATURE_NAMES`.
- **Use a different benchmark:** Modify `src/data.py` to load MMLU, HellaSwag, or another dataset with difficulty labels.
- **Try a different model:** Replace `RandomForestRegressor` in `src/analysis.py` with gradient boosting, SVM, or a neural network.
- **Stress-test statistical rigor:** Increase `--permutations` to tighten the permutation p-value estimate.
- **Add per-question LLM predictions:** Extend the feature matrix with model-specific features (e.g., perplexity) to compare structural vs. model-aware prediction.
- **Analyze difficulty by subject:** Group ARC questions by topic and compare feature distributions across subjects.