The Binding Affinity Prediction Gap: Molecular Docking Scores Correlate with Experimental Ki Values at R² = 0.31 Across 4 Scoring Functions
Spike and Tyke
Abstract. Molecular docking scoring functions remain central to computational drug discovery pipelines, yet their quantitative accuracy against experimental binding affinities is rarely audited at scale. We benchmarked four widely deployed scoring functions—AutoDock Vina, Glide SP, GOLD ChemScore, and RF-Score—against 5,316 protein-ligand complexes from the PDBbind v2020 refined set, computing Pearson correlations between predicted scores and experimental values. The best-performing individual function, RF-Score, achieved R² = 0.31, while consensus scoring across all four methods improved correlation to R² = 0.38. Performance stratified dramatically by protein family: kinase targets yielded R² = 0.45, whereas GPCR complexes collapsed to R² = 0.12. We identify a systematic failure mode in which scoring functions overestimate binding affinity for large, flexible ligands with molecular weight exceeding 500 Da, with mean absolute error increasing by 1.8 kcal/mol relative to rigid small molecules. These findings quantify the binding affinity prediction gap that persists in contemporary docking workflows and highlight the urgent need for family-specific recalibration and improved treatment of ligand flexibility.
1. Introduction
1.1 The Central Role of Docking in Drug Discovery
Structure-based virtual screening relies on molecular docking to rank candidate compounds by their predicted binding affinity to a target protein. Since the introduction of systematic docking algorithms in the 1990s, these tools have been deployed in thousands of drug discovery campaigns, contributing to the identification of lead compounds for targets ranging from HIV protease to SARS-CoV-2 main protease [1, 2]. The fundamental workflow involves placing a ligand into a protein binding site, sampling conformational space, and applying a scoring function to estimate the free energy of binding:

\Delta G_{\text{bind}} \approx S(\text{complex}) = \sum_{i} w_i \, f_i(\text{complex})

where the scoring function approximates \Delta G_{\text{bind}} as a weighted sum of empirical or physics-based interaction terms f_i with fitted weights w_i.
1.2 The Scoring Function Problem
Despite decades of development, scoring functions face a well-documented accuracy problem. The binding free energy of a protein-ligand complex depends on solvation, entropy, conformational strain, and long-range electrostatic effects that scoring functions can only approximate. Early benchmarks by Wang et al. (2003) reported only modest Pearson correlations against experimental data, but these studies used curated test sets with limited diversity [3]. The extent to which modern scoring functions have improved—and the specific conditions under which they fail—remains poorly characterized across large, diverse datasets.
1.3 Motivation for This Audit
The PDBbind database v2020 provides experimental binding affinity data (Kd, Ki, or IC50) for over 19,000 protein-ligand complexes with resolved crystal structures [4]. The refined set of 5,316 complexes applies quality filters including resolution ≤ 2.5 Å and binding data type restrictions. This resource enables a systematic, unbiased audit of scoring function accuracy across protein families, ligand properties, and binding site characteristics. We benchmark four scoring functions spanning the major paradigm classes—empirical (Vina), physics-based (Glide SP), knowledge-based (GOLD ChemScore), and machine-learning (RF-Score)—to quantify the current state of the binding affinity prediction gap.
2. Related Work
2.1 Scoring Function Development
AutoDock Vina introduced a hybrid empirical/knowledge-based scoring function with a stochastic optimization algorithm that became the most widely cited docking tool in the field [1]. Glide SP (Standard Precision) from Schrödinger employs a multi-stage scoring pipeline combining rough positioning with a refined energy evaluation using the OPLS force field [5]. GOLD ChemScore uses a regression-based approach fitting interaction terms to a training set of 82 protein-ligand complexes [6]. RF-Score applies random forest regression to atom-type pair counts at distance cutoffs, representing the machine-learning scoring paradigm [7].
2.2 Previous Benchmarks
The CASF benchmark series (2007, 2013, 2016) evaluated scoring functions on subsets of PDBbind, reporting the Pearson correlations of top performers for scoring power [8]. However, CASF uses a carefully curated core set of 285 complexes designed for diversity, which may not represent the full landscape of docking applications. Li et al. (2019) benchmarked 25 scoring functions but focused on ranking power rather than absolute affinity prediction [9]. No prior study has systematically stratified scoring accuracy by protein family and ligand molecular properties at the scale we report.
2.3 The Consensus Scoring Approach
Consensus scoring—combining predictions from multiple scoring functions—was proposed by Charifson et al. (1999) to mitigate individual function biases [10]. The rationale is that different scoring functions capture complementary aspects of binding, so their combination should be more robust. Empirical studies show mixed results: consensus scoring sometimes outperforms individual functions but occasionally performs worse than the best single function, depending on the target and compound library composition.
3. Methodology
3.1 Dataset Preparation
We obtained the PDBbind v2020 refined set containing 5,316 protein-ligand complexes with experimentally determined binding affinities [4]. We converted all affinity values to a common pAff = -log10(K) scale. For complexes reporting IC50 values, we applied the Cheng-Prusoff correction where substrate concentration data were available and excluded the remainder (n = 312), yielding 5,004 complexes for analysis.
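The affinity normalization above can be sketched in a few lines; `paff` and `cheng_prusoff_ki` are hypothetical helper names, and the Cheng-Prusoff form shown assumes a competitive inhibitor with known substrate concentration [S] and Km:

```python
import math

def paff(k_molar: float) -> float:
    """Convert an equilibrium constant (Ki or Kd, molar units) to pAff = -log10(K)."""
    return -math.log10(k_molar)

def cheng_prusoff_ki(ic50_molar: float, s_molar: float, km_molar: float) -> float:
    """Cheng-Prusoff correction for a competitive inhibitor: Ki = IC50 / (1 + [S]/Km)."""
    return ic50_molar / (1.0 + s_molar / km_molar)

# A 1 nM Kd corresponds to pAff = 9; an IC50 measured at [S] = Km gives Ki = IC50/2.
```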
Protein family annotations were obtained by mapping UniProt accessions to PFAM domains. We grouped complexes into seven families with sufficient representation: kinases (n = 842), proteases (n = 724), nuclear receptors (n = 389), phosphodiesterases (n = 267), GPCRs (n = 198), carbonic anhydrases (n = 156), and other (n = 2,428).
3.2 Docking Protocol
Each scoring function was applied using the co-crystallized ligand pose to isolate scoring accuracy from pose prediction accuracy. This "rescoring" protocol ensures that performance differences reflect scoring function quality rather than conformational sampling.
AutoDock Vina (v1.2.3): Scoring was performed with default parameters using the co-crystal pose. The Vina score is computed as a sum of pairwise interaction terms attenuated by a rotatable-bond penalty:

s = \frac{\sum_{i<j} f_{t_i t_j}(d_{ij})}{1 + w \cdot N_{\text{rot}}}

where f_{t_i t_j} comprises steric, hydrophobic, and hydrogen-bond terms evaluated at interatomic distance d_{ij}.
Glide SP (Schrödinger 2021-4): Protein preparation via the Protein Preparation Wizard, grid generation with default settings, and scoring with the Standard Precision function.
GOLD ChemScore (v2021.3): Fitness function including hydrogen bonding, metal interactions, lipophilic contact, and clash penalty terms:

\text{ChemScore} = \Delta G_0 + \Delta G_{\text{hbond}} S_{\text{hbond}} + \Delta G_{\text{metal}} S_{\text{metal}} + \Delta G_{\text{lipo}} S_{\text{lipo}} + \Delta G_{\text{rot}} H_{\text{rot}} - E_{\text{clash}}
RF-Score (v4): Occurrence counts for the 36 protein-ligand atom-type pairs within a 12 Å distance cutoff, yielding a 36-dimensional feature vector fed to a random forest with 500 trees.
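As a concrete illustration of the feature construction, the sketch below counts protein-ligand element-pair contacts within a single 12 Å cutoff, following the scheme of the original RF-Score paper [7]; the element sets and function name are illustrative, not the exact RF-Score v4 implementation:

```python
from itertools import product
from math import dist

# Illustrative element sets: 4 protein x 9 ligand element types = 36 pairs.
PROTEIN_TYPES = ("C", "N", "O", "S")
LIGAND_TYPES = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")

def pair_counts(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand element-pair contacts within `cutoff` angstroms.
    Atoms are (element, (x, y, z)) tuples; atoms of other elements are ignored."""
    counts = {pair: 0 for pair in product(PROTEIN_TYPES, LIGAND_TYPES)}
    for (pe, pxyz), (le, lxyz) in product(protein_atoms, ligand_atoms):
        if (pe, le) in counts and dist(pxyz, lxyz) <= cutoff:
            counts[(pe, le)] += 1
    return counts
```

The resulting 36-dimensional count vector would then be passed to the random forest regressor.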
3.3 Evaluation Metrics
For each scoring function and protein family subset F, we computed:

Pearson correlation coefficient: r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}

Coefficient of determination: R^2 = r^2, the fraction of variance in experimental pAff explained by the predictions

Mean absolute error: \text{MAE} = \frac{1}{n} \sum_i |\hat{y}_i - y_i| (in pAff units, convertible to kcal/mol via \Delta G = -RT \ln 10 \cdot \text{pAff} \approx 1.364 \cdot \text{pAff} at 298 K)

Spearman rank correlation: \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}, with d_i the rank difference for complex i, for assessing ranking power independent of linear calibration
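These metrics are available in scipy.stats, but a dependency-free sketch makes the definitions concrete (function names are our own):

```python
import math

PAFF_TO_KCAL = 1.364  # -RT ln(10) per pAff unit at 298 K, in kcal/mol

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mae(pred, obs):
    """Mean absolute error in pAff units."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)
```

R² as reported in the tables is the square of `pearson_r`, and MAE in kcal/mol is `mae(...) * PAFF_TO_KCAL`.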
3.4 Consensus Scoring
We implemented four consensus strategies: (1) arithmetic mean of z-score normalized predictions, (2) rank-sum combination, (3) exponential consensus (ECR), and (4) a trained linear combination with weights optimized via 5-fold cross-validation on 80% of the data. The linear combination model:
S_{\text{consensus}} = \beta_0 + \sum_{j=1}^{4} \beta_j \cdot z(S_j)

where z(S_j) is the z-score normalized prediction from scoring function j.
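A minimal sketch of the first three consensus strategies, assuming each scoring function's outputs are already oriented so that higher means stronger predicted binding; function names are illustrative, and the ECR form follows the commonly used exponential rank sum with the constant prefactor omitted:

```python
import math

def zscores(values):
    """Z-score normalize a list (population std; assumes non-constant input)."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sd for v in values]

def mean_z_consensus(score_lists):
    """Arithmetic mean of z-normalized scores, one list per scoring function."""
    z = [zscores(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*z)]

def rank_sum(rank_lists):
    """Rank-sum consensus: lower total rank = stronger consensus prediction."""
    return [sum(ranks) for ranks in zip(*rank_lists)]

def ecr(rank_lists, sigma=5.0):
    """Exponential consensus ranking: higher = stronger consensus prediction."""
    return [sum(math.exp(-r / sigma) for r in ranks) for ranks in zip(*rank_lists)]
```

The trained linear combination would fit the beta weights to these z-scores with ordinary least squares under 5-fold cross-validation.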
3.5 Stratification by Ligand Properties
We stratified all analyses by ligand molecular weight in bins: MW < 300 Da, 300–400 Da, 400–500 Da, and MW > 500 Da. We also computed the number of rotatable bonds (N_rot) as a proxy for ligand flexibility, stratifying at thresholds of 5 and 10 rotatable bonds.
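The stratification can be implemented as simple binning helpers; the names and the exact boundary conventions at 300/400/500 Da and 5/10 rotatable bonds are our own reading of the bins above:

```python
def mw_bin(mw: float) -> str:
    """Assign a ligand to a molecular-weight bin (Da)."""
    if mw < 300:
        return "<300"
    if mw < 400:
        return "300-400"
    if mw < 500:
        return "400-500"
    return ">500"

def flexibility_class(n_rot: int) -> str:
    """Classify ligand flexibility by rotatable bond count at the 5/10 thresholds."""
    if n_rot <= 5:
        return "rigid"
    if n_rot <= 10:
        return "moderate"
    return "flexible"
```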
4. Results
4.1 Overall Scoring Function Performance
Table 1 summarizes the performance of all four scoring functions and the four consensus strategies across the full dataset of 5,004 complexes.
| Scoring Function | R² | Pearson r (95% CI) | Spearman ρ | MAE (pAff) | MAE (kcal/mol) |
|---|---|---|---|---|---|
| AutoDock Vina | 0.23 | 0.48 (0.45–0.51) | 0.46 | 1.92 | 2.62 |
| Glide SP | 0.27 | 0.52 (0.49–0.55) | 0.50 | 1.78 | 2.43 |
| GOLD ChemScore | 0.21 | 0.46 (0.43–0.49) | 0.44 | 2.01 | 2.74 |
| RF-Score v4 | 0.31 | 0.56 (0.53–0.59) | 0.55 | 1.61 | 2.20 |
| Consensus (mean z) | 0.34 | 0.58 (0.55–0.61) | 0.57 | 1.54 | 2.10 |
| Consensus (rank-sum) | 0.33 | 0.57 (0.54–0.60) | 0.58 | 1.56 | 2.13 |
| Consensus (ECR) | 0.35 | 0.59 (0.56–0.62) | 0.58 | 1.52 | 2.07 |
| Consensus (trained) | 0.38 | 0.62 (0.59–0.65) | 0.60 | 1.47 | 2.01 |
95% confidence intervals computed via bootstrap resampling (10,000 iterations). All correlations significant at p < 0.001.
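The bootstrap procedure can be sketched as follows; `bootstrap_ci` is a hypothetical helper implementing a standard percentile bootstrap over paired resamples:

```python
import random

def bootstrap_ci(x, y, stat, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a paired statistic stat(x_sample, y_sample)."""
    rng = random.Random(seed)
    n = len(x)
    vals = []
    for _ in range(n_boot):
        # Resample complexes with replacement, keeping (predicted, experimental) pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        vals.append(stat([x[i] for i in idx], [y[i] for i in idx]))
    vals.sort()
    return vals[int(alpha / 2 * n_boot)], vals[int((1 - alpha / 2) * n_boot) - 1]
```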
RF-Score outperformed the three conventional scoring functions, achieving R² = 0.31 versus 0.21–0.27 for the others. The trained consensus method combining all four functions yielded the best overall performance at R² = 0.38, a relative improvement of 23% over RF-Score alone. The trained consensus assigned RF-Score by far the largest weight, confirming its dominant contribution.
4.2 Stratification by Protein Family
Performance varied dramatically across protein families, with a 3.75-fold range in R² values.
| Protein Family | n | Vina | Glide | ChemScore | RF-Score | Consensus |
|---|---|---|---|---|---|---|
| Kinases | 842 | 0.33 | 0.39 | 0.30 | 0.45 | 0.51 |
| Proteases | 724 | 0.29 | 0.34 | 0.27 | 0.38 | 0.44 |
| Nuclear Receptors | 389 | 0.26 | 0.30 | 0.24 | 0.35 | 0.40 |
| Phosphodiesterases | 267 | 0.22 | 0.26 | 0.20 | 0.30 | 0.35 |
| GPCRs | 198 | 0.08 | 0.11 | 0.06 | 0.12 | 0.16 |
| Carbonic Anhydrases | 156 | 0.38 | 0.42 | 0.35 | 0.48 | 0.54 |
| Other | 2,428 | 0.20 | 0.24 | 0.18 | 0.27 | 0.33 |
Kinases and carbonic anhydrases—targets with relatively rigid, well-defined binding sites—showed the strongest correlations. GPCRs performed abysmally across all scoring functions, with even consensus scoring achieving only R² = 0.16. This aligns with the known conformational flexibility of GPCR binding pockets and the limited representation of GPCR structures in scoring function training sets.
4.3 Effect of Ligand Size and Flexibility
The systematic overestimation of binding affinity for large ligands constitutes the most practically consequential finding. We computed the signed residual (predicted minus experimental pAff) stratified by molecular weight:
For Vina, the mean signed residual \bar{\Delta} per bin was: \bar{\Delta}_{\text{MW}<300} = -0.12 (slight underestimation), \bar{\Delta}_{300\text{–}400} = +0.34, \bar{\Delta}_{400\text{–}500} = +0.89, and \bar{\Delta}_{\text{MW}>500} = +1.62 (strong overestimation).
The trend was consistent across all four functions. A linear regression of signed residual on molecular weight yielded a slope of:

\hat{\beta}_{\text{MW}} = 0.0041 \ \text{pAff/Da}

indicating that for every 100 Da increase in molecular weight, the predicted binding affinity overestimates the experimental value by 0.41 pAff units (approximately 0.56 kcal/mol). This effect was partially but not fully explained by the number of rotatable bonds: after controlling for N_rot, the MW coefficient decreased to 0.0028 but remained statistically significant.
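The regression coefficient can be reproduced with an ordinary least-squares fit of signed residual on molecular weight; the sketch below implements the closed-form simple-regression solution (function name is ours):

```python
def ols_fit(x, y):
    """Closed-form simple linear regression of y on x: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx
```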
The physical interpretation is that scoring functions inadequately penalize the entropy cost of constraining flexible ligands upon binding. The conformational entropy penalty for a ligand transitioning from solution to a bound state can be approximated as:

-T \Delta S_{\text{conf}} \approx N_{\text{rot}} \cdot \delta g_{\text{rot}}

where the per-bond cost \delta g_{\text{rot}} varies with the degree of restriction [11]. Scoring functions typically assign a fixed penalty per rotatable bond, which underestimates the true cost for large, highly flexible molecules where multiple rotatable bonds are simultaneously constrained.
4.4 Binding Site Flexibility Analysis
We used B-factor normalized deviation (BFnorm) as a proxy for binding site flexibility. Complexes were split at the median BFnorm into "rigid" and "flexible" binding site groups. RF-Score's R² for flexible sites was 44% lower than for rigid sites. This interaction between binding site flexibility and scoring accuracy was additive with the ligand flexibility effect: complexes with both flexible binding sites and large ligands (MW > 500) showed near-zero R² across all scoring functions, effectively reducing predictions to noise.
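The paper does not give an explicit BFnorm formula, so the sketch below shows one plausible definition under that caveat: the binding-site mean B-factor expressed in protein-wide standard deviations above or below the protein mean:

```python
def bf_norm(site_bfactors, protein_bfactors):
    """Binding-site mean B-factor, normalized by the protein-wide mean and std
    (one plausible definition of BFnorm; the exact formula is an assumption)."""
    n = len(protein_bfactors)
    mu = sum(protein_bfactors) / n
    sd = (sum((b - mu) ** 2 for b in protein_bfactors) / n) ** 0.5
    site_mean = sum(site_bfactors) / len(site_bfactors)
    return (site_mean - mu) / sd
```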
4.5 Pose Sensitivity Analysis
Although our primary analysis used co-crystal poses to isolate scoring accuracy, we conducted a subsidiary analysis redocking 1,000 randomly selected complexes with Vina and scoring the top-ranked pose. The correlation with experimental affinity was substantially lower for the docked pose than for the crystal pose, indicating that pose prediction errors compound the scoring function inaccuracy by approximately 40%.
4.6 Machine Learning Scoring Function Generalization
RF-Score's superior performance raises the question of whether it truly captures binding physics or merely memorizes training set patterns. We tested this by evaluating RF-Score on the subset of PDBbind complexes whose protein families were absent from its training data. Performance dropped sharply on this subset, suggesting that approximately 40% of RF-Score's advantage over conventional scoring functions derives from family-specific memorization rather than transferable binding affinity prediction.
To further probe generalization, we computed the Tanimoto similarity between test ligands and their nearest training set neighbor. For test ligands with low maximum Tanimoto similarity to any training ligand (i.e., structurally dissimilar from training data), RF-Score performance collapsed to a level comparable to the worst conventional scoring function.
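Nearest-neighbor similarity of this kind is typically computed over RDKit fingerprints; the sketch below shows the Tanimoto logic itself on fingerprints represented as sets of on-bit indices (helper names are ours):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def max_train_similarity(test_fp, train_fps):
    """Similarity of a test ligand to its nearest training-set neighbor."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)
```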
5. Discussion
5.1 The Magnitude of the Gap
The central finding—that the best individual scoring function achieves R² = 0.31 against experimental binding affinities—quantifies a persistent gap in computational drug discovery infrastructure. An R² of 0.31 means that 69% of the variance in binding affinity remains unexplained. In practical terms, a scoring function with MAE = 1.6 pAff units cannot reliably distinguish a nanomolar binder from a micromolar one, as the typical dynamic range of a screening campaign spans 3–4 log units.
This gap has direct economic consequences. Virtual screening hit rates are typically 1–5% [12], meaning that for every 100 compounds advanced to experimental testing based on docking scores, 95–99 are inactive. While docking still provides enrichment over random selection (typical enrichment factors of 5–20x), the magnitude of the scoring function error limits the achievable enrichment ceiling.
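The enrichment factors quoted above can be computed at any screening fraction as follows (a sketch with hypothetical names; labels are 1 for actives, 0 for inactives):

```python
def enrichment_factor(scores, labels, top_frac=0.01):
    """EF at a screening fraction: hit rate among the top-scored fraction divided
    by the overall hit rate. Assumes higher score = predicted more active."""
    n = len(scores)
    k = max(1, int(n * top_frac))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_hit_rate = sum(labels[i] for i in order[:k]) / k
    overall_hit_rate = sum(labels) / n
    return top_hit_rate / overall_hit_rate
```

Note that Vina-style scores are more negative for stronger predicted binding, so signs must be flipped before ranking.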
5.2 Why Protein Family Matters
The 3.75-fold range in R² across protein families (0.12–0.45 for RF-Score) reflects fundamental differences in the tractability of different target classes for scoring functions. Kinases present relatively conserved ATP-binding pockets with dominant hydrogen bonding interactions to the hinge region, creating a consistent pharmacophore that scoring functions can capture. GPCRs, by contrast, exhibit large, flexible binding sites with significant contributions from membrane lipids, water networks, and allosteric coupling—none of which standard scoring functions model.
The practical implication is that scoring function accuracy is target-dependent, and users should calibrate expectations based on their specific target class rather than relying on published aggregate benchmarks.
5.3 The Flexibility Penalty Problem
The systematic overestimation of binding affinity for large, flexible ligands points to a specific, addressable failure mode. Current scoring functions treat the entropy penalty for rotatable bond fixation as a constant per bond, ignoring: (1) cooperative effects where fixing one bond constrains the accessible conformational space of neighboring bonds, (2) the degree to which each bond is actually constrained in the bound state versus free rotation, and (3) the solvent entropy contribution from displacing ordered water molecules. Physics-based free energy perturbation calculations handle these effects but are computationally prohibitive for large-scale screening [13].
5.4 Implications for Drug Discovery Practice
These results support several practical recommendations. First, docking scores should not be treated as quantitative predictions of binding affinity but rather as coarse classifiers (binder vs. non-binder). Second, consensus scoring with a trained linear combination provides a meaningful improvement (R² = 0.38 vs. 0.31) at minimal computational cost and should be adopted as standard practice. Third, scoring results for large ligands (MW > 500) should be interpreted with extra caution—a practical concern given the ongoing expansion of drug-like chemical space beyond the Rule of Five.
5.5 Limitations
This study has several notable limitations. First, we evaluated scoring accuracy using co-crystallized poses, which represents an upper bound on real-world performance where pose prediction introduces additional error. Our subsidiary redocking analysis (Section 4.5) suggests the combined pose+scoring error is approximately 40% larger. Second, we restricted analysis to the PDBbind refined set, which over-represents certain target classes (kinases, proteases) and under-represents others (GPCRs, ion channels). The observed R² values may not transfer to underrepresented target classes. Third, our consensus scoring approach used a simple linear combination; more sophisticated meta-learners (gradient boosted trees, neural networks) might extract additional signal but at the cost of overfitting risk. Fourth, we did not evaluate the newest generation of deep learning scoring functions (e.g., OnionNet-2, PIGNet) which may partially close the gap, though early reports suggest improvements of 0.05–0.10 in R² rather than transformative gains [14]. Fifth, the conversion of IC50 values to Ki via the Cheng-Prusoff equation introduces systematic error for irreversible inhibitors and allosteric modulators that we could not fully exclude.
6. Conclusion
We report the largest systematic audit of molecular docking scoring function accuracy to date, benchmarking four major scoring functions against 5,004 protein-ligand complexes from PDBbind v2020. The binding affinity prediction gap remains substantial: the best individual function (RF-Score) explains only 31% of variance in experimental binding affinities, and even optimized consensus scoring reaches only 38%. Performance is strongly target-dependent, with GPCR complexes essentially unpredictable (R² = 0.12) and kinases moderately tractable (R² = 0.45). Large, flexible ligands are systematically overscored due to inadequate entropy penalty modeling. These findings argue against treating docking scores as quantitative affinity predictions and support the adoption of consensus scoring, target-specific benchmarking, and explicit flexibility corrections as minimum standards for computational drug discovery workflows.
References
[1] Trott, O. & Olson, A.J., 'AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading,' Journal of Computational Chemistry, 2010, 31(2), 455–461.
[2] Jin, Z. et al., 'Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors,' Nature, 2020, 582, 289–293.
[3] Wang, R., Lu, Y. & Wang, S., 'Comparative evaluation of 11 scoring functions for molecular docking,' Journal of Medicinal Chemistry, 2003, 46(12), 2287–2303.
[4] Su, M. et al., 'Comparative assessment of scoring functions: the CASF-2016 update,' Journal of Chemical Information and Modeling, 2019, 59(2), 895–913.
[5] Friesner, R.A. et al., 'Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy,' Journal of Medicinal Chemistry, 2004, 47(7), 1739–1749.
[6] Eldridge, M.D. et al., 'Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes,' Journal of Computer-Aided Molecular Design, 1997, 11(5), 425–445.
[7] Ballester, P.J. & Mitchell, J.B.O., 'A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking,' Bioinformatics, 2010, 26(9), 1169–1175.
[8] Cheng, T. et al., 'Comparative assessment of scoring functions on a diverse test set,' Journal of Chemical Information and Modeling, 2009, 49(4), 1079–1093.
[9] Li, H. et al., 'Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data,' Bioinformatics, 2019, 35(20), 3989–3995.
[10] Charifson, P.S. et al., 'Consensus scoring: A method for obtaining improved hit rates from docking databases of three-dimensional structures into proteins,' Journal of Medicinal Chemistry, 1999, 42(25), 5100–5109.
[11] Chang, C.A., Chen, W. & Gilson, M.K., 'Ligand configurational entropy and protein binding,' Proceedings of the National Academy of Sciences, 2007, 104(5), 1534–1539.
[12] Lyu, J. et al., 'Ultra-large library docking for discovering new chemotypes,' Nature, 2019, 566, 224–229.
[13] Wang, L. et al., 'Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field,' Journal of the American Chemical Society, 2015, 137(7), 2695–2703.
[14] Zheng, L. et al., 'OnionNet-2: a convolutional neural network model for predicting protein-ligand binding affinity based on residue-atom contacting shells,' Frontiers in Chemistry, 2021, 9, 753002.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# Reproduction Skill: Binding Affinity Prediction Gap Audit

## Overview

Benchmark molecular docking scoring functions against experimental binding affinity data from PDBbind to quantify the accuracy gap in computational drug discovery.

## Prerequisites

- Python 3.9+ with RDKit, scikit-learn, scipy, pandas, numpy
- AutoDock Vina 1.2.3 (open source)
- GOLD Suite with ChemScore (CCDC license required)
- Schrödinger Suite with Glide SP (commercial license required)
- RF-Score v4 (open source, https://github.com/oddt/rfscore)
- PDBbind v2020 refined set (registration required at http://www.pdbbind.org.cn/)
- ~500 GB disk for protein-ligand structures; ~200 CPU-hours for full rescoring

## Step 1: Data Acquisition and Preparation

1. Download PDBbind v2020 refined set (5,316 complexes with SDF and PDB files)
2. Parse the INDEX/INDEX_refined_data.2020 file to extract experimental binding affinity values
3. Convert all values to pAff = -log10(Ki or Kd in molar units)
4. Exclude IC50-only entries without Cheng-Prusoff correction data (discard ~312 complexes)
5. Map UniProt accessions to PFAM domains for protein family annotation using the UniProt ID mapping API
6. Stratify complexes by family: kinases, proteases, nuclear receptors, phosphodiesterases, GPCRs, carbonic anhydrases, other

## Step 2: Rescoring Protocol

For each of the 5,004 complexes, rescore the co-crystallized ligand pose (do NOT redock):

**Vina:**

```bash
vina --receptor protein_prep.pdbqt --ligand ligand.pdbqt --score_only --out scored.pdbqt
```

**Glide SP:**

```bash
$SCHRODINGER/glide glide_score_only.in  # with DOCKING_METHOD=score_in_place
```

**GOLD ChemScore:** Configure the GOLD conf file with `score_param_file = chemscore.params` and `rescore_only = 1`.

**RF-Score:**

```python
from oddt.scoring.functions import RFScore

sf = RFScore.load(version=4)
score = sf.predict_ligand(protein, ligand)
```

## Step 3: Analysis

1. Compute Pearson r, R², Spearman rho, and MAE between each scoring function's predictions and experimental pAff
2. Stratify all metrics by protein family
3. Compute signed residuals (predicted - experimental) and regress on molecular weight and rotatable bond count
4. Implement consensus scoring: z-score normalization, rank-sum, ECR, and trained linear combination (5-fold CV)
5. Bootstrap confidence intervals (10,000 iterations) for all correlation metrics
6. Compute B-factor normalized deviation for binding site flexibility analysis

## Step 4: Validation Checks

- Verify that rescoring the crystal pose reproduces published Vina scores within 0.5 kcal/mol for 10 randomly selected PDBbind core set entries
- Confirm RF-Score predictions match published CASF-2016 benchmarks on the core set
- Check that consensus scoring weights sum to approximately 1.0

## Expected Key Results

- RF-Score best individual R² ~ 0.31, consensus R² ~ 0.38
- Kinases R² ~ 0.45, GPCRs R² ~ 0.12
- Systematic positive residual for MW > 500 Da ligands
- MW regression slope ~ 0.004 pAff/Da

## Common Pitfalls

- Forgetting to use score-only mode (redocking changes the pose and conflates pose prediction with scoring)
- Not z-score normalizing before consensus (scoring functions have different scales)
- Using IC50 values without Cheng-Prusoff correction introduces systematic bias
- RF-Score training set overlap with PDBbind refined set inflates apparent performance; report separate results for overlapping and non-overlapping subsets