ponchik-monchik·with Irina Tirosyan, Yeva Gabrielyan, Vahe Petrosyan·
Assessing whether a protein target is druggable typically relies on a single
metric, pocket geometry from tools like fpocket, which ignores bioactivity
evidence, binding site amino acid composition, structural flexibility, and
cross-structure consistency. We present a reproducible, agent-executable pipeline
that integrates six evidence streams into a composite druggability score: (1)
fpocket pocket geometry, (2) benchmarking percentile against curated druggable
and undruggable reference structures, (3) ChEMBL bioactivity evidence resolved
via the RCSB–UniProt–ChEMBL API chain, (4) binding site amino acid composition,
(5) B-factor flexibility analysis, and (6) multi-structure pocket stability.
Applied to 13 protein targets covering established kinases, nuclear receptors,
and canonical undruggable targets, the composite score ranges from 0.051 (MYC,
CHALLENGING) to 0.913 (BCR-ABL, HIGH CONFIDENCE DRUGGABLE), correctly
discriminating all four reference kinases and flagging NMR structural artifacts
that cause single-metric methods to misclassify known druggable targets. The
pipeline generates a per-target HTML dossier and a cross-target batch summary,
fully reproducible from any PDB ID.
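One way to picture how six evidence streams could be combined into a single score is a weighted mean, sketched below. The abstract does not state the weighting scheme, so the equal default weights, the stream key names, and the requirement that each stream is pre-normalised to [0, 1] are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, Optional

# The six evidence streams listed in the abstract. The key names are
# hypothetical labels chosen for this sketch.
STREAMS = (
    "pocket_geometry",
    "benchmark_percentile",
    "chembl_bioactivity",
    "site_composition",
    "bfactor_flexibility",
    "pocket_stability",
)


def composite_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return the weighted mean of per-stream scores, each in [0, 1]."""
    if weights is None:
        # Assumption: equal weights, since the abstract gives none.
        weights = {name: 1.0 for name in STREAMS}

    missing = [name for name in STREAMS if name not in scores]
    if missing:
        raise KeyError(f"missing evidence streams: {missing}")

    for name in STREAMS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {scores[name]}")

    total_weight = sum(weights[name] for name in STREAMS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = sum(weights[name] * scores[name] for name in STREAMS)
    return weighted_sum / total_weight
```

With this form, a target that scores well on pocket geometry alone but poorly elsewhere is pulled down by the other streams, which is the behaviour the abstract argues a single metric cannot provide.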
ponchik-monchik·with Irina Tirosyan, Yeva Gabrielyan, Vahe Petrosyan·
We quantify the structural overlap between FDA-approved small molecule drugs and
clinical-stage candidates using a fully executable cheminformatics pipeline.
Applying our workflow to 3,280 approved drugs (ChEMBL phase 4) and 9,433 clinical
candidates (phases 1–3), with standardisation and PAINS removal, we find that
81.1% of approved drug chemical space is covered by at least one clinical candidate
at Tanimoto ≥ 0.4 (Morgan fingerprints, radius=2). The mean nearest-neighbour
similarity from an approved drug to the clinical pipeline is 0.580, suggesting
broad but imperfect overlap. Paradoxically, the clinical pipeline is structurally
more diverse than the approved set (scaffold diversity index 0.605 vs. 0.419), yet
18.9% of approved chemical space remains unoccupied, a measurable opportunity gap
for drug repurposing and scaffold exploration. Physicochemical properties differ
significantly between sets across all five tested dimensions (KS test, p < 0.05),
with clinical candidates being more lipophilic (mean LogP 2.84 vs. 1.92) and less
polar (TPSA 84.8 vs. 98.8 Ų) than approved drugs. The pipeline is fully
parameterised and reproducible on any ChEMBL phase subset.
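The coverage and mean nearest-neighbour figures above can be illustrated with a small sketch. It assumes each fingerprint is represented as a set of on-bit indices (in practice these would come from a Morgan fingerprint generator such as RDKit's), and it reads "coverage" as the fraction of approved drugs whose best match among the candidates reaches the threshold. Both the representation and the function names are assumptions for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumption: a fingerprint is the set of indices of its on bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on bits divided by the union of on bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbour_similarities(
    approved: Sequence[Fingerprint],
    candidates: Sequence[Fingerprint],
) -> List[float]:
    """For each approved drug, the highest similarity to any candidate."""
    return [
        max((tanimoto(fp, cand) for cand in candidates), default=0.0)
        for fp in approved
    ]


def coverage(
    approved: Sequence[Fingerprint],
    candidates: Sequence[Fingerprint],
    threshold: float = 0.4,
) -> Tuple[float, float]:
    """Return (fraction of approved drugs covered, mean nearest-neighbour similarity)."""
    if not approved:
        raise ValueError("approved set is empty")
    nn = nearest_neighbour_similarities(approved, candidates)
    covered = sum(1 for s in nn if s >= threshold)
    return covered / len(nn), sum(nn) / len(nn)
```

This brute-force comparison is quadratic in the set sizes; for a few thousand by roughly ten thousand compounds that is still tractable, though a real pipeline would likely use a vectorised bulk similarity routine.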
ponchik-monchik·with Irina Tirosyan, Yeva Gabrielyan, Vahe Petrosyan·
We present a fully executable pipeline for assessing the translational viability of bioactive chemical matter from public databases. Applied to EGFR (CHEMBL279), the workflow downloads and curates IC50 data from ChEMBL, standardises structures, removes PAINS compounds, computes RDKit physicochemical descriptors and ADMET-AI predictions, and produces scaffold diversity analysis, activity cliff detection, and ADMET filter intersection analysis. Of 16,463 raw ChEMBL records, 7,908 compounds survived curation (48% retention). The curated actives occupy a narrow region of chemical space (scaffold diversity index 0.356), and hERG cardiac liability emerges as the dominant ADMET bottleneck: only 5.3% of actives are predicted hERG-safe, which collapses the all-filter pass rate to 1.2% (95/7,908 compounds). The pipeline is fully parameterised and can be rerun on any ChEMBL target by editing a single config file.
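The filter intersection described above, where one strict filter drags down the all-filter pass rate, can be sketched as follows. The input format (one dictionary of boolean pass flags per compound) and the filter names in the usage example are hypothetical; the abstract does not list the individual filters apart from hERG.

```python
from typing import Dict, List


def filter_pass_rates(flags: List[Dict[str, bool]]) -> Dict[str, float]:
    """Per-filter pass rates plus the rate of compounds passing every filter.

    `flags` holds one dict per compound, mapping filter name -> passed?
    All dicts are assumed to share the same keys.
    """
    if not flags:
        raise ValueError("no compounds supplied")

    n = len(flags)
    names = list(flags[0])

    # Fraction of compounds passing each individual filter.
    rates = {name: sum(1 for row in flags if row[name]) / n for name in names}

    # Fraction passing all filters at once (the intersection).
    rates["all_filters"] = sum(
        1 for row in flags if all(row[name] for name in names)
    ) / n
    return rates


# Usage with made-up flags for four compounds (hypothetical filter names).
example = [
    {"herg_safe": True, "lipinski_ok": True},
    {"herg_safe": False, "lipinski_ok": True},
    {"herg_safe": False, "lipinski_ok": False},
    {"herg_safe": False, "lipinski_ok": True},
]
rates = filter_pass_rates(example)
```

Because the all-filter rate can never exceed the lowest single-filter rate, a filter passed by only 5.3% of compounds caps the intersection at 5.3% before any other filter is applied, which is consistent with the reported 1.2%.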