Diversity-aware training data curation has recently been shown to outperform naive data scaling
for histopathology pre-training, yet no systematic study exists for fluorescence microscopy
fine-tuning — a domain with fundamentally different spatial statistics (4-channel single-cell
crops, 28 organelle classes, extreme class imbalance). We benchmark five curation strategies —
random sampling, k-Center Greedy coreset, Furthest Point Sampling (FPS), class-balanced oracle
selection, and a novel domain-specific BIO-Diversity score combining per-channel entropy with
patch-level boundary coverage — across four training data fractions (25%–100%) of the HPA
Single-Cell Classification dataset. At 50% of training data, BIO-Diversity selection matches the
macro-F1 of training on 75% of randomly sampled data and narrows the gap to the oracle by 62%,
while also doubling the effective rank of learned representations compared to random sampling at
equal budget. Our results demonstrate that morphological diversity metrics derived from biological
priors (channel balance and organelle boundary coverage) are strong proxies for training sample
utility in fluorescence microscopy fine-tuning.
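
To make the coreset baseline concrete, the following is a minimal sketch of k-Center Greedy selection. It is a generic textbook version, not the authors' implementation: the function name `k_center_greedy`, the use of Euclidean distance, the random choice of the first centre, and the toy 2-D "embeddings" are all assumptions made for illustration.

```python
import math
import random


def k_center_greedy(points, budget, seed=0):
    """Greedily pick `budget` indices; each pick is the point farthest
    from the already-selected set (hypothetical helper, Euclidean distance)."""
    if not 0 < budget <= len(points):
        raise ValueError("budget must be between 1 and len(points)")
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # assumption: random first centre
    selected = [first]
    # min_dist[i] = distance from point i to its nearest selected centre
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < budget:
        # The next centre is the point currently worst covered by the selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy embeddings: two tight clusters and one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
toy_selection = k_center_greedy(toy_points, budget=3)
```

On this toy input the selection contains one point from each cluster plus the isolated point, which is the spread-out behaviour that separates coreset methods from random sampling at the same budget.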
ponchik-monchik · with Irina Tirosyan, Yeva Gabrielyan, Vahe Petrosyan
We quantify the structural overlap between FDA-approved small molecule drugs and
clinical-stage candidates using a fully executable cheminformatics pipeline.
Applying our workflow to 3,280 approved drugs (ChEMBL phase 4) and 9,433 clinical
candidates (phases 1–3), with standardisation and PAINS removal, we find that
81.1% of approved drug chemical space is covered by at least one clinical candidate
at Tanimoto ≥ 0.4 (Morgan fingerprints, radius=2). The mean nearest-neighbour
similarity from an approved drug to the clinical pipeline is 0.580, suggesting
broad but imperfect overlap. Paradoxically, the clinical pipeline is structurally
more diverse than the approved set (scaffold diversity index 0.605 vs. 0.419), yet
18.9% of approved chemical space remains unoccupied — a measurable opportunity gap
for drug repurposing and scaffold exploration. Physicochemical properties differ
significantly between sets across all five tested dimensions (KS test, p < 0.05),
with clinical candidates being more lipophilic (mean LogP 2.84 vs. 1.92) and less
polar (TPSA 84.8 vs. 98.8 Ų) than approved drugs. The pipeline is fully
parameterised and reproducible on any ChEMBL phase subset.
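
The coverage and nearest-neighbour statistics can be illustrated with a small sketch. It assumes each fingerprint is represented as a Python set of on-bit indices. The helpers `tanimoto` and `coverage` and the toy fingerprints are hypothetical, and the real pipeline would derive Morgan fingerprints from structures with a cheminformatics toolkit rather than hand-written sets.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def coverage(approved, candidates, threshold=0.4):
    """Return (fraction of approved fingerprints with a neighbour at or above
    `threshold`, mean nearest-neighbour similarity to the candidate set)."""
    # Nearest-neighbour similarity of each approved fingerprint to the candidates.
    nn = [max(tanimoto(fp, c) for c in candidates) for fp in approved]
    covered = sum(s >= threshold for s in nn)
    return covered / len(nn), sum(nn) / len(nn)


# Toy fingerprints (illustrative bit indices, not real molecules).
toy_approved = [{1, 2, 3, 4}, {10, 11, 12}, {20, 21}]
toy_candidates = [{1, 2, 3, 5}, {10, 11, 30, 31, 32}, {40}]
toy_fraction, toy_mean_nn = coverage(toy_approved, toy_candidates)
```

Here only the first approved fingerprint reaches the 0.4 threshold (similarity 3/5), so the covered fraction is one third; the uncovered remainder corresponds to the "opportunity gap" reported in the abstract.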