Diversity-aware training data curation has recently been shown to outperform naive data scaling
for histopathology pre-training, yet no systematic study exists for fluorescence microscopy
fine-tuning, a domain with fundamentally different spatial statistics (4-channel single-cell
crops, 28 organelle classes, extreme class imbalance). We benchmark five curation strategies
across four training data fractions (25%–100%) of the HPA Single-Cell Classification dataset:
random sampling, k-Center Greedy coreset, Furthest Point Sampling (FPS), class-balanced oracle
selection, and a novel domain-specific BIO-Diversity score that combines per-channel entropy
with patch-level boundary coverage. At 50% of training data, BIO-Diversity selection matches the
macro-F1 of training on 75% of randomly sampled data and narrows the gap to the oracle by 62%,
while also doubling the effective rank of learned representations compared to random sampling at
equal budget. Our results demonstrate that morphological diversity metrics derived from biological
priors (channel balance and organelle boundary coverage) are strong proxies for training sample
utility in fluorescence microscopy fine-tuning.
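
For readers unfamiliar with the coreset baseline named above, the following is a minimal sketch of greedy k-center selection (farthest-first traversal) over precomputed feature vectors. It is an illustration under stated assumptions, not the benchmark's implementation: the function name `k_center_greedy`, the Euclidean metric, and the random choice of the first center are all hypothetical choices made here.

```python
import numpy as np


def k_center_greedy(features, budget, seed=0):
    """Pick `budget` row indices via greedy k-center (farthest-first traversal).

    Assumptions (not from the source): Euclidean distance, a random first
    center, and feature rows that are pairwise distinct.
    """
    n = features.shape[0]
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]
    # Distance from every sample to its nearest already-selected center.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        # The next center is the sample farthest from all current centers.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)
```

Selecting a 50% subset would then look like `k_center_greedy(embeddings, budget=len(embeddings) // 2)`, where `embeddings` is a hypothetical `(n_samples, dim)` array of per-crop features. Each step costs one pass over the data, so the sketch scales as O(n × budget) distance evaluations.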

Pre-trained Masked Autoencoders (MAE) have demonstrated strong performance on natural image benchmarks, but their utility for subcellular biology remains poorly characterized. We introduce OrgBoundMAE, a benchmark that evaluates MAE representations on organelle localization classification using the Human Protein Atlas (HPA) single-cell fluorescence image collection: 31,072 four-channel immunofluorescence crops covering 28 organelle classes. Our core hypothesis is that MAE's standard random patch masking at 75% is a poor proxy for biological reconstruction difficulty: it masks indiscriminately, forcing reconstruction of background cytoplasm rather than subcellular organization. We propose organelle-boundary-guided masking, which uses Cellpose-derived boundary maps to preferentially mask patches at subcellular boundaries, the regions of highest biological information density. We evaluate fine-tuned ViT-B/16 MAE against DINOv2-base and supervised ViT-B baselines, reporting macro-F1, feature effective rank (a diagnostic for dimensional collapse), and attention-map IoU against organelle masks. We show that boundary-guided masking recovers substantial macro-F1 relative to random masking at equivalent masking ratios, and that feature effective rank tracks this gap, which supports dimensional collapse as a mechanistic explanation for MAE's underperformance on rare organelle classes.
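
One way boundary-guided masking could be realized is sketched below. The abstract does not specify the sampling scheme, so everything here is an assumption: the function name `boundary_guided_mask`, the per-patch score (fraction of boundary pixels), the `temperature` parameter, and the use of the Gumbel top-k trick to sample patches without replacement with probability increasing in that score. The boundary map is assumed to be a precomputed binary array (for example, derived from a Cellpose segmentation).

```python
import numpy as np


def boundary_guided_mask(boundary_map, patch_size=16, mask_ratio=0.75,
                         temperature=0.1, seed=0):
    """Return a (grid_h, grid_w) boolean array; True marks a masked patch.

    Hypothetical scheme: patches are sampled without replacement with
    weights proportional to exp(boundary_fraction / temperature).
    """
    h, w = boundary_map.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Fraction of boundary pixels inside each patch.
    patches = boundary_map.astype(np.float64).reshape(gh, patch_size, gw, patch_size)
    score = patches.mean(axis=(1, 3)).ravel()
    # Gumbel top-k: adding Gumbel noise to the logits and keeping the k
    # largest keys samples k items without replacement.
    rng = np.random.default_rng(seed)
    keys = score / temperature + rng.gumbel(size=score.size)
    n_mask = int(round(mask_ratio * score.size))
    masked_idx = np.argsort(-keys)[:n_mask]
    mask = np.zeros(score.size, dtype=bool)
    mask[masked_idx] = True
    return mask.reshape(gh, gw)
```

For a 224×224 crop with 16×16 patches this masks 147 of 196 patches, matching the 75% ratio of standard MAE. A large `temperature` approaches uniform random masking, while a small one concentrates the mask on boundary-rich patches, so the same masking ratio can be compared across both regimes.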