Active Learning for Budget-Constrained TCR--pMHC Wet-Lab Validation
Mazur, K.; Piotrowska, M.; Kowalski, J.
Abstract

Wet-lab validation of TCR--pMHC binding hypotheses is the rate-limiting step in T-cell therapy discovery: a single binding assay round can cost thousands of dollars and weeks of turnaround time, yet computational models generate thousands of candidate pairs per run. We frame this as a \emph{pool-based active learning} problem: given a fixed annotation budget $B$, which unlabeled pairs should be sent to the assay to maximally improve a predictive model that will guide the next screening round? We introduce \emph{UDAL} (Uncertainty--Diversity Active Learning), a batch acquisition strategy that combines BALD-based uncertainty estimation via MC Dropout with greedy core-set diversity selection in the encoder feature space. Evaluated on a curated VDJdb--IEDB benchmark under epitope-held-out and distance-aware protocols, UDAL achieves AUPRC 0.487 with only 5{,}000 queried labels---matching the performance of a model trained on 3$\times$ more randomly sampled labels. At a budget of 2{,}000 labels, UDAL improves AUPRC by 16.7\% over random acquisition, translating directly to fewer wasted assay slots. These results demonstrate that principled active query strategies can substantially reduce the wet-lab cost of building reliable TCR specificity models.
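The abstract's acquisition strategy (BALD uncertainty from MC Dropout combined with greedy core-set selection) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function names `bald_scores` and `udal_select`, the pre-filtering by a `candidate_factor`, and the Euclidean k-center greedy step are all assumptions about how the two components might be composed.

```python
import numpy as np

def bald_scores(mc_probs, eps=1e-12):
    """BALD for a binary binding classifier.

    mc_probs: array of shape (T, N) -- predicted binding probabilities for
    N unlabeled TCR--pMHC pairs across T MC Dropout forward passes.
    Returns H(mean prediction) - mean(per-pass entropy), which is
    non-negative and highest where the dropout passes disagree.
    """
    def entropy(p):
        return -(p * np.log(p + eps) + (1.0 - p) * np.log(1.0 - p + eps))
    return entropy(mc_probs.mean(axis=0)) - entropy(mc_probs).mean(axis=0)

def udal_select(features, mc_probs, batch_size, candidate_factor=4):
    """Select an acquisition batch: uncertainty filter, then diversity.

    features: (N, d) encoder embeddings of the unlabeled pool.
    mc_probs: (T, N) MC Dropout probabilities for the same pool.
    Keeps the batch_size * candidate_factor most uncertain pairs (BALD),
    then runs k-center greedy in feature space so the batch does not
    collapse onto near-duplicate TCRs. candidate_factor is a hypothetical
    knob, not a parameter stated in the paper.
    """
    scores = bald_scores(mc_probs)
    cand = np.argsort(-scores)[: batch_size * candidate_factor]
    feats = features[cand]
    # k-center greedy: seed with the most uncertain candidate, then
    # repeatedly add the candidate farthest from everything chosen so far.
    chosen = [0]
    min_dist = np.linalg.norm(feats - feats[0], axis=1)
    for _ in range(batch_size - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(feats - feats[nxt], axis=1))
    return cand[np.array(chosen)]
```

In a screening loop, `udal_select` would be called once per assay round: run T stochastic forward passes over the unlabeled pool, select `batch_size` pairs equal to the round's assay capacity, send them to the wet lab, and retrain on the augmented label set.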