Canonical self-supervised pretraining paradigm constrains the capacity of genomic language models on regulatory decoding
Liang, Y.-X.; Wang, Y.; Pan, W.-Y.; Chen, Z.-Y.; Wei, J.-C.; Gao, G.
Abstract

Recent studies suggest that genomic language models (gLMs) could help decode the genomic regulatory code. Here, we systematically evaluated 11 representative gLMs across multiple regulatory genomics applications and found that current gLMs offer limited advantages over a random baseline. Further analysis revealed a systematic misalignment between the canonical sequence-only self-supervised pretraining paradigm and the context-specific, dynamic nature of gene regulation, highlighting the need for function-oriented pretraining strategies that explicitly incorporate biochemical and regulatory priors.
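To make the comparison in the abstract concrete, the sketch below illustrates one common way such an evaluation is run: extract frozen embeddings from a pretrained checkpoint and from a randomly initialized copy of the same architecture, then score each with a linear probe on a labeled task. This is a minimal sketch under stated assumptions, not the paper's pipeline: the checkpoint name, toy sequences, and random labels are placeholders introduced for illustration, and real gLM checkpoints may require `trust_remote_code=True` or expose different tokenizers and output formats.

```python
# Hypothetical pretrained-vs-random-baseline probe; all data below is a toy stand-in.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from transformers import AutoConfig, AutoModel, AutoTokenizer

CHECKPOINT = "org/genomic-lm"  # placeholder checkpoint name, not from the paper

def embed(model, tokenizer, seqs):
    """Mean-pool final hidden states into one fixed-size vector per sequence."""
    model.eval()
    with torch.no_grad():
        # Assumes the tokenizer defines a pad token.
        batch = tokenizer(seqs, return_tensors="pt", padding=True)
        hidden = model(**batch).last_hidden_state       # (batch, length, dim)
        mask = batch["attention_mask"].unsqueeze(-1)    # (batch, length, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # mask-aware mean pooling
    return pooled.numpy()

def probe_auroc(features, labels):
    """Held-out AUROC of a logistic-regression probe on frozen embeddings."""
    Xtr, Xte, ytr, yte = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
pretrained = AutoModel.from_pretrained(CHECKPOINT)
# Same architecture with freshly initialized weights: the "random baseline".
random_init = AutoModel.from_config(AutoConfig.from_pretrained(CHECKPOINT))

# Toy stand-in data; a real evaluation uses labeled regulatory sequences.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(60)]
labels = rng.integers(0, 2, size=60)

for name, model in [("pretrained", pretrained), ("random baseline", random_init)]:
    print(f"{name}: AUROC = {probe_auroc(embed(model, tokenizer, seqs), labels):.3f}")
```

If pretraining had captured regulatory signal, the pretrained probe should beat the randomly initialized one by a clear margin; the abstract's claim is that, across the evaluated applications, this margin is small.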