SNPic: SNP Topic Modeling for Interpretable Clustering of Complex phenotypes
Leyi, Z.; Seiler, C.; Speed, D.; Micheroli, R.; Ospelt, C.
Abstract
Genome-wide association studies (GWAS) have cataloged thousands of disease-associated variants, yet a central challenge remains: decoding the shared, pleiotropic architecture that links complex phenotypes. Existing approaches, including dimensionality reduction methods and multivariate genetic models, either lack interpretability or rely on external linkage disequilibrium (LD) reference panels, limiting their ability to recover coherent biological mechanisms. Here we introduce the SNP topic model (SNPic), a generative probabilistic framework that reframes GWAS summary statistics as a structured corpus and models genetic architecture using principles from Natural Language Processing (NLP). By treating phenotypes as documents and either genes or the whole corpus of traits as words, SNPic applies topic models such as Latent Dirichlet Allocation (LDA) to infer latent "genetic topics": interpretable, overlapping biological modules that jointly explain complex traits. This formulation enables simultaneous reconstruction of trait relationships and identification of their underlying molecular drivers. SNPic integrates two complementary schemes within a unified modeling framework: Sumstat-as-word, which captures global phenotypic structure, and Gene-as-word, which resolves mechanistic detail. To ensure robustness, we introduce a stability-optimized inference pipeline based on bootstrap resampling, which allows data-driven selection of the number of topics and filtering of stochastic signals. Across extensive simulations, SNPic consistently outperforms conventional dimensionality reduction methods in recovering latent structure under both linear and non-linear, highly overlapping genetic architectures. Applied to integrated FinnGen and UK Biobank datasets, SNPic identifies reproducible genetic topics corresponding to distinct biological programs, including HLA-mediated immune processes and transporter-driven metabolic regulation, with strong tissue-specific support.
The framework further generalizes across species, organizing complex traits in maize, Arabidopsis thaliana, and cattle into biologically coherent modules. Together, these results establish SNPic as a scalable and interpretable framework that shifts GWAS analysis from association cataloging toward the construction of a knowledge graph representing the latent semantic architecture of the genome. By unifying statistical genetics with NLP, SNPic reframes GWAS analysis as a probabilistic language modeling task, enabling the systematic decoding of complex trait architectures and delivering a systemic graph of cross-phenotype relationships.
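To make the Gene-as-word idea and the bootstrap-based choice of topic number concrete, the following is a minimal sketch and not the SNPic implementation. It assumes a trait-by-gene count matrix (how GWAS summary statistics are converted to counts is not specified here, so the matrix is simulated), uses scikit-learn's `LatentDirichletAllocation` as the topic model, and scores stability as the mean cosine similarity between matched topics from the full data and from bootstrap resamples of traits. The function names (`simulate_corpus`, `fit_topics`, `topic_stability`), the resampling unit, and the stability score are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import LatentDirichletAllocation


def simulate_corpus(n_traits=40, n_genes=60, n_topics=3, depth=200, seed=0):
    """Simulate a trait-by-gene count matrix (hypothetical stand-in for real data)."""
    rng = np.random.default_rng(seed)
    # Each latent topic is a sparse distribution over genes ("words").
    topic_gene = rng.dirichlet(np.full(n_genes, 0.1), size=n_topics)
    # Each trait ("document") is a mixture of topics.
    trait_topic = rng.dirichlet(np.full(n_topics, 0.3), size=n_traits)
    probs = trait_topic @ topic_gene
    probs /= probs.sum(axis=1, keepdims=True)  # guard against rounding drift
    return np.vstack([rng.multinomial(depth, p) for p in probs])


def fit_topics(counts, k, seed=0):
    """Fit LDA; return trait-topic weights and row-normalized topic-gene weights."""
    lda = LatentDirichletAllocation(n_components=k, random_state=seed)
    trait_topic = lda.fit_transform(counts)
    topic_gene = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return trait_topic, topic_gene


def topic_stability(counts, k, n_boot=5, seed=0):
    """Mean matched cosine similarity between full-data and bootstrap topics.

    Assumption: traits (rows) are the resampling unit, and topics are matched
    across fits with the Hungarian algorithm.
    """
    rng = np.random.default_rng(seed)
    _, ref = fit_topics(counts, k, seed)
    ref_unit = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    scores = []
    for b in range(n_boot):
        idx = rng.integers(0, counts.shape[0], size=counts.shape[0])
        _, boot = fit_topics(counts[idx], k, seed + b + 1)
        boot_unit = boot / np.linalg.norm(boot, axis=1, keepdims=True)
        sim = ref_unit @ boot_unit.T  # k x k cosine similarities
        rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
        scores.append(sim[rows, cols].mean())
    return float(np.mean(scores))


if __name__ == "__main__":
    counts = simulate_corpus()
    for k in (2, 3, 4, 5):
        print(k, round(topic_stability(counts, k, n_boot=3), 3))
```

In a sketch like this, one would pick the topic number with the highest stability score and treat topics that fail to reappear across resamples as candidates for filtering. The thresholds and the exact stability criterion used by SNPic are not stated in the abstract.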