MSAgent: An Evidence Grounded Agentic Framework for LLM-driven Scientific Exploration in Mass Spectrometry-based Metabolomics

This paper is a preprint and has not been certified by peer review.

Authors

Li, Y.; Zhong, Y.; Liu, P.; Yusheng, T.; Zhan, H.; Xia, J.

Abstract

Mass spectrometry (MS) is a cornerstone high-throughput technology for molecular discovery, yet reliable elucidation of chemical structures remains a formidable, expert-dependent bottleneck. Currently, reliable molecular identification from raw mass spectra requires manual assembly: labor-intensive heuristic reasoning and tedious integration of siloed computational tools. This perpetuates a profound throughput gap between rapid data acquisition and the slow pace of structural annotation. Here we present MSAgent, an autonomous agentic framework that bridges the gap between computational automation and expert intuition by emulating the cognitive logic of human specialists. By orchestrating MSToolbox, a collection of over 50 domain-specific tools, via Large Language Models (LLMs), MSAgent dynamically unifies the analytical pipeline into a scalable, evidence-grounded workflow, enabling intent-aware planning, synthesis of outputs across resources, and visual mechanistic interpretation within traceable reasoning chains and evidence-backed analytical reports. We evaluated MSAgent on multiple open benchmarks, including the established community challenges Critical Assessment of Small Molecule Identification (CASMI) 2016/2022, CANOPUS, and LLM-oriented test cases. On CASMI, MSAgent consistently improves retrieval performance by over 10% in mean reciprocal rank (MRR) across diverse benchmarks while ensuring high reliability, improving or preserving ranks in 95% of cases. For the more challenging de novo molecular tasks on CANOPUS, MSAgent consistently refines the outputs of baseline models, yielding an average gain of over 40% in Tanimoto similarity to the ground truth. In addition, MSAgent shows marked advantages over LLMs without domain tool support in mitigating hallucination, producing better-calibrated confidence (Pearson r = 0.438 vs. -0.219 for gpt-4o). It improves the exact-match rate by 38.8% over gpt-4o in candidate discrimination tasks and achieves a 64% success rate in recommending high-quality candidate structures (Tanimoto similarity above 0.7), whereas gpt-4o predominantly selected candidates with similarity below 0.3. Our work enables intent-driven, automated analysis of high-throughput mass spectrometry data, lowering the barrier for non-experts to obtain molecular identification results through a transparent analytical process, and accelerates discovery in metabolism and related fields by bridging the gap between experimental data acquisition and computational interpretation.
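
For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how mean reciprocal rank, the share of improved-or-preserved ranks, and Tanimoto similarity are conventionally computed. It is not the authors' evaluation code: the function names, the example ranks, and the representation of a fingerprint as a set of "on" bit indices are all illustrative assumptions, and the sketch does not settle whether the reported MRR gain is absolute or relative.

```python
from typing import Sequence, Set


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over queries; each rank is the 1-based position of the true structure."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def improved_or_preserved(before: Sequence[int], after: Sequence[int]) -> float:
    """Fraction of queries whose rank after re-ranking is no worse than before."""
    if len(before) != len(after) or not before:
        raise ValueError("rank lists must be non-empty and of equal length")
    return sum(a <= b for b, a in zip(before, after)) / len(before)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical ranks of the correct structure for four query spectra.
baseline_ranks = [1, 4, 2, 10]
reranked_ranks = [1, 2, 2, 5]

mrr_before = mean_reciprocal_rank(baseline_ranks)
mrr_after = mean_reciprocal_rank(reranked_ranks)
kept_or_better = improved_or_preserved(baseline_ranks, reranked_ranks)

# Hypothetical fingerprints for a predicted and a ground-truth structure.
similarity = tanimoto({1, 2, 3, 4}, {2, 3, 4, 5})
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets; the set form is used here only to keep the arithmetic visible.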
