Beyond the Hype: Genomic Foundation Models Show Mixed Results in Rigorous Evaluation
Key Takeaways
- Genomic foundation models have delivered genuine advances in variant effect prediction (e.g., Evo 2's performance on noncoding SNVs), but marketing claims of universal superiority across genomic tasks do not hold up under rigorous testing
- The GENEB benchmark reveals fundamental instability in how genomic models are evaluated: without a unified evaluation framework, the same model can look like a breakthrough in one paper and an underperformer in another
- On perturbation prediction and mechanistic interpretation tasks, simple linear baselines consistently outperform five foundation models and two other deep networks, which suggests foundation models are not the right approach for every genomic problem
- Sound evaluation methodology (held-out test sets, honest baselines, and vendor-independent benchmarks) is critical for separating genuine capabilities from leaderboard theatre, and is a prerequisite for clinical adoption
Summary
A comprehensive analysis of genomic foundation models in 2026 reveals a stark divide between marketing claims and verified capabilities. Frontier models such as Evo 2 and AlphaGenome excel at variant effect prediction, where they have matched or exceeded specialist tools, but struggle with perturbation response prediction, where simple linear baselines still outperform deep learning approaches. The analysis, which draws on the latest genomic literature, introduces the GENEB benchmark, which evaluated 40 genomic foundation models across 100 tasks and found that aggregate leaderboards are unstable: model rankings vary sharply from one task category to another. The research underscores a critical gap between vendor marketing, which highlights capability ledgers, and clinical utility, which requires validity ledgers based on held-out test sets with honest baselines. The findings also indicate that model architecture and pretraining alignment often outweigh parameter count, challenging the industry assumption that scale alone drives progress.
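The ranking instability described above can be made concrete with a small sketch. It ranks models within each task category, ranks them again by mean score, and measures how well the aggregate ranking agrees with each category using Kendall's rank correlation. Everything here is an assumption for illustration: the model names, task categories, and scores are hypothetical and are not taken from GENEB or any published result, and the scores are assumed to come from held-out test sets with a simple linear baseline included alongside the larger models.

```python
from itertools import combinations

# Hypothetical held-out scores (higher is better). These numbers are
# invented for illustration and are NOT taken from GENEB or any paper.
SCORES = {
    "variant_effect": {"model_a": 0.91, "model_b": 0.84, "linear_baseline": 0.70},
    "perturbation": {"model_a": 0.42, "model_b": 0.47, "linear_baseline": 0.55},
    "regulatory": {"model_a": 0.78, "model_b": 0.81, "linear_baseline": 0.65},
}


def rank_models(scores):
    """Return model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(order_x, order_y):
    """Kendall rank correlation between two orderings of the same items (no ties)."""
    pos_x = {name: i for i, name in enumerate(order_x)}
    pos_y = {name: i for i, name in enumerate(order_y)}
    concordant = discordant = 0
    for a, b in combinations(order_x, 2):
        # A pair is concordant when both orderings place a and b the same way round.
        if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def aggregate_ranking(score_table):
    """Rank models by their mean score across all task categories."""
    models = list(next(iter(score_table.values())))
    means = {
        m: sum(cat[m] for cat in score_table.values()) / len(score_table)
        for m in models
    }
    return rank_models(means)


per_category = {cat: rank_models(s) for cat, s in SCORES.items()}
overall = aggregate_ranking(SCORES)
agreement = {cat: kendall_tau(overall, order) for cat, order in per_category.items()}

print("aggregate ranking:", overall)
for cat, order in per_category.items():
    print(f"{cat}: {order} (tau vs aggregate = {agreement[cat]:+.2f})")
```

In this toy table the aggregate leaderboard agrees perfectly with one category and is negatively correlated with another, where the linear baseline ranks first. A single averaged score hides that disagreement, which is why per-category reporting with an explicit baseline is the more informative view.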
Editorial Opinion
The genomic AI field has confused capability with validity. Evo 2 and AlphaGenome represent real advances in variant prediction, but this analysis exposes a dangerous gap between what models can do and what they should be trusted to do in clinical settings. The emergence of vendor-independent benchmarks like GENEB is a healthy correction: molecular pathologists need honest comparisons, not marketing ledgers. Until rigorous evaluation becomes the norm rather than the exception, foundation models will remain tools for specific tasks rather than universal replacements for specialist software.



