Evo 2: AI Foundation Model Trained on 9 Trillion DNA Base Pairs Achieves Genome-Scale Design Across All Life
Key Takeaways
- Evo 2 is trained on 9 trillion DNA base pairs from all domains of life and has a 1 million token context window at single-nucleotide resolution
- The model predicts the functional impact of genetic variation, including pathogenic mutations and BRCA1 variants, without task-specific fine-tuning
- Evo 2 generates genome-scale sequences for mitochondrial, prokaryotic, and eukaryotic organisms, and its guided designs of chromatin accessibility patterns have been experimentally validated
Summary
Researchers have unveiled Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a comprehensive genomic atlas spanning all domains of life. Published in Nature, the model has a 1 million token context window at single-nucleotide resolution, which it uses both to predict the functional impact of genetic variation and to generate novel genomic sequences. Without task-specific fine-tuning, it accurately predicts the effects of genetic changes ranging from noncoding pathogenic mutations to clinically significant BRCA1 variants.
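The summary does not say how predictions are obtained without fine-tuning. One common approach for sequence language models is to compare the likelihood the model assigns to the reference sequence with the likelihood it assigns to the mutated sequence. The sketch below illustrates that idea only: a toy Markov chain stands in for the genomic model, and every name in it (fit_markov, sequence_log_likelihood, variant_score) is hypothetical and not part of any Evo 2 API.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fit_markov(sequences, order=2, pseudocount=1.0):
    """Toy stand-in for a genomic language model: a k-th order Markov chain.

    Returns (log_prob, order), where log_prob(context, base) is the smoothed
    log-probability of `base` following `context`.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob, order


def sequence_log_likelihood(seq, model):
    """Sum of per-base log-probabilities under the toy model."""
    log_prob, order = model
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def variant_score(reference, position, alt_base, model):
    """Log-likelihood of the variant minus that of the reference.

    Negative scores mean the model finds the variant less plausible than
    the reference, which is read here as a proxy for a disruptive change.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return sequence_log_likelihood(variant, model) - sequence_log_likelihood(reference, model)


if __name__ == "__main__":
    # Hypothetical training data: a simple repeating pattern.
    model = fit_markov(["ACGTACGTACGTACGT"] * 3)
    reference = "ACGTACGTACGT"
    for alt in ALPHABET:
        print(alt, round(variant_score(reference, 5, alt, model), 3))
```

No labeled variants are used at any point, which is what makes this style of scoring zero-shot. A real genomic model would replace the Markov chain with its own per-base log-probabilities.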
Mechanistic interpretability analyses show that Evo 2 has learned representations associated with exon-intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. The model also generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale, with greater naturalness and coherence than previous methods. When guided by predictive models and inference-time search, Evo 2 generates chromatin accessibility patterns that have been validated experimentally.
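To make "guided by predictive models and inference-time search" concrete, the following is a minimal sketch of one way such guidance can work: a generator proposes candidate continuations, a separate scoring function ranks the extended sequences, and only the best few are kept at each step. It is a beam-style search chosen for illustration, not the authors' procedure. The random proposal function and the GC-content score are hypothetical stand-ins for a generative model and a chromatin accessibility predictor.

```python
import random

ALPHABET = "ACGT"


def propose_chunks(prefix, num_candidates, chunk_len, rng):
    """Hypothetical stand-in for sampling continuations from a generative model.

    A real generator would condition on `prefix`; this toy version ignores it
    and samples bases uniformly at random.
    """
    return [
        "".join(rng.choice(ALPHABET) for _ in range(chunk_len))
        for _ in range(num_candidates)
    ]


def gc_fraction(seq):
    """Fraction of G or C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def score(seq, target_gc):
    """Hypothetical stand-in for a predictive model; higher is better."""
    return -abs(gc_fraction(seq) - target_gc)


def guided_beam_search(target_gc, total_len=48, chunk_len=8,
                       beam_width=4, num_candidates=8, seed=0):
    """Grow sequences chunk by chunk, keeping the top-scoring beams each round."""
    rng = random.Random(seed)
    beams = [""]
    while len(beams[0]) < total_len:
        expanded = [
            prefix + chunk
            for prefix in beams
            for chunk in propose_chunks(prefix, num_candidates, chunk_len, rng)
        ]
        expanded.sort(key=lambda s: score(s, target_gc), reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


if __name__ == "__main__":
    design = guided_beam_search(target_gc=0.75)
    print(design, gc_fraction(design))
```

The generator is never retrained in this scheme. All of the steering happens at inference time through the scoring function, so the same generator can be pointed at a different design target by swapping in a different predictor.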
The research team has made Evo 2 fully open source, releasing the model parameters, training code, inference code, and the OpenGenome2 dataset. The release is intended to accelerate the exploration and design of biological complexity across the research community. As an application of artificial intelligence to genomics, the model could change how researchers understand and engineer biological systems across all forms of life.
- Full open-source release includes model weights, training code, inference code, and the OpenGenome2 dataset
- Mechanistic analysis shows the model learns biologically meaningful representations of genomic features like exon-intron boundaries and transcription factor binding sites
Editorial Opinion
Evo 2 represents a watershed moment in computational biology, demonstrating that foundation models can capture meaningful biological structure at genome scale. The decision to fully open-source both the model and the 9 trillion base pair training dataset is particularly commendable and could broaden access to cutting-edge genomic AI tools. However, the real test will be whether Evo 2's predictions hold up under experimental validation across diverse biological contexts, and whether the research community can responsibly navigate the dual-use implications of AI-powered genome design.



