Evo 2: Open-Source AI Trained on Trillions of DNA Bases Can Decode Complex Genomes
Key Takeaways
- ▸Evo 2 is trained on 8.8 trillion DNA bases from bacteria, archaea, and eukaryotes, making it capable of analyzing complex genomes including human DNA
- ▸The system can identify subtle genomic features like splice sites and regulatory sequences that existing tools struggle to detect accurately
- ▸Training involved two stages: initial 8,000-base chunks for feature learning, then million-base sequences for large-scale pattern recognition
Summary
Researchers have released Evo 2, an open-source AI system trained on 8.8 trillion DNA bases spanning all three domains of life—bacteria, archaea, and eukaryotes. Building on its predecessor Evo, which focused on bacterial genomes, Evo 2 tackles the significantly more complex task of interpreting eukaryotic genomes like those of humans. The system can identify genes, regulatory sequences, splice sites, and other genomic features that are often challenging even for specialized bioinformatics tools to detect accurately.
Evo 2 is built on StripedHyena 2, a hybrid architecture that combines convolutional layers with attention, and was trained in two stages on the OpenGenome2 dataset. The first stage used 8,000-base chunks to teach the model local features; the second used sequences up to one million bases long so it could pick up large-scale genomic patterns. Bacterial genomes have relatively straightforward gene organization. Eukaryotic genomes, by contrast, contain coding sequences interrupted by introns, scattered regulatory elements, and vast stretches of non-coding DNA, all of which make pattern recognition far harder.
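To make the two window sizes concrete, here is a minimal sketch of cutting a sequence into fixed-length chunks. The function name `chunk_sequence`, the choice of non-overlapping windows, and the toy sizes are illustrative assumptions; the source does not describe how Evo 2's data pipeline actually splits sequences.

```python
def chunk_sequence(seq: str, window: int) -> list[str]:
    """Split a DNA string into consecutive, non-overlapping windows.

    The final window may be shorter than `window` when the sequence
    length is not an exact multiple of it.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [seq[i:i + window] for i in range(0, len(seq), window)]


# Toy stand-in for a genome: 20 bases instead of millions.
genome = "ATGCGTACGTTAGCCGATAA"

# Stage 1 analogue: short windows (8,000 bases in the article; 8 here).
short_chunks = chunk_sequence(genome, 8)

# Stage 2 analogue: long windows (up to one million bases in the article; 16 here).
long_chunks = chunk_sequence(genome, 16)
```

Short windows expose the model to many local examples, while long windows keep distant elements, such as a gene and a far-off regulatory region, inside the same training example.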
The model developed internal representations of key genomic features, including weakly defined sequences such as splice sites and transcription factor binding sites, where each position favors certain bases rather than strictly requiring them. Notably, the researchers excluded viruses that infect eukaryotes from the training data, citing biosafety concerns that the model could be misused to create human pathogens. Releasing Evo 2 as open-source software could accelerate genome annotation, comparative genomics, and research into gene regulation across the tree of life.
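The idea that sites like these have probabilistic rather than absolute base requirements is commonly illustrated with a position weight matrix, which scores how well a stretch of DNA fits a motif. The sketch below is a classical technique, not a description of how Evo 2 works internally; the motif probabilities, the uniform background, and the function names are all invented for illustration.

```python
import math

# Hypothetical per-position base probabilities for a 4-base motif.
# These numbers are invented for illustration, not measured from any genome.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumption: all four bases equally likely by chance


def log_odds_score(site: str) -> float:
    """Sum log2(motif probability / background) over every position."""
    if len(site) != len(MOTIF):
        raise ValueError("site length must match motif length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(MOTIF, site))


def best_site(seq: str) -> tuple[int, str, float]:
    """Slide the motif along `seq`; return (offset, site, score) of the top match."""
    k = len(MOTIF)
    candidates = [
        (i, seq[i:i + k], log_odds_score(seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    ]
    return max(candidates, key=lambda c: c[2])


perfect = log_odds_score("AGGT")  # preferred base at every position
near = log_odds_score("AGGC")     # one mismatch: lower score, not disqualified
offset, site, score = best_site("CCAGGTCC")
```

A site with one mismatch still scores above zero here, which is the sense in which the requirement is probabilistic: no single base is mandatory, but each deviation makes a match less likely.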
- ▸The model is released as open source, though the training data excluded eukaryotic viruses to limit potential biosecurity risks
Editorial Opinion
Evo 2 represents a watershed moment in computational biology, demonstrating that foundation models can tackle the messy complexity of real-world genomes rather than just the tidy patterns found in bacterial DNA. The decision to release this powerful tool as open-source while thoughtfully excluding potentially dangerous viral sequences shows responsible innovation in action. This could democratize advanced genome analysis capabilities that were previously accessible only to well-funded institutions, potentially accelerating discoveries in medicine, agriculture, and evolutionary biology across the research community.



