Genome modelling and design across all domains of life with Evo 2, 2026, Brixi et al

Jaybee00

Genome modelling and design across all domains of life with Evo 2

Brixi, Garyk; Durrant, Matthew G.; Ku, Jerome; Naghipourfar, Mohsen; Poli, Michael; Sun, Gwanggyu; Brockman, Greg; Chang, Daniel; Fanton, Alison; Gonzalez, Gabriel A.; King, Samuel H.; Li, David B.; Merchant, Aditi T.; Nguyen, Eric; Ricci-Tam, Chiara; Romero, David W.; Schmok, Jonathan C.; Taghibakhshi, Ali; Vorontsov, Anton; Yang, Brandon; Deng, Myra; Gorton, Liv; Nguyen, Nam; Wang, Nicholas K.; Pearce, Michael T.; Simon, Elana; Adams, Etowah; Amador, Zachary J.; Ashley, Euan A.; Baccus, Stephen A.; Dai, Haoyu; Dillmann, Steven; Ermon, Stefano; Guo, Daniel; Herschl, Michael H.; Ilango, Rajesh; Janik, Ken; Lu, Amy X.; Mehta, Reshma; Mofrad, Mohammad R. K.; Ng, Madelena Y.; Pannu, Jaspreet; Ré, Christopher; St. John, John; Sullivan, Jeremy; Tey, Joseph; Viggiano, Ben; Zhu, Kevin; Zynda, Greg; Balsam, Daniel; Collison, Patrick; Costa, Anthony B.; Hernandez-Boussard, Tina; Ho, Eric; Liu, Ming-Yu; McGrath, Thomas; Powell, Kimberly; Pinglay, Sudarshan; Burke, Dave P.; Goodarzi, Hani; Hsu, Patrick D.; Hie, Brian L.

Abstract
All of life encodes information with DNA. Although tools for genome sequencing, synthesis and editing have transformed biological research, we still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn information from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities [1,2].

Here we introduce Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life to have a 1 million token context window with single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific fine-tuning.

Mechanistic interpretability analyses reveal that Evo 2 learns representations associated with biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Evo 2 also generates experimentally validated chromatin accessibility patterns when guided by predictive models [3,4] and inference-time search.

We have made Evo 2 fully open, including model parameters, training code [5], inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Web | DOI | PDF | Nature
 
In case anyone has some enthusiasm for further exploration:

Code and tools for model exploration are available at the following links:

The following model parameters are available on Hugging Face:

Evo 2 40B: https://huggingface.co/arcinstitute/evo2_40b
Evo 2 7B: https://huggingface.co/arcinstitute/evo2_7b
Evo 2 40B base: https://huggingface.co/arcinstitute/evo2_40b_base
Evo 2 7B base: https://huggingface.co/arcinstitute/evo2_7b_base
Evo 2 1B base: https://huggingface.co/arcinstitute/evo2_1b_base
Evo 2 layer-26 mixed prokaryotic/eukaryotic SAE: https://huggingface.co/Goodfire/Evo-2-Layer-26-Mixed
Exon classifier trained on Evo 2 7B base embeddings: https://huggingface.co/schmojo/evo2-exon-classifier
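
For anyone who wants to fetch one of the checkpoints listed above, here is a minimal sketch using `snapshot_download` from the `huggingface_hub` package. The repository IDs are copied from the links above; the short keys (`"7b"`, `"1b_base"` and so on), the helper name `download_evo2` and the target directory are placeholders of my own. I have not checked the file layout or size of each repository, and the 40B checkpoints are likely to be very large.

```python
# Repository IDs copied from the Hugging Face links in the post above.
# The short keys are hypothetical labels chosen for this sketch.
EVO2_REPOS = {
    "40b": "arcinstitute/evo2_40b",
    "7b": "arcinstitute/evo2_7b",
    "40b_base": "arcinstitute/evo2_40b_base",
    "7b_base": "arcinstitute/evo2_7b_base",
    "1b_base": "arcinstitute/evo2_1b_base",
}


def download_evo2(name: str, local_dir: str) -> str:
    """Download all files of one checkpoint and return the local path."""
    # Imported lazily so the mapping above is usable without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=EVO2_REPOS[name], local_dir=local_dir)
```

Calling `download_evo2("1b_base", "./evo2_1b_base")` would be the cheapest way to try it, since that is the smallest model in the list.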
 
Looks very interesting, and potentially relevant if used for annotation and for helping to find the impact (on, say, transcription) of SNPs, and possibly for understanding the genome more widely. Between this and AlphaGenome, it looks like there’s some really interesting and potentially important work being done to help understand the language of biology.

There’s also an article on Ars Technica, which may be more accessible to people than the paper:

To test the system, the researchers started making single-base mutations and fed them into Evo 2 to see how it responded. Evo 2 could detect problems when the mutations affected the sites in DNA where transcription into RNA started, or the sites where translation of that RNA into protein started. It also recognized the severity of mutations. Those that would interrupt protein translation, such as the introduction of stop signals, were identified as more significant changes than those that left the translation intact.
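
To make the distinction in the passage above concrete (mutations that introduce a stop signal versus those that leave translation intact), here is a small sketch that enumerates every single-base substitution in a coding sequence and labels it using the standard genetic code. This is not how Evo 2 works internally; it is the textbook rule-based classification that the model's rankings could be compared against. The function names and the example sequences are my own, and the sketch assumes an in-frame coding sequence whose length is a multiple of three.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(cds: str, pos: int, alt: str) -> str:
    """Label one substitution in an in-frame coding sequence (0-based pos)."""
    if cds[pos] == alt:
        raise ValueError("alt base equals the reference base")
    start = pos - pos % 3          # first base of the affected codon
    offset = pos % 3               # position of the change within the codon
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"        # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"          # premature stop codon introduced
    if ref_aa == "*":
        return "stop_lost"         # original stop codon removed
    return "missense"              # amino acid changed


def all_snvs(cds: str):
    """Yield (pos, alt, label) for every possible single-base substitution."""
    for pos, ref in enumerate(cds):
        for alt in "ACGT":
            if alt != ref:
                yield pos, alt, classify_snv(cds, pos, alt)
```

For example, in the toy sequence `ATGTGGTAA`, changing the last base of the `TGG` codon to `A` gives `TGA`, which the sketch labels as `nonsense`.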

It also recognized when sequences weren’t translated at all. Many key cellular functions are carried out directly by RNAs, and Evo 2 was able to recognize when mutations disrupted those, as well.
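
As I understand it, this kind of test with a sequence model generally comes down to asking how much less plausible the model finds the sequence after the mutation than before it. The thread does not spell out Evo 2's exact scoring procedure, so the sketch below only illustrates the idea: a tiny first-order Markov model stands in for the real model, and the score is the log-likelihood of the mutated sequence minus that of the reference. All names here (`fit_markov`, `delta_score`) and the training sequence are made up for the example.

```python
import math

ALPHABET = "ACGT"


def fit_markov(seqs):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_likelihood(model, seq):
    """Log-probability of seq: uniform first base, then model transitions."""
    ll = math.log(1.0 / len(ALPHABET))
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(model[prev][cur])
    return ll


def delta_score(model, ref, pos, alt):
    """Log-likelihood of the mutated sequence minus that of the reference.

    Negative values mean the stand-in model finds the mutation less plausible.
    """
    var = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(model, var) - log_likelihood(model, ref)


# Toy stand-in model trained on a single repetitive sequence (an assumption).
toy_model = fit_markov(["ACGTACGTACGT"])
```

With this toy model, `delta_score(toy_model, "ACGTACGT", 2, "A")` is negative because the change breaks the repeating pattern the model was trained on. A real genomic model would be doing the same comparison with far richer context.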

The big open question is whether this system has identified anything that we don’t know how to test for. Things like intron/exon boundaries and regulatory DNA have been subjected to decades of study so that we already knew how to look for them and can recognize when Evo 2 spots them. But we’ve discovered a steady stream of new features in the genome—CRISPR repeats, microRNAs, and more—over the past decades. It remains technically possible that there are features in the genome we’re not aware of yet, and Evo 2 has picked them out.

It’s possible to imagine ways to use the tools described here to query Evo 2 and pick out new genome features. So I’m looking forward to seeing what might ultimately come out of that sort of work.
 