Sequence modeling and design from molecular to genome scale with Evo, 2024, Nguyen et al.

Discussion in 'Other health news and research' started by SNT Gatchaman, Nov 14, 2024.

  1. SNT Gatchaman

    Sequence modeling and design from molecular to genome scale with Evo
    Eric Nguyen; Michael Poli; Matthew G. Durrant; Brian Kang; Dhruva Katrekar; David B. Li; Liam J. Bartie; Armin W. Thomas; Samuel H. King; Garyk Brixi; Jeremy Sullivan; Madelena Y. Ng; Ashley Lewis; Aaron Lou; Stefano Ermon; Stephen A. Baccus; Tina Hernandez-Boussard; Christopher Ré; Patrick D. Hsu; Brian L. Hie

    INTRODUCTION
    The fundamental instructions of life are encoded in the DNA sequences of all living organisms. Understanding these instructions could unlock deeper insights into biological processes and enable new ways to reprogram biology into useful technologies. However, even the simplest microbial genomes are incredibly complex, with millions of DNA base pairs encoding the interplay of DNA, RNA, and proteins—the three modalities of the so-called central dogma of molecular biology and the key actors in cellular function. This complexity exists at multiple scales, from individual molecules to whole genomes, representing a vast landscape of genetic information that has been functionally selected over evolutionary time.

    RATIONALE
    Rapid progress in artificial intelligence (AI) has led to large language models that demonstrate increasingly advanced multitask reasoning and generation capabilities when trained on massive amounts of data. However, technological limitations in the architecture of these models have restricted efforts to apply them to biology at a similar scale. Current approaches struggle to analyze sequences at the individual character level and are computationally demanding when applied to long sequences. An advanced model maintaining single-nucleotide resolution over large genomic sequences could potentially extract functional information about the complex molecular interactions that are embedded in the patterns of natural evolutionary variation.

    RESULTS
    In this work, we present Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scaled Evo to 7 billion parameters with a context length of 131 kilobases at single-nucleotide resolution. We report scaling laws on DNA, complementing similar observations in natural language and vision. Trained on 2.7 million prokaryotic and phage genomes, Evo demonstrates zero-shot function prediction across DNA, RNA, and protein modalities that is competitive with—or outperforms—domain-specific language models. Evo also excels at multimodal generation tasks, which we demonstrated by generating synthetic CRISPR-Cas molecular complexes and transposable systems. We experimentally validated the functional activity of Evo-generated CRISPR-Cas molecular complexes as well as IS200 and IS605 transposable systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Using information learned over whole genomes, Evo learns how small changes in nucleotide sequence affect whole-organism fitness and can generate DNA sequences with plausible genomic architecture more than 1 megabase in length.

    CONCLUSIONS
    Evo is a foundation model that is designed to capture two fundamental aspects of biology: the multimodality of the central dogma and the multiscale nature of evolution. The central dogma integrates DNA, RNA, and proteins with a unified code and predictable information flow, whereas evolution unifies the vastly different length scales of biological function represented by molecules, pathways, cells, and organisms. Evo learns both of these representations from the whole-genome sequences of millions of organisms to enable prediction and design tasks from the molecular to genome scale. Further development of large-scale biological sequence models like Evo, combined with advances in DNA synthesis and genome engineering, will accelerate our ability to engineer life.

    EDITOR’S SUMMARY
    Large language models have great potential to interpret biological sequence data. Nguyen et al. present Evo, a multimodal artificial intelligence model that can interpret and generate genomic sequences at a vast scale. The Evo architecture leverages deep learning techniques, enabling it to process long sequences efficiently. By analyzing millions of microbial genomes, Evo has developed a comprehensive understanding of life’s complex genetic code, from individual DNA bases to entire genomes. This enables the model to predict how small DNA changes affect an organism’s fitness, generate realistic genome-length sequences, and design new biological systems, including laboratory validation of synthetic CRISPR systems and IS200/IS605 transposons. Evo represents a major advancement in our capacity to comprehend and engineer biology across multiple modalities and multiple scales of complexity (see the Perspective by Theodoris).

    Link | PDF (Science)
     
  2. hotblack

    Really interesting. I remember Demis Hassabis talking about how AI/ML is well suited to problems with large search spaces (AlphaGo, AlphaFold) and how he hopes it could be the language of biology in much the same way that maths has been for physics. AlphaFold seems to be finding uses as part of pipelines before scientists build stuff in wet labs. There’s some fascinating work taking place.
     
