A new biological foundation model, Evo 2, launched today as a collaboration between the Arc Institute and Nvidia. The model was trained on the DNA sequences of over 100,000 species, enabling it to identify patterns in gene sequences that would take researchers years to discover. Evo 2 excels at pinpointing disease-causing mutations in human genes and can design genomes comparable in length to those of simple bacteria.
Developed by scientists from Nvidia and the Arc Institute, a nonprofit based in Palo Alto, Evo 2 is fully open-source: its code, training data, and model weights are publicly available on GitHub, making it one of the largest open AI models of its kind. A user-friendly interface called Evo Designer is being introduced alongside the model.
Evo 2 builds on its predecessor, Evo 1, expanding the training data significantly to 9.3 trillion nucleotides sourced from diverse genomes, including those of humans, plants, bacteria, and archaea. The model has demonstrated high accuracy in analyzing mutations linked to breast cancer: in tests on variants of the BRCA1 gene, it predicted with over 90% accuracy which mutations are benign and which are potentially pathogenic. Potential applications include designing targeted gene therapies that activate only in specific cell types, which could reduce side effects.
With ethics and safety in mind, the researchers excluded pathogens that infect humans and other complex organisms from the training data. The creators see Evo 2 as a foundational tool, akin to an operating system that can support a variety of specialized AI applications in biological research.