[0:00] This is one of the most underrated AI breakthroughs so far. These researchers made an AI that can understand the code of life, DNA. In fact, they can use this to detect cancer or other diseases, or it can even predict the effects of genetic mutations, and it can even generate the complete DNA of a living thing. The implications of this are wild. This technology could help scientists engineer better crops and unlock new solutions for energy and food security. And this is also great for personalized medicine. We could even use [0:32] this to design entirely new species or even make ourselves better. In this video, I'm going to break down exactly how they did this, how it works, and some insane findings from this project. Now, this is a very technical paper, but as always, I'm going to break this down into simple terms so that anyone can understand. Let's jump right in. Now, to understand what they did in this paper, let's start with the basics. You see, current AI models that we use today, like ChatGPT and Gemini, are large language models. They're designed to be [1:03] really good at understanding and outputting natural language. That's why they can generate a beautiful poem or a flawless essay or even do a full medical research report for you. But here's the interesting question. Besides the natural language that you and I use to communicate, you know what else is similar? DNA, or the language of life. What if we took the same logic from these large language models and applied it to train an AI that can understand and generate DNA? Well, that's exactly what [1:34] these researchers did. So, the paper is called "Genome modeling and design across all domains of life with Evo 2." And this was published recently in Nature, which is like the most prestigious scientific journal in the world. Now, before we dive into why this is so important, we first need to understand what DNA is. You see, DNA is like the instruction manual for life.
Inside almost every cell in every living thing, there's DNA. And it stores the information needed to build and run the organism. Now, DNA is [2:06] made of only four letters: G, C, A, and T. And these letters bond in pairs. So, G bonds with C and A bonds with T, and vice versa. This bonding creates two strands like this, which wind together in a spiral or double helix shape like this. So this is the classic DNA shape that you might be familiar with. Now, the sequence of these letters is what contains the instructions of life. It encodes thousands of traits and biological processes. For example, for [2:37] humans this includes eye color, height, risk of disease, our immune system, our metabolism, and more. It's the blueprint for how living things function and stay alive. All right. Now, back to this paper here. They created this Evo 2 biological foundation model, trained on 9 trillion DNA base pairs. So, this is actually very similar to large language models like ChatGPT. But instead of being trained on all the data from the internet, Evo 2 was trained on 9 [3:08] trillion DNA base pairs. Specifically, they trained it on a data set called OpenGenome2. And this data set essentially takes millions of diverse living things from across the entire spectrum of life and puts their DNA into a massive digital library. So, as you can see, this includes everything from bacteria, plants, fungi, and animals, including of course humans. The total size of this is about 9 trillion DNA base pairs. So afterwards they used all this DNA to train an AI model. And here's the [3:40] technical detail that really stands out here. It says that this Evo 2 model has a million-token context window with single-nucleotide resolution. In simple terms, that means the AI can hold 1 million DNA letters in its prompt or working memory at once. Now, why would we need that? After all, a single gene might only be a few thousand letters long. Shouldn't that be enough?
The problem is that biology almost never works in isolation. Here's an example. You see, a gene, which is just a section of DNA, might [4:12] contain the instructions to do something, but whether that gene is actually activated, how strongly it's executed, or when it turns on often depends on other regions of DNA called regulatory elements. And these elements aren't always nearby. They might be hundreds of thousands of letters away along the DNA strand. So the instructions and the controls can be separated by enormous distances. If the model's context window is small, say only a few hundred thousand tokens, it might not be able to capture all the [4:43] relevant parts of a DNA strand. Because this new Evo 2 has a much larger million-token context window, it can understand context across DNA a lot better. Now, holding that much information is one thing, but actually being able to understand, remember, and analyze all this data is something else. To prove their system could actually handle a million DNA letters at once, the researchers ran a simple but brutal test, something called a needle-in-a-haystack test. It's a very standard test in AI, and basically they generated [5:14] a completely random and meaningless sequence of 1 million DNA letters. That's the haystack. Then they took a very specific 100-letter sequence and hid it somewhere inside this massive random jumble of letters. This 100-letter sequence is the needle. And they asked Evo 2 to find it. And in short, Evo 2 was able to find it perfectly. This shows that it's not just skimming the sequence or forgetting stuff in this 1-million-letter sequence. It's actually reading and retaining the entire thing. So, at [5:46] this point, we know a few things. We know this model was trained on 9 trillion DNA letters across the entire spectrum of life, from extreme bacteria to human cells. We know that it can remember what it reads even if it's given a million letters. But here's the deeper question.
Can it actually recognize patterns? Can it actually understand the language of DNA? You see, when Evo 2 was trained, it wasn't given any labels or answers. It was never told that this particular sequence of DNA triggers a disease, or that this one is a [6:17] mutation that could kill the cell, or that this one, for example, controls leaf color. The model never received these labels or answers. It simply read raw, unlabeled DNA like this. So how can we tell whether it actually understands DNA? This is where a concept called zero-shot prediction comes in. Zero-shot means the AI can answer something without being explicitly trained on it. But that raises an obvious question. If the model was never taught what a disease is, how could it possibly know [6:48] that a certain snippet of DNA would cause disease? The answer lies in evolution. If a genetic sequence is absolutely essential for life, for example, the proteins that allow a cell to generate energy, evolution tends to keep that sequence from changing. You see, all the living things in this data set are alive because their DNA works and is functional. If a mutation messed up the DNA in a way that could cause disease or death, well, that organism dies and that mutation wouldn't appear in this gene pool. So over millions of years of [7:19] evolution, this process leaves a clear signal in the data. When Evo 2 reads these 9 trillion DNA letters from across the entire spectrum of life, it begins to notice these patterns. If it notices the exact same sequence of DNA in many different species, that signals that this sequence is probably very important. And conversely, if it has never seen a sequence before anywhere in this spectrum of life, it might mean that the sequence could be harmful or cause death. Anyways, the researchers had to prove this. They had to show that Evo 2 [7:52] actually understood the language of DNA. So to test it, they took a real DNA sequence and changed one letter and then asked the model a simple question.
Based on your understanding, how likely is it that this DNA exists in nature? If Evo 2 assigns an extremely low probability, the model is essentially raising a red flag. It's saying that this mutation is probably harmful because it wouldn't exist in nature. Well, when researchers tested Evo 2 this way, they found some incredible things. The model was [8:23] recognizing very specific biological signals without ever being taught them. For example, it correctly identified mutations in start codons and stop codons. To understand why this matters, we need to briefly look at how cells read DNA. You see, DNA is interpreted in groups of three letters called codons. Now, a start codon, usually the sequence ATG, tells the cell's machinery where to begin building a protein. It basically means start here. A stop codon does the opposite. It tells the cell when to stop [8:56] building. So, if you mutate the start codon, in other words, if you mess up the letters, the protein is never produced. And if you mutate the stop codon, then the cell just keeps working and working far past the intended endpoint. It just produces something that often does not work. So when scientists added mutations to either the start or stop codons, Evo 2 flagged these mutations as highly destructive, which is correct. And it was even able to understand more subtle things. So here's another insane [9:27] finding. The researchers found that the model could identify things like the Shine-Dalgarno sequence and the Kozak sequence. Now, what on earth are these? Think of them as landing pads for the cell's protein-building machinery. You see, before a protein can be made, something called a ribosome has to attach at exactly the right location. And if it docks in the wrong place, the entire protein might not be built correctly or even at all. So the cell also has to encode this landing pad region. And for bacteria, this [9:59] region is called the Shine-Dalgarno sequence.
And then for more complex organisms such as humans and plants, this sequence is called the Kozak sequence. These sequences sit right next to the start of the gene and guide the ribosome to the correct starting point. Without them, the ribosome might simply drift past and never latch on to make a protein. What's remarkable is that the AI also figured out these landing pads purely from context. If the scientists added mutations in these areas, the AI [10:29] was able to successfully flag them as highly damaging. And there's more. It also figured out the difference between what is called a synonymous mutation and a frameshift mutation. Let's explain these in simple terms. So, a synonymous mutation is when you change a single letter of DNA, but because of the redundancy of the genetic code, the changed codon still spells out the same amino acid, so it doesn't really mess anything up. So, when Evo 2 was shown these mutations, it recognized them and said, "This is fine. It's a high likelihood of survival." But a [11:01] frameshift mutation is different. It's catastrophic. You see, DNA is read in chunks of three. So if you insert or delete just one single letter, every single three-letter chunk downstream is shifted by one. So the entire rest of the sequence becomes messed up. Well, when Evo 2 was shown these types of mutations, it was also able to correctly say that these would be severely damaging. So as you can see, Evo 2 didn't just memorize letters. It actually reverse engineered how DNA works without ever being taught [11:33] anything. And it gets even crazier. The researchers did an even more challenging test called the ciliate code test. You see, for most life on Earth, the sequence TGA is a stop codon, a near-universal stop sign that tells the cell to stop building a protein. But biology loves exceptions. And there's a group of organisms called ciliates that evolved a slightly different genetic code. In their DNA, TGA doesn't mean stop.
It actually means keep going, just for them. Previous AI models that were trained on [12:04] DNA would completely fail this test. If they were given ciliate DNA and asked about TGA, the AI models would all say that this is a stop sign, which is not correct. But the remarkable thing about Evo 2 is that, because it has a million-token context window and because it can understand DNA across all of life, it's able to inspect the surrounding ciliate DNA first. And given this context, it's able to determine that TGA inside this DNA was not a stop codon. Note that the researchers never told it that this DNA [12:36] belongs to an organism called a ciliate. The AI also never learned what a ciliate even is. Yet, it inferred this from its understanding of the grammar of DNA. Now, so far we've talked about tests on bacteria and other microorganisms. But if this model truly understands the language of life, then in theory, it should also work with human DNA, right? And that's exactly what the researchers tested next. I've been covering the best image and video [13:06] generators, and it can get very overwhelming. Fortunately, Higgsfield, the sponsor of this video, brings everything together all in one place, from leading image and video generation models to AI cinematic workflows. The idea here is to bring more of a filmmaking workflow to AI video creation. Instead of just typing a prompt and hoping for the best, you actually follow a creative pipeline. Cast your actor, build the scene, and direct the shot. It starts with Soul Cast, which is Higgsfield's AI actor [13:37] creation system. Instead of prompting random faces every time, you can design fully customizable characters that stay visually consistent across your scenes. You can define their genre, age, identity, physical features, clothing style, and more. Next comes Soul Cinema, which generates cinematic-grade images to build your scenes.
The model focuses heavily on lighting, composition, and texture, so the outputs look like films instead of AI slop. Once your characters and scenes are ready, you can move [14:07] everything into Sue Cinema Studio 2, which is where the actual video direction happens. This works like a miniature film studio, where you can create multi-shot sequences, control the speed and the camera motion, or even define start and end frames, and the system generates the intermediate motion automatically, essentially filling in the animation. All of this makes Higgsfield one of the first platforms to truly connect characters, cinematic image generation, and video direction in one seamless workflow. Whether you're making short films, [14:39] cinematic ads, music videos, or any other content, the entire process can happen inside one tool. If you want to try it yourself, check out Higgsfield using the link in the description below. This brings us to the third major section of the paper, called human variant effect prediction. And this could directly affect real human patients. Here's what they did. The researchers took a huge genetics data set called ClinVar, which is basically a database where human DNA variants are recorded and classified by their clinical significance. For [15:11] example, some DNA variants are labeled pathogenic, meaning they cause disease, and others are labeled benign, meaning they have no measurable impact on health. Now, the researchers specifically targeted the BRCA genes from this data set. And these are very important genes, which you might have heard of before, because certain mutations within these genes can drastically increase a person's risk of breast and ovarian cancer. Imagine you're a doctor and you have the DNA report of your patient and you're looking through the BRCA genes to see if [15:43] there are any mutations. A small, subtle change could be completely harmless, or it could mean cancer, which is obviously life-threatening.
But it's really hard for a human to actually figure out which mutations are okay and which are not. You see, every human has thousands of genetic variations. And in a sense, we all have different mutations. The overwhelming majority of the time, these mutations don't actually cause anything. But the trick is to identify exactly which mutations actually increase the risk of cancer, which is a lot easier [16:15] said than done. So the researchers gave Evo 2 this data and asked it to classify which variants are okay and which ones are damaging. Again, note that the model has never been trained on human medical records. It was never given any answers on which genes cause cancer and which don't. It doesn't even know what cancer is. Yet, the results were shocking. The model was able to tell apart the mutations that don't do anything and the mutations that cause cancer. So, it turns out Evo 2 is also a really good [16:45] detector of cancer-linked mutations just from looking at someone's DNA. Now, up to this point, it's all analytical. It's able to analyze and interpret DNA. But can it generate new functional DNA from scratch? Not just a single protein or a tiny gene, but the entire continuous DNA of a living thing? Well, they started with human mitochondria. If you're not familiar, mitochondria are the parts of your cell that generate energy. Think of them as the powerhouse of your cell. Now, it turns out that mitochondria have their own DNA that's [17:15] separate from the rest of our cell's DNA. And mitochondrial DNA is way smaller. It's only roughly 16,000 letters. So, it's tiny and manageable, but still quite complex. Now, what the researchers did was give Evo 2 just the first few letters of a human mitochondrial genome and then ask it to basically finish this blueprint. It's like giving an AI chatbot the first sentence of an essay and then telling it to finish the essay.
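To make the "finish the essay" idea concrete, here is a minimal sketch of prompt completion on DNA. It is not Evo 2: a toy Markov model that predicts the next letter from the previous three letters stands in for the real network, and the names `train_markov` and `complete`, the training string, and the prompt are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_markov(sequence: str, k: int = 3) -> dict:
    """Count which base follows each k-letter context in the training DNA."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - k):
        counts[sequence[i:i + k]][sequence[i + k]] += 1
    return counts


def complete(prompt: str, counts: dict, length: int, k: int = 3, seed: int = 0) -> str:
    """Extend the prompt one base at a time until it reaches `length` letters."""
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    out = prompt
    while len(out) < length:
        options = counts.get(out[-k:])
        if not options:
            # Context never seen in training: fall back to a uniform guess.
            out += rng.choice("ACGT")
        else:
            bases, weights = zip(*sorted(options.items()))
            out += rng.choices(bases, weights=weights)[0]
    return out


# Toy stand-in for training data (assumption: not real genomic DNA).
training_dna = "ATGGCGTACGTTAGC" * 20
model = train_markov(training_dna)
print(complete("ATGGCG", model, length=60))
```

The loop is the same shape as what a DNA language model does at generation time: look at what has been written so far, pick a likely next letter, append it, and repeat. The difference is scale, since Evo 2 conditions on up to a million previous letters rather than three.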
And lo and behold, Evo 2 was able to successfully generate [17:45] a completely new mitochondrial sequence of roughly 16,000 letters from scratch. Now, generating this DNA is one thing. How do we know if this DNA is actually legit? How do we know if it's biologically viable? So, the researchers used some external validation tools to assess this generation. They started with a tool called MitoZ, which is well-established software for analyzing mitochondrial DNA. And the results were astonishing. MitoZ confirmed that [18:17] Evo 2's made-up mitochondrial DNA actually contained the correct number of protein-coding genes as well as tRNA genes and rRNA genes, as expected in real human mitochondria. Now, what exactly do all of these terms mean? Let's break this down. So DNA is, as you know, the master blueprint of life. It contains instructions that tell a cell how to build and operate itself. Now, part of these instructions is how to build proteins, which are like the [18:47] machines and tools that a cell uses to function. Now, in order to create these proteins, DNA also needs to encode things called ribosomal RNAs, or rRNAs, which are like the factory workers. These are in charge of actually assembling the proteins. And then there are transfer RNAs, or tRNAs, which you can think of as delivery trucks that bring the right raw materials to the assembly line. Each tRNA carries a specific amino acid and places it precisely where it needs to go according to the [19:18] instructions. So DNA basically needs to encode everything for this factory to work. It needs to encode what to create. It needs to encode the delivery system. Plus, it needs to encode the factory workers. And all three of these things are crucial for an organism to function and stay alive. Now, what's remarkable about Evo 2's generation of this imaginary mitochondrial genome is that it includes all of these features.
It includes the code for which proteins to build, the workers that build these proteins, and the delivery [19:50] system. Evo 2 was able to generate a fully working DNA for mitochondria. But the researchers didn't stop there. To make sure these proteins and everything else actually work, they next fed this generation through AlphaFold 3. This is the legendary AI from Google DeepMind that predicts how proteins fold into 3D shapes. By the way, let me know in the comments below if you want me to make an explainer video on AlphaFold. I might do one if there is enough interest. Anyways, in short, AlphaFold 3 also [20:20] predicted that Evo 2's generated proteins fold correctly. And what's even more important is that these different generated proteins physically interlocked with each other like pieces of a puzzle, exactly as they need to do in real human cells in order to produce energy. So this suggests that this DNA, which was just generated by Evo 2, works just like in real mitochondria. Okay, so we've seen that Evo 2 can generate the DNA of this organelle. What if we step it up a notch? [20:52] Can we also get it to generate the complete DNA of a living thing? Well, next the researchers got it to generate the full genome of a bacterium called Mycoplasma genitalium, which looks like this for your reference. Now, the DNA of this is roughly 580,000 letters. And again, the researchers gave it just the first few letters of this DNA and asked it to generate the rest. And in short, it also succeeded. It was able to generate a complete and fully functional genome that resembles [21:23] this bacterium in real life. Now, instead of just bacteria, what if we got it to generate something even more complex? So, next, the scientists asked Evo 2 to generate the entire genome of this other organism called Saccharomyces cerevisiae. This is basically a yeast, which is very different from bacteria.
In fact, this organism is more closely related to humans than it is to bacteria. And again, they just gave it the first few letters of its DNA and then asked it to generate the entire thing. And lo and behold, it was able to [21:55] successfully generate a fully working DNA of this yeast, complete with the protein designs, the factory workers, delivery systems, and everything else it needs to function. So, this shows that Evo 2 can fluently generate DNA for multiple different types of living things, spanning organelles to bacteria to even yeast. Now, you might have a burning question in your head. If Evo 2 can generate the DNA of existing stuff, can we get it to generate a completely new species? What if we got it to [22:26] generate a new type of mouse or even generate Pokémon? Or even crazier, what if we got it to generate a new type of human? Well, this raises a ton of different bioethics and biosecurity concerns. So, at least for now, the researchers weren't crazy enough to proceed with that. But speaking of security, here's another concern. If this can generate anything, couldn't we prompt it to generate a highly contagious human virus? Well, the good news is the researchers also thought of this before they trained the model. So, they made a very important [22:59] safety decision right from the start. You see, when they fed it this huge DNA data set of 9 trillion base pairs, they intentionally excluded any sequences of eukaryotic viruses, meaning any virus that can infect humans, animals, or plants. So, this includes the most dangerous pathogens listed in the USDA select agents and toxins list. These are really contagious things that can kill a ton of people. So, all this data was excluded from its data set. In other [23:30] words, the AI was never allowed to learn what these viruses even looked like. It's like if you were being trained on a human language, but all the swear words were omitted. You would have no idea what a swear word even is.
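The exclusion step described above boils down to filtering the training set by what a sequence is and what it infects. Here is a minimal sketch of that idea, assuming each record carries a `kind` label and a `host_domain` label. Those field names, the record class, and the example entries are hypothetical and are not taken from the actual OpenGenome2 data set or the Evo 2 code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    name: str
    kind: str         # hypothetical label: "virus", "bacterium", "fungus", ...
    host_domain: str  # hypothetical label for viruses: "eukaryote" or "prokaryote"
    sequence: str


def is_excluded(record: GenomeRecord) -> bool:
    """True for viruses whose hosts are eukaryotes (humans, animals, plants)."""
    return record.kind == "virus" and record.host_domain == "eukaryote"


def build_training_set(records):
    """Keep every record except the excluded eukaryotic viruses."""
    return [r for r in records if not is_excluded(r)]


# Made-up records for illustration only.
records = [
    GenomeRecord("toy_phage", "virus", "prokaryote", "ACGT"),
    GenomeRecord("toy_human_virus", "virus", "eukaryote", "GGCC"),
    GenomeRecord("toy_yeast", "fungus", "", "ATGC"),
]
print([r.name for r in build_training_set(records)])
```

In this sketch the virus that infects bacteria stays in and the one that infects eukaryotes is dropped, which mirrors the rule as stated: the model never gets to read the excluded category at all.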
Now, excluding this training data is one thing. How do they verify that this safety filter actually works, that the model can't generate viruses? So, they used a really standard AI metric called perplexity. No, it's not this Perplexity. Perplexity is actually a metric that measures how confused a [24:01] model is when it encounters something that it hasn't seen before. So after training Evo 2, the researchers also gave it the DNA of some human viruses from this list, and basically its perplexity soared. It was really confused. It had no understanding of this, no internal framework to make sense of what these sequences are. Now, the researchers went a step further. They also tried to get it to generate the DNA of some dangerous human viruses to see what it would do. Thankfully, [24:32] the model failed completely. Now, it still generated something, but because it didn't have this virus data during training, it just ended up hallucinating and generating gibberish. It was not able to successfully generate the DNA of a legit human virus. Thank goodness for that. Now, the really cool thing is they've actually open-sourced everything. So on this GitHub repo, which I'll link to in the description below, you can actually download and use Evo 2. So here it contains all the instructions on how to set this up. Now, in addition to just [25:04] releasing the models, they've also released the data set, excluding human viruses, plus the training and fine-tuning code. So everything is here. If you have enough compute and infrastructure at home, you could potentially download everything and train or fine-tune your own AI model that understands and can generate DNA. Now, if I click on Hugging Face for the 40 billion parameter model, it is 82 GB in size. You'll need quite a high-end GPU to run this. Anyways, back to Evo 2. This actually opens up a ton of [25:35] possibilities, and this is just the beginning.
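Before moving on, the perplexity metric mentioned above is easy to pin down. It is the exponential of the average negative log-probability the model assigns to each letter it has to predict, so a model guessing uniformly among four DNA letters scores exactly 4, and lower means less confused. The sketch below computes it with a toy two-letter-context model standing in for Evo 2. The function names, the add-one smoothing, and the example sequences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_counts(sequence: str, k: int = 2) -> dict:
    """Count which base follows each k-letter context in the training DNA."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - k):
        counts[sequence[i:i + k]][sequence[i + k]] += 1
    return counts


def perplexity(sequence: str, counts: dict, k: int = 2) -> float:
    """exp of the average negative log-probability per predicted base."""
    if len(sequence) <= k:
        raise ValueError("sequence must be longer than the context size k")
    total_log_prob = 0.0
    predicted = 0
    for i in range(k, len(sequence)):
        context = counts.get(sequence[i - k:i], Counter())
        # Add-one smoothing over the 4 bases so unseen letters keep p > 0.
        p = (context[sequence[i]] + 1) / (sum(context.values()) + 4)
        total_log_prob += math.log(p)
        predicted += 1
    return math.exp(-total_log_prob / predicted)


toy_model = train_counts("ACGT" * 50)                    # what the toy model has "read"
familiar = perplexity("ACGT" * 10, toy_model)            # same pattern as training
unfamiliar = perplexity("AAAACCCCGGGGTTTT" * 3, toy_model)  # pattern it never saw
print(familiar, unfamiliar)
```

The familiar sequence scores close to 1 and the unfamiliar one scores far higher, which is the same kind of gap the researchers looked for when they showed the trained model sequences from the category it had never read.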
You see, if we have an AI that can understand DNA and also generate new DNA, we could use it to design better genetically modified crops, to create plants that are more nutritious, or maybe more resilient, or more efficient at producing energy for biofuels. The implications for food security and sustainable energy are huge. But it doesn't just stop at plants. If we were to really stretch this out, maybe we could also get this to analyze the human genome at a level [26:07] much deeper than we currently understand. Maybe it can help us figure out which parts of our DNA we can tweak to make us better. For example, changing a certain gene could make us more productive. Changing another gene could make us stronger or smarter, or changing another gene could make us immune to a certain disease. And this is also great for personalized medicine. If you have a patient with a disease or something that doctors cannot diagnose, you can just extract their DNA and then run it through this AI to potentially figure out what's going on. Each person's DNA [26:38] is slightly different. So, we can use this AI to help prescribe bespoke, personalized healthcare. Or here's another crazy thing. If you're expecting a baby, maybe you can take its DNA and run it through this AI to detect a ton of things, for example, your baby's hair color, eye color, athleticism, risk of disease, even personality, and a ton of other stuff, even before it's born. Now, of course, all of this comes with enormous ethical and societal questions. Just because we can do this doesn't mean [27:10] we should. In fact, open-sourcing this model so that anyone can play around with it is quite risky, to be honest. I want to bring up the famous Spider-Man quote, "With great power comes great responsibility."
Even though this model wasn't trained on any human virus data, because the model plus the training code plus the data set were all released, maybe someone could potentially take this model and retrain it on a ton of human virus DNA data and create a model that can generate a pandemic beyond our imagination. Anyways, that was quite the deep dive. [27:42] This paper is just jam-packed with stuff, and there's a ton of technical things that I didn't even touch on, but that's basically the gist of this paper. Hope I was able to make it easy enough for you to understand and also show you how mind-blowing this AI is. Let me know in the comments what you think of all of this. As always, I will be on the lookout for the top AI news and tools to share with you. So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content. Also, there's just so much [28:12] happening in the world of AI every week. I can't possibly cover everything on my YouTube channel. So, to really stay up to date with all that's going on in AI, be sure to subscribe to my free weekly newsletter. The link to that will be in the description below. Thanks for watching and I'll see you in the next one.