About 10 years ago, Žiga Avsec was a PhD physics student who found himself taking a crash course in genomics via a university module on machine learning. He was soon working in a lab that studied rare diseases, on a project aiming to pin down the exact genetic mutation that caused an unusual mitochondrial disease.
This was, Avsec says, a “needle in a haystack” problem. There were millions of potential culprits lurking in the genetic code—DNA mutations that could wreak havoc on a person’s biology. Of particular interest were so-called missense variants: single-letter changes to genetic code that result in a different amino acid being made within a protein. Amino acids are the building blocks of proteins, and proteins are the building blocks of everything else in the body, so even small changes can have large and far-reaching effects.
There are 71 million possible missense variants in the human genome, and the average person carries more than 9,000 of them. Most are harmless, but some have been implicated in genetic diseases such as sickle cell anemia and cystic fibrosis, as well as more complex conditions like type 2 diabetes, which may be caused by a combination of small genetic changes. Avsec started asking his colleagues: “How do we know which ones are actually dangerous?” The answer: “Well largely, we don’t.”
Of the 4 million missense variants that have been spotted in humans, only 2 percent have been categorized as either pathogenic or benign, through years of painstaking and expensive research. It can take months to study the effect of a single missense variant.
Today, Google DeepMind, where Avsec is now a staff research scientist, has released a tool that can rapidly accelerate that process. AlphaMissense is a machine learning model that can analyze missense variants and predict the likelihood of them causing a disease with 90 percent accuracy—better than existing tools.
It’s built on AlphaFold, DeepMind’s groundbreaking model that predicted the structures of hundreds of millions of proteins from their amino acid composition, but it doesn’t work in the same way. Instead of making predictions about the structure of a protein, AlphaMissense operates more like a large language model such as OpenAI’s ChatGPT.
It has been trained on the language of human (and primate) biology, so it knows what normal sequences of amino acids in proteins should look like. When it’s presented with a sequence gone awry, it can take note, as with an incongruous word in a sentence. “It’s a language model but trained on protein sequences,” says Jun Cheng, who, with Avsec, is co-lead author of a paper published today in Science that announces AlphaMissense to the world. “If we substitute a word from an English sentence, a person who is familiar with English can immediately see whether these substitutions will change the meaning of the sentence or not.”
Pushmeet Kohli, DeepMind’s vice president of research, uses the analogy of a recipe book. If AlphaFold was concerned with exactly how ingredients might bind together, AlphaMissense predicts what might happen if you use the wrong ingredient entirely.
The model has assigned a “pathogenicity score” of between 0 and 1 to each of the 71 million possible missense variants, based on what it knows about the effects of other closely related mutations; the higher the score, the more likely a particular mutation is to cause or be associated with disease. DeepMind researchers worked with Genomics England, a government body that studies the growing pool of genetic data collected by the UK’s National Health Service, to verify the model’s predictions against real-world studies on already-known missense variants. The paper claims 90 percent accuracy for AlphaMissense, with 89 percent of variants classified as either likely pathogenic or likely benign.
Researchers who are trying to find out whether a particular missense variant may be behind a disease can now look it up in the table and find its predicted pathogenicity score. The hope is that, just as AlphaFold is boosting everything from drug discovery to cancer treatment, AlphaMissense will help researchers in multiple fields accelerate research into genetic variants—allowing them to diagnose diseases and find new treatments faster. “I hope that these predictions will give us an extra insight into which variants cause disease and have other applications in genomics,” says Avsec.
The researchers stress that the predictions should not be used on their own, but only to guide real-world research: AlphaMissense could help researchers prioritize the slow process of matching genetic mutations to diseases by quickly ruling out unlikely culprits. It could also help improve our understanding of overlooked areas of our genetic code: The model includes an “essentiality” metric for each gene—a measure of how vital it is to human survival. (The function of roughly a fifth of human genes isn’t clear, despite many appearing to be essential.)
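For readers who want a concrete picture of the lookup-and-triage workflow described above, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the variant names, the scores, the two cutoffs, and the function names are all hypothetical, and none of it reflects DeepMind's actual table format or the thresholds used in the paper. It only shows the general idea of looking up a 0-to-1 score, setting aside unlikely culprits, and ranking what remains for follow-up in the lab.

```python
from typing import Dict, List, Tuple

# Hypothetical scores, invented for illustration; not real AlphaMissense output.
SCORES: Dict[str, float] = {
    "GENE_A:R12C": 0.93,
    "GENE_A:A45V": 0.08,
    "GENE_B:G101S": 0.51,
}

# Assumed cutoffs chosen for this sketch, not values taken from the paper.
BENIGN_BELOW = 0.3
PATHOGENIC_ABOVE = 0.7


def classify(score: float) -> str:
    """Map a 0-to-1 pathogenicity score onto a coarse label."""
    if score < BENIGN_BELOW:
        return "likely benign"
    if score > PATHOGENIC_ABOVE:
        return "likely pathogenic"
    return "uncertain"


def prioritize(variants: List[str]) -> List[Tuple[str, float, str]]:
    """Drop likely-benign variants and rank the rest, highest score first.

    Variants missing from the table are skipped rather than guessed at.
    """
    found = [(v, SCORES[v]) for v in variants if v in SCORES]
    kept = [(v, s, classify(s)) for v, s in found if classify(s) != "likely benign"]
    return sorted(kept, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    patient_variants = ["GENE_A:A45V", "GENE_B:G101S", "GENE_A:R12C", "GENE_C:T7M"]
    for name, score, label in prioritize(patient_variants):
        print(f"{name}\t{score:.2f}\t{label}")
```

In keeping with the researchers' caution, the ranked list is a starting point for real-world study, not a diagnosis.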