January 25, 2021

Using Genetic Grammar, MIT NLP Model Examines How Viruses Escape


As the COVID-19 pandemic crosses the one-year mark, the virus’s various mutations are increasingly dominating headlines. Part of this is due to some variants’ increased infectiousness, but much of it also stems from an underlying worry: what if this mutation resists the vaccine? This phenomenon, called viral escape, is a serious problem, particularly for viruses that are more prone to mutating exactly where vaccines and antibodies target them. Now, MIT researchers are using natural language processing (NLP) models to understand how viral escape occurs.

“Viral escape is a big problem,” said Bonnie Berger, a professor of mathematics, the head of the computation and biology group in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and one of the senior authors of the paper, in an interview with MIT’s Anne Trafton. “Viral escape of the surface protein of influenza and the envelope surface protein of HIV are both highly responsible for the fact that we don’t have a universal flu vaccine, nor do we have a vaccine for HIV, both of which cause hundreds of thousands of deaths a year.”

To understand the rates at which these different viruses mutate while remaining functional, Berger and her colleagues applied NLP models, which – as the name suggests – were designed to analyze and predict linguistic patterns. In essence, the researchers treated the genetic sequences as sentences and the constraints of viral functioning as a form of genetic grammar that must be obeyed.
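The "sequences as sentences" idea can be made concrete with a toy sketch. This is not the researchers' model; it only assumes a protein sequence written with one letter per amino acid, splits it into tokens the way a sentence is split into words, and lists every single-residue change a language model could then be asked to score. The function names and the five-letter fragment are hypothetical and chosen purely for illustration.

```python
# Toy illustration only: this is NOT the MIT model, just the framing.
# Assumption: a protein is written with one letter per amino acid.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def tokenize(sequence):
    """Split a protein sequence into one token per residue (the 'words')."""
    return list(sequence.upper())


def single_mutants(sequence):
    """Yield (position, original, replacement, mutated_sequence)
    for every possible single-residue substitution."""
    tokens = tokenize(sequence)
    for i, original in enumerate(tokens):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                mutated = tokens[:i] + [replacement] + tokens[i + 1:]
                yield i, original, replacement, "".join(mutated)


fragment = "MKTAY"  # hypothetical fragment, not real viral data
mutants = list(single_mutants(fragment))
print(len(mutants))  # 5 positions x 19 alternatives = 95
```

In this framing, each mutated sequence is an edited "sentence." A trained model would then have to judge whether the edit still obeys the grammar (the virus remains functional) and whether it changes the meaning enough to go unrecognized.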

“If a virus wants to escape the human immune system, it doesn’t want to mutate itself so that it dies or can’t replicate,” explained lead author and MIT graduate student Brian Hie. “It wants to preserve fitness but disguise itself enough so that it’s undetectable by the human immune system.”

The researchers trained the model on 60,000 HIV sequences, 45,000 flu sequences and 4,000 coronavirus sequences. Then, they used the model to predict which parts of key viral proteins were more or less likely to produce escape mutations. This information is useful because it suggests which parts of those proteins – for instance, the S2 subunit of SARS-CoV-2’s spike protein – might be among the most future-proof drug targets. Moving ahead, the researchers are looking at whether the approach could also identify targets for cancer vaccines.

“There are so many opportunities, and the beautiful thing is all we need is sequence data, which is easy to produce,” said Bryan Bryson, an assistant professor of biological engineering at MIT and another of the paper’s senior authors.