Augmented AI for future disease surveillance: how AI can help prevent future pandemics

Fons Klein Tuente | Life Sciences & Health

Emerging pathogens are a major threat to public health. Over the past decades, many scientists have warned about conditions that could lead to the introduction of new, potentially devastating pathogens. Despite those warnings, it is clear that humans failed to respond adequately to the pandemic that started in China in December 2019 and had, at the time of writing, infected 4 million people worldwide. The pandemic could have been curbed earlier if policy makers had been more willing to accept advice from AI. On the last day of 2019, the AI-driven Canadian start-up BlueDot picked up on an unusual cluster of pneumonia cases in Wuhan and flagged it as potentially dangerous, nine days before the World Health Organization released its statement on the novel coronavirus. AI processes data faster than any human being and can therefore assist human decision making. Augmented AI focuses on AI’s assistive role, emphasizing that cognitive technology is designed to enhance human intelligence rather than replace it. Augmented AI can assist with many aspects of disease, such as prediction, diagnosis and treatment. But it could also, and therefore should, play a crucial role in disease surveillance and help save thousands of lives. This article extrapolates from current technologies and analyzes the potential of augmented AI in the future of disease surveillance.

Basics on viruses

Viruses invade a cell and take over its molecular machinery, causing it to make new viruses. The process is quick and sloppy. As a result, new viruses can gain a new mutation that wasn’t present in their ancestor. If a new virus manages to escape its host and infect other people, its descendants will inherit that mutation. In order to track the mutations of the virus, scientists map the genetic material in a virus, which is its genome. The mapping is called sequencing. Once researchers have gathered the genomes from a number of virus samples, they can compare their mutations. By using sophisticated computer programs, the evolutionary history of the virus can be determined. This information can help us answer questions like, “How fast do mutations occur?” or “Where in the genome do mutations occur?”.
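The comparison step can be illustrated with a minimal Python sketch. It assumes the two genomes have already been aligned to the same length (real pipelines need an alignment step first), and the function name and toy sequences are invented for illustration only.

```python
from typing import List, Tuple


def find_mutations(ancestor: str, descendant: str) -> List[Tuple[int, str, str]]:
    """Return (position, old_base, new_base) for every substitution.

    Assumes both sequences are already aligned and of equal length;
    insertions and deletions are not handled in this sketch.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, d)  # 1-based positions, as is common in genomics
        for i, (a, d) in enumerate(zip(ancestor, descendant))
        if a != d
    ]


# Toy sequences, not real viral genomes.
ancestor = "ATGGCTAGCTAA"
descendant = "ATGGTTAGCTGA"

mutations = find_mutations(ancestor, descendant)
print(mutations)  # [(5, 'C', 'T'), (11, 'A', 'G')]
```

Running the same comparison over many sampled genomes, together with their collection dates, is roughly what lets researchers estimate how fast and where mutations occur.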

Why does this matter?

Bioinformaticians can use this data not only to predict a possible emergence, but also to predict its progress and analyze possible risks. Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. It combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret that data. With machine-learning methods, this data can help us answer important questions at a much faster pace than we have managed up to now. To this day, most genomes are still compared manually. Understanding how a pathogen can infect its host takes a lot of work: a combination of lab work and computational work that relies on visualization and human judgement, a process that can take weeks or months. With the machine learning algorithms developed over the past few years, we are able to discover the genetic changes behind emerging pathogens in seconds. This gives us the potential to study outbreaks in real time and thus rapidly inform public health strategies to control or prevent disease.

How much data do we have?

The amount of data on pathogens is growing thanks to technical improvements in analysis tools. In contrast to the past, when only a limited number of pathogens were sequenced manually, genomic data sets are now available that were generated from routine pathogen surveillance for epidemiological purposes. But to fully utilize this data and the machine learning models available, more data and consistent whole-genome sequencing (WGS) are needed worldwide. Ideally, we would sequence a wide range of pathogens that cause infectious diseases. If full genomes of these pathogens were available, we would be able to answer many important questions that would transform the way we control epidemics. Three highly important questions that could be answered are:

1. Which hosts can a pathogen infect?

Three coronaviruses have crossed the species barrier to cause deadly pneumonia in humans since the beginning of the 21st century. For a virus to be able to jump from an animal to a human host, it needs certain traits encoded in its genome. These traits do not consist of a fixed set of genes, but involve multiple combinations that vary among different types of pathogens and strains. Much is still unknown about the interaction between genes and hosts, and therefore more data on pathogens and their hosts should be mapped. AI models can then be used to help us understand how certain sets of genes relate to the ability to bind to certain hosts. This information would be helpful when a new pathogen or strain emerges and decisions to curb infections have to be made right away. Even though the data is not yet sufficient, the machine learning algorithms needed to do this are already in use today.

Pathogens cause infectious diseases, of which foodborne illnesses are a common example. Food poisoning claims about 420,000 lives worldwide every year. When multiple people report food poisoning in a specific area at a specific time, an outbreak is suspected. Salmonella is a pathogen for which a lot of data exists thanks to extensive research over recent years. Salmonella has several hosts, as it can come from different food sources, ranging from animal products to different kinds of greens. Machine learning models are used to distinguish the different strains of the bacteria and their hosts. Understanding the relationship between the genes of the bacteria and the hosts they can affect helps us assign a source to possible future mutated strains. By sequencing the bacteria from one of the patients in an outbreak, an algorithm can predict which source the food poisoning is likely to have come from, even if this new mutated form of the bacteria has never been documented before.
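To give a feel for how such source attribution could work, here is a deliberately simple Python sketch. It is not the model used in the research described above: it assumes a nearest-centroid classifier on short subsequence (k-mer) counts, and the source labels, function names and sequences are all hypothetical toy values.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(labelled: Dict[str, List[str]], k: int = 3) -> Dict[str, Counter]:
    """Build one summed k-mer profile (a centroid) per known source."""
    centroids: Dict[str, Counter] = {}
    for source, seqs in labelled.items():
        total: Counter = Counter()
        for s in seqs:
            total.update(kmer_profile(s, k))
        centroids[source] = total
    return centroids


def predict_source(seq: str, centroids: Dict[str, Counter], k: int = 3) -> str:
    """Return the source whose centroid is most similar to the new isolate."""
    profile = kmer_profile(seq, k)
    return max(centroids, key=lambda src: cosine(profile, centroids[src]))


# Hypothetical training data: toy sequences labelled by food source.
training = {
    "poultry": ["AAAAAACCAAAA", "AAACAAAAAAAC"],
    "leafy greens": ["GGGGTGGGGGGT", "GTGGGGGGTGGG"],
}
centroids = train(training)

# A new isolate that matches neither training sequence exactly.
print(predict_source("AAAAACAAAAGA", centroids))  # poultry
```

The point of the sketch is that the new isolate does not have to match any known strain exactly; it only has to resemble the strains from one source more than those from the others.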

These models can be applied to viruses as well. A virus enters its host by binding to proteins on the surface of the host cells. The ability of the virus to bind, and the type of protein it binds to, are the most important determinants of the range of hosts it can infect and of how infectious the virus is. Animals and humans do not have all proteins in common. That is why a certain pathogen might infect pigs, but not humans. To identify mutations that transform a pig virus into a human one, scientists traditionally compared the protein sequences of virus strains before and after they developed the ability to infect people. Investing in complete and clean data sets and in algorithms can scale up this process from analyzing only tens of viruses at a time to tens of thousands. Once the process is sped up, the work can facilitate surveillance of new sources of epidemics, predict species ranges for these viruses and identify potential animal reservoirs for infection.
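The traditional before-and-after comparison can be sketched in a few lines of Python. The sketch assumes the protein sequences are already aligned, and it simply reports positions where all pig-only strains share one amino acid and all human-capable strains share a different one. The function name and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def discriminating_positions(
    group_a: List[str], group_b: List[str]
) -> List[Tuple[int, str, str]]:
    """Find aligned positions where each group is uniform but the groups differ.

    Returns (position, residue_in_group_a, residue_in_group_b), 1-based.
    """
    length = len(group_a[0])
    if any(len(s) != length for s in group_a + group_b):
        raise ValueError("all sequences must be aligned to the same length")
    hits = []
    for i in range(length):
        residues_a = {s[i] for s in group_a}
        residues_b = {s[i] for s in group_b}
        # Keep only columns that separate the two groups cleanly.
        if len(residues_a) == 1 and len(residues_b) == 1 and residues_a != residues_b:
            hits.append((i + 1, residues_a.pop(), residues_b.pop()))
    return hits


# Toy protein fragments, invented for illustration.
pig_only = ["MKTAQLVN", "MKTAQLVD", "MKSAQLVN"]
human_capable = ["MKTAKLVN", "MKSAKLVN"]

print(discriminating_positions(pig_only, human_capable))  # [(5, 'Q', 'K')]
```

Positions 3 and 8 vary within a group and are therefore ignored; only position 5 separates the two groups. With tens of thousands of sequences, a strict rule like this becomes too brittle, which is where statistical and machine learning models come in.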

The new coronavirus

Not much is known about the new coronaviruses that are discovered every year, and we therefore do not know whether these viruses have the potential to spread to humans. However, we do know how SARS infects humans, as much research has been done since the outbreak in 2002. Researchers discovered that the virus binds to a human cell protein called ACE2. ACE2 is an enzyme attached to the outer surface of cells in our lungs, arteries, heart, kidneys and intestines, and it is important for controlling our blood pressure. Once the virus binds to ACE2, the enzyme is moved inside our cells and broken down, which is thought to contribute to the lung damage we see with coronavirus infections. This cell entry allows the virus to hijack our cells’ machinery to make more of itself. Most viruses that cause the common cold don’t bind to this enzyme, which is why they don’t cause the lower respiratory tract infections or the pneumonia seen with SARS.

As soon as the first genome sequence of SARS-CoV-2, the virus that causes COVID-19, became public, researchers compared it with that of SARS to analyze the differences and assess how these would change the binding. Through this work, they were able to show that ACE2 is very likely the route of entry of this new virus into our cells, and they were able to comment on how strongly the virus binds to our receptors, as well as to those of other animals.

Knowing that this “ACE2 route” exists allows us to predict which other viruses could infect humans via the same route, which viruses are a few mutations away from being able to do so, and which animal hosts may be carrying human epidemic threats.

2. How dangerous is this pathogen?

Let’s return to our Salmonella example. Salmonella includes many different types that vary in the severity of the disease they cause. Some types cause food poisoning, whereas others spread beyond the gut and cause severe typhoid fever or can evolve to cause bloodstream infections. To understand the genetic changes that determine whether an emerging strain of Salmonella will cause food poisoning or a more severe infection, researchers built a machine learning model that analyses which mutations play an important role. The model identified almost 200 genes that are involved in determining the type of infection. Because sufficient data is available on this bacterium, models can identify which emerging strains could become a public health concern. This is a great step forward in the surveillance of dangerous bacteria at a global scale.
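A much simpler stand-in for such a model shows the underlying idea of ranking genes by how strongly they are associated with one type of infection. The Python sketch below is not the researchers’ method: it assumes each strain is described by the set of genes that carry a mutation, and it scores each gene by the difference in how often it is mutated in invasive versus gut-restricted strains. All gene names and data are hypothetical.

```python
from typing import List, Set, Tuple


def rank_genes(
    invasive: List[Set[str]], gut_restricted: List[Set[str]]
) -> List[Tuple[str, float]]:
    """Rank genes by the difference in mutation frequency between two groups.

    A score near +1 means the gene is mutated mostly in invasive strains,
    near -1 mostly in gut-restricted strains, and near 0 in both alike.
    """
    genes = set().union(*invasive, *gut_restricted)
    scores = {}
    for gene in genes:
        freq_invasive = sum(gene in strain for strain in invasive) / len(invasive)
        freq_gut = sum(gene in strain for strain in gut_restricted) / len(gut_restricted)
        scores[gene] = freq_invasive - freq_gut
    # Strongest association first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-abs(item[1]), item[0]))


# Hypothetical strains, each listed by the genes that carry a mutation.
invasive = [{"geneA", "geneB"}, {"geneA", "geneC"}, {"geneA"}]
gut_restricted = [{"geneC"}, {"geneB", "geneC"}, set()]

for gene, score in rank_genes(invasive, gut_restricted):
    print(f"{gene}: {score:+.2f}")
```

In this toy data, the hypothetical geneA is mutated in every invasive strain and in no gut-restricted strain, so it ranks first. A real model would also have to account for combinations of genes and for how closely related the strains are.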

Even though we don’t have as much data on coronaviruses as we have on Salmonella, the same models could be applied to them in the future. Just as the models can predict which mutations make Salmonella better at spreading beyond the gut, they could predict which mutations would make a coronavirus bind to ACE2 even more strongly, making it more contagious and dangerous. By understanding these properties, researchers could use algorithms to detect these high-risk strains, allowing for faster detection and containment of high-risk diseases. This would also be a great help in developing more effective vaccines.

3. Which treatments will be effective in curing an infection?

In order to design drugs or vaccines effectively, we have to know which genes are mutating, how frequently they mutate and how the pathogen binds to human cells. Knowing the point of entry of a virus makes it easier to design drugs that prevent infection. By understanding which genes might be changing, several scenarios can be constructed for possible future mutations, and medication can be developed accordingly. The speed of mutation matters as well, because it determines whether developing a vaccine makes sense.

Spike protein (S)

Previous research revealed that coronaviruses invade cells through so-called “spike” proteins. This protein mediates the virus’ entry into human cells, as it binds to the aforementioned ACE2 enzyme. Spike proteins take on different shapes in different coronaviruses, and finding that shape is the first step in vaccine development. Researchers derived its structure from the virus’ sequence at the beginning of February. The team responsible for this critical breakthrough had spent years working on other coronaviruses, including SARS-CoV and MERS-CoV. Because of their extensive previous research, they knew which genes are responsible for constructing the spike protein. Knowing only its “recipe”, which consists of a sequence of amino acids, is not enough. Proteins are large and complex molecules, and one has to understand how the amino acids are arranged in space to see how the protein binds to the host’s cell. Finding its unique 3D structure involves a lot of puzzling, which is called “folding”. Because there are so many degrees of freedom, the number of ways a protein can fold is astronomical. Identifying the best structure requires significant time and computing power.

Even citizen scientists are pitching in to help design potential treatments. Through a crowdsourcing computer game called Foldit, people from all over the world can fold proteins and submit their best folding structures. The platform releases challenges asking users to design proteins that could block the binding of the virus to human cells. Promising solutions are then produced in the lab and tested to see if they work.

Citizens helping to combat the virus is a welcome gesture, but if these analyses were carried out by humans alone, it would take longer than the age of the known universe to examine all possible shapes of a protein before finding its unique 3D structure. It is therefore essential to automate the process of modeling protein structures, especially when you take into account that, like the rest of the virus, the spike protein can mutate over time.
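The “age of the universe” comparison can be checked with a back-of-the-envelope calculation, sketched below in Python. The numbers are assumptions in the spirit of the classic thought experiment on protein folding, not measurements: a hypothetical protein of 100 amino acids, 3 possible states per amino acid, and 10^13 shapes examined per second. The calculation works with logarithms so that it also copes with much longer proteins.

```python
from math import log10

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years


def log10_brute_force_years(
    n_residues: int,
    states_per_residue: int = 3,      # assumption: 3 shapes per amino acid
    shapes_per_second: float = 1e13,  # assumption: very fast sampling
) -> float:
    """Return log10 of the years needed to try every possible shape once."""
    log10_shapes = n_residues * log10(states_per_residue)
    return log10_shapes - log10(shapes_per_second) - log10(SECONDS_PER_YEAR)


exponent = log10_brute_force_years(100)
print(f"about 10^{exponent:.0f} years to try every shape")
print(f"age of the universe: about 10^{log10(AGE_OF_UNIVERSE_YEARS):.0f} years")
```

Under these assumptions, even a small protein would need on the order of 10^27 years of exhaustive search, many orders of magnitude longer than the age of the universe. This is why prediction methods aim to avoid trying every shape.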

Since the beginning of this year, many initiatives to automate this process have been launched. In January, Google’s DeepMind published AlphaFold, a cutting-edge system that predicts the 3D structure of a protein from its genetic sequence. Other labs are making progress as well, and keeping up with their results, and perhaps consolidating technologies in the future, can save months or even years of work.

Once the spike protein is understood, AI can help by suggesting components of a vaccine. Preventing the initial binding of the spike protein can deny the virus the opportunity to enter, and infect, a healthy cell.

Researchers have found some good news too: SARS-CoV-2 appears to be mutating more slowly than the seasonal flu, which may allow scientists to develop a vaccine. Specifically, SARS-CoV-2 seems to accumulate fewer than 25 mutations per year, whereas the seasonal flu accumulates almost 50 mutations per year. The significantly slower mutation rate of SARS-CoV-2 gives us hope for the development of effective, long-lasting vaccines against the virus.
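To put these rates in perspective, the short Python sketch below turns them into expected mutation counts over a given period and into a rough rate per genome position. The yearly rates are the upper bounds quoted above. The genome lengths (about 29,900 bases for SARS-CoV-2 and about 13,500 for influenza A) are approximate values added here as assumptions, and the calculation assumes mutations accumulate at a constant pace.

```python
def expected_mutations(rate_per_year: float, months: float) -> float:
    """Expected number of accumulated mutations, assuming a constant rate."""
    return rate_per_year * months / 12


def per_site_rate(rate_per_year: float, genome_length: int) -> float:
    """Mutations per genome position per year."""
    return rate_per_year / genome_length


# Upper bounds from the text; genome lengths are approximate assumptions.
SARS_COV_2 = {"rate": 25, "genome_length": 29_900}
SEASONAL_FLU = {"rate": 50, "genome_length": 13_500}

for name, virus in (("SARS-CoV-2", SARS_COV_2), ("seasonal flu", SEASONAL_FLU)):
    six_months = expected_mutations(virus["rate"], months=6)
    site_rate = per_site_rate(virus["rate"], virus["genome_length"])
    print(f"{name}: up to {six_months} mutations in 6 months, "
          f"{site_rate:.1e} per position per year")
```

Because the flu genome is less than half as long, the gap per position is even larger than the raw yearly counts suggest: under these assumptions the flu changes a bit more than four times as fast per position.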

In the future, researchers could use AI to run unsupervised learning algorithms that simulate the possible evolution paths of the virus. By virtually adding potential vaccines, they could analyze whether the virus could mutate to develop resistance. This would allow virologists to stay a few steps ahead of the virus and prepare vaccines in case certain mutation scenarios arise.

Practical limitations

Data and algorithmic models aren’t the only bottlenecks. Analyzing pathogens and developing effective medication and vaccines requires tremendous amounts of processing power. The US government started the COVID-19 High Performance Computing Consortium, which brings together government, industry and academia to provide access to powerful high-performance computing resources. In China, Huawei Cloud provides an automated sequencing analysis engine and the much-needed computing power to support multi-sample cloud analysis. Alibaba is also offering free AI computing power to scientific research organizations to support the search for a vaccine or treatments. Researchers at Stanford University appealed to civilians to donate some of their unused CPU and GPU computing power to help search for a vaccine for the coronavirus. The distributed computing project is called Folding@home and gives researchers access to a virtual supercomputer.


Emerging pathogens are a major threat to public health, and understanding how pathogens adapt remains a challenge. Data on pathogens is still insufficient for thorough analysis and, despite great developments in the last two years, the models that provide insights are underdeveloped too. But there is hope. The genomic data sets are growing and research in machine learning is accelerating. Machine learning algorithms have proven able to detect the significant mutations in genes that make a pathogen able to infect humans or make it more dangerous. Scientists around the world are using genetics to better understand the bacteria causing infections, how diseases spread, how pathogens gain resistance to drugs, and which strains may cause outbreaks.

The future of disease surveillance, however, does not rely merely on data and AI algorithms. Strong leaders are needed to get governments, businesses and health care providers to trust these tools and to build support for implementing them during disease outbreaks. Technology already offers incredible possibilities, but if we do not lower the barrier to accepting AI in society, we will miss out on many of its advantages, just as happened with BlueDot’s warnings. Making the most of AI will take a lot of data, time, and smart coordination between many different stakeholders globally.
