Using Wiki-data to Monitor, Forecast Disease Outbreaks
Researchers at Los Alamos National Laboratory used open source data culled from the online encyclopedia Wikipedia to develop a disease monitoring and forecasting tool they claim is more comprehensive than existing tools.
Using Wikipedia access logs, the government researchers said they applied linear models, language (as a proxy for location) and a Wikipedia article selection procedure to test 14 combinations of location and disease outbreaks. The results suggest that their models could be transferred from one location to another without re-training.
The ability to transfer models across regions opens up the possibility that data analytics could be used to “train” the model using public health data from one location and apply it elsewhere. Such a capability is seen as critical in regions lacking reliable disease databases.
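To make the idea of transferring a model concrete, here is a minimal sketch in Python with NumPy. It is not the Los Alamos code. The article counts, weights and case figures are synthetic placeholders, and every function name is hypothetical. It fits an ordinary least-squares linear model on one location's article views and official case counts, then scores the same coefficients on a second location without re-training.

```python
import numpy as np


def design_matrix(views):
    """Append a column of ones so the model can learn an intercept."""
    return np.column_stack([views, np.ones(len(views))])


def fit_linear_model(views, cases):
    """Ordinary least-squares fit of case counts on article view counts."""
    coef, *_ = np.linalg.lstsq(design_matrix(views), cases, rcond=None)
    return coef


def predict(coef, views):
    """Apply previously fitted coefficients to a new set of article views."""
    return design_matrix(views) @ coef


def r_squared(actual, predicted):
    """Coefficient of determination between official and estimated counts."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic placeholder data (an assumption, not real surveillance data):
# 52 weekly observations of normalized views for 3 disease-related articles.
rng = np.random.default_rng(0)
hidden_weights = np.array([2.0, 0.5, 1.0])

views_a = rng.random((52, 3))               # location A, used for training
cases_a = views_a @ hidden_weights + 0.1

views_b = rng.random((52, 3))               # location B, never seen in training
cases_b = views_b @ hidden_weights + 0.1

coef = fit_linear_model(views_a, cases_a)   # train on A only
r2_transfer = r_squared(cases_b, predict(coef, views_b))
print("r^2 on location B without re-training:", r2_transfer)
```

Because both synthetic locations share the same underlying relationship, the transferred model scores well here. With real data, a fit that holds up this way across locations is what would suggest a model is transferable.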
“The goal of this research is to build an operational disease monitoring and forecasting system with open data and open source code,” said Los Alamos scientist Sara Del Valle.
The forecasting tool, based on analysis of Wikipedia article views, was used to monitor influenza outbreaks in Japan, Poland, Thailand and the United States, along with dengue fever in Brazil and China and tuberculosis in China and Thailand. In all but one instance, the Chinese tuberculosis outbreak, the researchers said they were able to forecast the spread of the infectious diseases at least 28 days in advance.
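One simple way to picture a 28-day forecast is to pair each day's article views with the case count reported 28 days later and fit a line through those pairs. The Python sketch below does this on synthetic data. It is an illustration under stated assumptions, not the researchers' method: the function names, the single-article input and the made-up series are all hypothetical.

```python
import numpy as np


def lagged_pairs(views, cases, lag_days):
    """Pair views on day t with cases on day t + lag_days."""
    if lag_days <= 0:
        raise ValueError("lag_days must be positive")
    return views[:-lag_days], cases[lag_days:]


def fit_forecast_model(views, cases, lag_days):
    """Fit cases[t + lag] = slope * views[t] + intercept by least squares."""
    x, y = lagged_pairs(views, cases, lag_days)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


# Synthetic placeholder series (an assumption): daily views of one article,
# with case counts that echo the views 28 days later.
lag = 28
days = 200
t = np.arange(days)
views = 100 + 50 * np.sin(2 * np.pi * t / 90)

cases = np.full(days, 3 * views[0] + 1.0)   # filler for the first 28 days
cases[lag:] = 3 * views[:-lag] + 1.0

slope, intercept = fit_forecast_model(views, cases, lag)

# Forecast the case count 28 days after the last observed day of views.
forecast = slope * views[-1] + intercept
print("forecast for day", days - 1 + lag, "=", forecast)
```

The useful property of a lagged model is that today's page views, which are available almost immediately, stand in for official counts that would not be published for weeks.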
Based on their preliminary results, the researchers proposed additional development of their proof-of-concept to create an expanded disease monitoring and forecasting system that covers the entire planet. An obvious region to include is western Africa, where the Ebola virus has been rampaging for months.
Similar disease monitoring tools have already been implemented. HealthMap, a web-based utility and algorithm created by researchers at Boston Children’s Hospital in 2006, was instrumental in spotting the Ebola epidemic at its source in southeastern Guinea last winter.
The HealthMap algorithm, developed by researchers, epidemiologists and software developers at the Boston hospital, focused on blog postings by medical personnel in the region about treating patients with Ebola-like symptoms. It also flagged local media reports that surfaced in March.
The findings were turned over to the World Health Organization, which issued its first public statement about the West Africa outbreak on March 23.
In eight of the 14 combinations of location and disease outbreaks analyzed by the Los Alamos researchers, the estimate produced by their model closely matched official data. In the remaining cases, the model failed because the patterns in either the official data or the Wikipedia data “were too subtle to capture.”
Hence, the researchers said more “disease-location contexts” must be tested along with improved article selection, geo-location and other improvements in how Wikipedia articles are analyzed. Once these improvements are in place, the performance of models in one context can be tested in others, they said.
In the meantime, “We have outlined a plausible path to a reliable, scientifically sound, operational disease surveillance system,” the Los Alamos researchers said.
Their research was published in the journal PLOS Computational Biology.