March 18, 2020

AI Called on to Mine Massive Coronavirus Dataset, CORD-19


Working with a coalition of government, academic, and industry leaders, the White House this week released COVID-19 Open Research Data (CORD-19), a massive collection of 29,000 scholarly articles about the ongoing coronavirus outbreak. By releasing the data, leaders hope scientific researchers will be able to accelerate development of treatment for COVID-19 using AI techniques.

CORD-19 represents “the most extensive machine-readable Coronavirus literature collection available for data and text mining to date, with over 29,000 articles, more than 13,000 of which have full text,” the White House Office of Science and Technology Policy stated in a press release Monday.

“Decisive action from America’s science and technology enterprise is critical to prevent, detect, treat, and develop solutions to COVID-19,” the U.S. CTO, Michael Kratsios, said in the press release. “We thank each institution for voluntarily lending its expertise and innovation to this collaborative effort, and call on the United States research community to put artificial intelligence technologies to work in answering key scientific questions about the novel Coronavirus.”

Multiple groups have volunteered to host the CORD-19 dataset, including Microsoft Research, the Allen Institute for Artificial Intelligence, the National Institutes of Health’s National Library of Medicine, the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Cold Spring Harbor Laboratory, and Kaggle, which is owned by Google.

Kaggle is also hosting a research challenge associated with the CORD-19 dataset to help spur interest in the AI and data science community. The challenge currently comprises 10 separate tasks. The tasks, each of which carries a $1,000 prize, predominantly ask the community to summarize the data contained in the CORD-19 dataset.

Kaggle stated: “We are issuing a call to action to the world’s artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions. The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This allows the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide.”

According to the Allen Institute for AI, the JSON-formatted data contained in the CORD-19 dataset “contains all COVID-19 and coronavirus-related research (e.g. SARS, MERS, etc.)” from sources like PubMed, the World Health Organization, bioRxiv, and medRxiv. Metadata is also included. The Seattle, Washington-based institute is also making available a host of handy tools that researchers can use to analyze the CORD-19 data, including scispaCy, SciBERT, and the Semantic Scholar API, among others.
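For readers who want a feel for what working with JSON-formatted articles involves, here is a minimal Python sketch. It assumes each article is a JSON record with `paper_id`, `metadata.title`, `abstract`, and `body_text` fields. Those field names and the sample record are illustrative assumptions, not details given in the announcement, so check them against the dataset's own documentation before relying on them.

```python
import json

# Hypothetical sample record. The field names are assumptions made for
# illustration; they are not taken from the article.
SAMPLE = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "An illustrative coronavirus study"},
  "abstract": [{"text": "We study transmission of a novel coronavirus."}],
  "body_text": [
    {"section": "Introduction", "text": "Coronaviruses include SARS and MERS."},
    {"section": "Methods", "text": "We reviewed published case reports."}
  ]
}
"""


def extract_text(record):
    """Flatten one parsed article record into title, abstract, and body strings."""
    title = record.get("metadata", {}).get("title", "")
    abstract = " ".join(p.get("text", "") for p in record.get("abstract", []))
    body = " ".join(p.get("text", "") for p in record.get("body_text", []))
    return {
        "paper_id": record.get("paper_id", ""),
        "title": title,
        "abstract": abstract,
        "body": body,
    }


def mentions(doc, term):
    """Case-insensitive check for a term anywhere in the flattened text."""
    haystack = " ".join([doc["title"], doc["abstract"], doc["body"]]).lower()
    return term.lower() in haystack


doc = extract_text(json.loads(SAMPLE))
print(doc["title"])
print(mentions(doc, "MERS"))
```

A simple keyword check like this is only a starting point; the text-mining tools mentioned above would replace `mentions` with something far more capable.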

Humanity needs AI technology to succeed in battling the coronavirus pandemic, according to Oren Etzioni, the CEO of the Allen Institute for Artificial Intelligence and a computer science professor at the University of Washington.

“The scientific literature on coronavirus is growing exponentially,” Etzioni said Monday during a conference on the release of CORD-19. “The scientists need AI capability to do their research on COVID-19 quickly and efficiently, with the goal of both doing prevention, detection, treatment, and vaccination.”

In particular, Etzioni highlighted the work the AI community has done on Semantic Scholar, an academic discovery engine that began five years ago. This work “has prepared us for this moment where humanity needs scientists to succeed and to succeed quickly,” he said.

Microsoft contributed its indexing and mapping technology as part of the CORD-19 initiative, said Eric Horvitz, Microsoft’s chief scientific officer. “Our goal in creating this open data set and Kaggle challenge with coronavirus is to stimulate the AI community to create tools that can help scientists to stay on top of thousands of articles to enable them to develop deeper understanding and approaches to addressing the COVID-19 pandemic,” he said during the White House conference call.

“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings,” Kaggle co-founder and CEO Anthony Goldbloom stated in the press release. “Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19.”

AI2’s Etzioni noted how AI and high tech in general have “gotten something of a bad rap recently,” but are now being called upon to help save humanity. “This [CORD-19 initiative] describes how AI can potentially do a world of good,” Etzioni said. “It’s perhaps ironic that AI, which has caused consternation with facial recognition, deep fakes, etc., is now at the front-lines of helping scientists confront COVID-19.”

But don’t expect AI to deliver a silver bullet solution to coronavirus. “AI won’t solve this problem on its own,” Etzioni said. “AI will enable scientists, doctors, nurses and policymakers to succeed. In this context, AI means augmented intelligence.”
