July 18, 2022

IBM Research Open-Sources Deep Search Tools


IBM Research’s Deep Search product uses natural language processing (NLP) to “ingest and analyze massive amounts of data—structured and unstructured.” Over the years, Deep Search has seen a wide range of scientific uses, from Covid-19 research to molecular synthesis. Now, IBM Research is streamlining the scientific applications of Deep Search by open-sourcing part of the product through the release of Deep Search for Scientific Discovery (DS4SD).

DS4SD includes specific segments of Deep Search aimed at document conversion and processing. First is the Deep Search Experience, a document conversion service that includes a drag-and-drop interface and interactive conversion to allow for quality checks. The second element of DS4SD is the Deep Search Toolkit, a Python package that allows users to “programmatically upload and convert documents in bulk” by pointing the toolkit to a folder whose contents will then be uploaded and converted from PDFs into “easily decipherable” JSON files. The toolkit integrates with existing services, and IBM Research is welcoming contributions to the open-source toolkit from the developer community.
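The bulk workflow described above (point at a folder, convert every PDF, get JSON back) can be pictured with a minimal Python sketch. This is not the Deep Search Toolkit's API: the function names `find_pdfs`, `convert_folder`, and `placeholder_convert` are hypothetical, and the conversion step is a stand-in that only records file metadata. In real use, that step would presumably be replaced by a call into the toolkit itself.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def find_pdfs(folder: Path) -> List[Path]:
    """Return every PDF directly inside `folder`, sorted by path."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_folder(
    folder: Path,
    out_dir: Path,
    convert: Callable[[Path], Dict],
) -> List[Path]:
    """Convert each PDF in `folder` and write one JSON file per document.

    `convert` is a hypothetical callable standing in for the real
    conversion service; it takes a PDF path and returns a dict.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in find_pdfs(folder):
        document = convert(pdf)  # assumption: conversion yields JSON-serializable data
        target = out_dir / (pdf.stem + ".json")
        target.write_text(json.dumps(document, indent=2), encoding="utf-8")
        written.append(target)
    return written


def placeholder_convert(pdf: Path) -> Dict:
    """Stand-in converter: records file metadata instead of parsing the PDF."""
    return {"source": pdf.name, "size_bytes": pdf.stat().st_size}
```

Passing the converter in as an argument keeps the folder-walking logic separate from whichever service actually does the conversion, so the placeholder can be swapped out without touching the rest.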

IBM Research paints DS4SD as a boon for handling unstructured data (data not contained in a structured database). This data, IBM Research says, holds a “lot of value” for scientific research; by way of example, it cites IBM’s own Project Photoresist, which in 2020 used Deep Search to comb through more than 6,000 patents, documents, and material data sheets in the hunt for a new molecule. IBM Research says that Deep Search offers up to a 1,000× data ingestion speedup and up to a 100× data screening speedup compared to manual alternatives.

The launch of DS4SD follows the March 2022 launch of GT4SD, IBM Research’s Generative Toolkit for Scientific Discovery, an open-source library to accelerate hypothesis generation for scientific discovery. Together, DS4SD and GT4SD constitute the first steps in what IBM Research is calling its Open Science Hub for Accelerated Discovery. IBM Research says more is yet to come, with “new capabilities, such as AI models and high quality data sources” to be made available through DS4SD in the future. Deep Search has also added “over 364 million” public documents (like patents and research papers) for users to leverage in their research, a big change from the previous “bring your own data” nature of the tool.

The Deep Search Toolkit is accessible here.

