Yahoo’s Vespa Takes a Whack at CORD-19 Data
Verizon Media (formerly Yahoo) is giving its new Vespa search engine a chance to show what it can do against CORD-19, the collection of scholarly articles about COVID-19. The company is inviting the public to try using Vespa against the data set.
The White House unveiled its COVID-19 Open Research Data (CORD-19) initiative three weeks ago to help AI, text mining, and natural language processing (NLP) experts spur research into possible treatments for COVID-19. The dataset initially contained 29,000 entries and has grown substantially since then.
Verizon Media decided to get involved and do what it could to assist with the effort, says Kristian Aune, a tech product manager with Verizon Media. “Given our experience with big data at Yahoo (now Verizon Media) and creating Vespa (open source big data serving engine), we thought the best way to help was to index the dataset, which includes over 44,000 scholarly articles, and to make it available for searching via Vespa Cloud,” Aune writes in a blog post on the Yahoo Developers site.
The team’s website (https://cord19.vespa.ai/) allows users to search the entire CORD-19 repository using keywords and phrases. A search for “vaccine” yielded more than 4,200 matches, while a search for “hydroxychloroquine” yielded 30. Clicking a result on the search results page takes the user to the full abstract of the selected paper, along with a link to read the full paper on the site where it is hosted.
Vespa is a distributed search and recommendation engine that Yahoo started developing in 2003 to serve personalized results from large data sets at massive scale in real time. Although Yahoo was the main backer of Hadoop and, as of 2014, ran most of its operations on that open source platform, Vespa’s combination of speed and scale gave Yahoo capabilities it could not get from other technologies, including Hadoop and Storm.
Yahoo eventually based many of its internal systems on Vespa, including its core Web search engine, advertising, and image search, among others. When Oath (as Yahoo was then called) released Vespa as an open source project in 2017, the company said Vespa was serving 90,000 pieces of content or advertisements per second. On Flickr, Vespa delivered a few hundred queries per second across tens of billions of images.
Yahoo put a lot of time into architecting the Java-based search engine to be able to serve large amounts of data at very low latencies. Vespa distributes data and computation across many machines without the bottleneck of having a single master, the company explained. Data is constantly re-distributed in the background to maintain consistency and prevent downtime due to single machine failures.
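To make the idea of distributing data without a single master more concrete, here is a minimal Python sketch. It is not Vespa's actual distribution algorithm, which the article does not describe. It uses rendezvous (highest-random-weight) hashing, a generic technique swapped in for illustration, and the node names, document IDs, and the place function are all hypothetical. The point is that any machine can compute where a document belongs on its own, and that losing one machine only affects the documents that machine held.

```python
import hashlib
from typing import List


def score(node: str, doc_id: str) -> int:
    """Deterministic pseudo-random weight for a (node, document) pair."""
    digest = hashlib.sha256(f"{node}:{doc_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(doc_id: str, nodes: List[str], replicas: int = 2) -> List[str]:
    """Return the nodes that should hold doc_id, highest weight first.

    Every participant can run this locally with the same node list,
    so no central master has to be asked where a document lives.
    """
    ranked = sorted(nodes, key=lambda n: score(n, doc_id), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    # Hypothetical cluster and documents, for illustration only.
    nodes = ["node-a", "node-b", "node-c", "node-d"]
    docs = [f"doc-{i}" for i in range(1000)]

    before = {d: place(d, nodes) for d in docs}

    # Simulate a single machine failure.
    survivors = [n for n in nodes if n != "node-c"]
    after = {d: place(d, survivors) for d in docs}

    affected = sum(1 for d in docs if "node-c" in before[d])
    disturbed = sum(
        1 for d in docs
        if "node-c" not in before[d] and before[d] != after[d]
    )
    print(f"documents that had a copy on the failed node: {affected}")
    print(f"other documents whose placement changed: {disturbed}")
```

Because removing a node does not change the relative ranking of the remaining nodes, documents that had no copy on the failed machine keep exactly the same placement. Only the affected documents need a new replica, which is the kind of work a system could do in the background.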
Vespa consists of a combination of stateless Java container clusters and associated content clusters that store data, Verizon Media says on the Vespa project website. This stateless approach simplifies scaling for administrators, while the incorporation of middleware logic for reading and writing data simplifies life for developers. For more info on Vespa, see vespa.ai.
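The split between stateless containers and content clusters can be pictured with a toy Python sketch. This is an assumption-laden illustration, not Vespa code or its API: the ContentNode class, the handle_query function, and the term-count scoring are all hypothetical. The query handler keeps no data of its own; it fans the query out to the nodes that store documents and merges their partial results.

```python
import heapq
from typing import Dict, List, Tuple


class ContentNode:
    """Hypothetical stateful node: owns a slice of the documents."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = dict(docs)

    def search(self, term: str, limit: int) -> List[Tuple[int, str]]:
        """Return up to `limit` local hits as (score, doc_id) pairs."""
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            # Toy relevance score: how often the term occurs in the text.
            count = text.lower().split().count(term)
            if count > 0:
                hits.append((count, doc_id))
        return heapq.nlargest(limit, hits)


def handle_query(term: str, content_nodes: List[ContentNode],
                 limit: int = 3) -> List[str]:
    """Stateless handler: scatter the query, then merge the top hits.

    It stores nothing between calls, so more copies of it could be
    started or stopped without moving any data.
    """
    partial: List[Tuple[int, str]] = []
    for node in content_nodes:
        partial.extend(node.search(term, limit))
    return [doc_id for _, doc_id in heapq.nlargest(limit, partial)]


if __name__ == "__main__":
    # Hypothetical documents spread over two content nodes.
    cluster = [
        ContentNode({"d1": "vaccine trial vaccine results",
                     "d2": "hospital capacity"}),
        ContentNode({"d3": "vaccine vaccine vaccine candidates",
                     "d4": "vaccine overview"}),
    ]
    print(handle_query("vaccine", cluster, limit=2))
```

In this sketch all state lives in the ContentNode objects, so scaling the query-handling side is just a matter of running more handlers against the same content nodes.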