Microsoft Applies Deep Learning to Vector Search
Since acquiring GitHub last June, Microsoft has sought to make good on its pledge to retain the project collaboration platform’s “developer-first ethos.” This week it turned over to GitHub an AI search tool as an open-source project.
The vector search approach encapsulated in an algorithm called Space Partition Tree and Graph attempts to address the reality that growing data volumes have made keyword search “brittle.” The algorithm takes advantage of deep learning models to search collections of information known as “vectors” in milliseconds.
“As deep learning became more prevalent, we applied it to some of these problems keyword search wasn’t working for,” said Rangan Majumder, Microsoft’s group program manager for Bing search and AI.
The goal is to deliver relevant results faster using vectors, or numerical representations of a data point, word or image pixel. Hence, Majumder said, his team built its vector search platform to execute search queries more efficiently.
Deep learning models were applied to those vectors to better understand and represent the intent of a search, addressing ambiguities such as words with different meanings. After analyzing search logs, researchers also discovered that queries were getting progressively longer, indicating frustration with conventional keyword searches and what Majumder suspects were users “trying to act like computers.”
That, he added, largely defeated the purpose of a search engine: a relevant answer delivered quickly.
Along with the application of deep learning models, the vector search initiative involved more than 150 billion pieces of data indexed by the search engine to improve the traditional matching of keywords. To harden the otherwise brittle approach, the indexed data included web page content, full queries and other media types along with characters and single words. The search engine then scanned the indexed vectors to come up with a relevant match.
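To make the idea of scanning indexed vectors for a match concrete, here is a minimal Python sketch. It is not Microsoft's Space Partition Tree and Graph algorithm: it compares the query against every stored vector, which would be far too slow at Bing's scale, and it only illustrates the matching step. The function names, the three-dimensional vectors and the sample queries are all invented for illustration; real systems use vectors produced by deep learning models, with many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def nearest(query, index, k=1):
    """Exhaustively score every indexed vector and return the k closest keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical toy index: the numbers are made up, not model output.
TOY_INDEX = {
    "how tall is the eiffel tower": [0.9, 0.1, 0.0],
    "eiffel tower height in meters": [0.8, 0.2, 0.1],
    "best pizza in paris": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    # A query vector that sits close to the two Eiffel Tower entries.
    print(nearest([0.85, 0.15, 0.05], TOY_INDEX, k=2))
```

The two Eiffel Tower entries share almost no keywords beyond the landmark's name, yet their vectors point in nearly the same direction, so both rank ahead of the unrelated entry. An index of billions of vectors cannot be scanned exhaustively in milliseconds, which is the gap that approximate search structures are designed to close.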
While the “vectorizing” of search data and other media is not new, Microsoft (NASDAQ: MSFT) argues that it can only be scaled using massive search engines like Bing and Google (NASDAQ: GOOGL). Those search engines process billions of documents daily, and “the idea now is that we can represent these entries as vectors and search through this giant index of 100 billion-plus vectors to find the most related results in 5 milliseconds,” added Jeffrey Zhu, program manager on Microsoft’s Bing team.
Along with faster document searches, Microsoft said its GitHub code contribution could be used to scan audio snippets to identify and translate spoken language or to provide new apps that could help label images.
While consumer applications based on vector searches are likely to emerge first, the contribution of the Space Partition Tree and Graph algorithm to GitHub also is intended to expand the framework to broader, enterprise applications, the company said this week in a blog post.