Unstructured Data Search Engine Has Roots in HPC
A search engine first developed in the HPC world to identify anomalies in biomedical images and then used by the military to track terrorists from UAV imagery is now being applied to the world’s vast supply of unstructured data, garnering the nickname “Facebook for Everything.”
Search was the original big data challenge (see: origins of Hadoop) and remains an instrumental part of the data analytics stack, with products like Solr and ElasticSearch continuing to attract a lot of attention and usage. And yet most search engines, like the ones powering Google and Microsoft’s Web services, are primarily focused on enabling search and retrieval from structured content, i.e. indexing documents and Web pages.
But the volume of unstructured content, such as images, video, and audio, far exceeds the volume of structured data in the world. IDC estimates that unstructured content accounts for 90% of all data, leaving organized words and numbers trailing far behind cat videos. With the amount of data in the world projected to grow to 40 zettabytes by 2020, that leaves a lot of data basically invisible to the traditional forms of search.
One company hoping to tap into the morass of unstructured data is DataFission. The San Jose, California firm was founded in 2013 with the goal of productizing a scale-out search engine, called the Digital Universe Search Engine, or DUSE, that it claims can index just about any piece of data and make it searchable from any Web-enabled device.
“It’s like image search, but it’s a much more general pattern-search than that, because we can search any sort of digital data, whether it’s audio or video or image,” says DataFission’s Chief Scientist Dr. Harold Trease. “The code doesn’t care. It’s just searching for a pattern.”
High-Dimensional Feature Vectors
The DataFission product works, at a high level, just like other search engines.
First, it indexes the raw data that the client wants to search, and then places the index tables, which are 100x to 1000x smaller than the original source data, onto the DataFission server. Users can then execute searches against the tables, which are stored in NoSQL stores or HDFS, by entering text strings or by dragging and dropping source images, videos, or audio files into DUSE’s search bar (or programmatically via REST APIs).
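To put the 100x to 1000x reduction in perspective, the short Python sketch below estimates how large the index tables would be for a given volume of source data. The function name and the use of decimal units (1 PB = 10^15 bytes) are assumptions made for illustration; only the reduction range comes from DataFission.

```python
PETABYTE = 10**15  # decimal units, an assumption for this sketch
TERABYTE = 10**12

def index_size_range(source_bytes, min_reduction=100, max_reduction=1000):
    """Return the (smallest, largest) expected index size in bytes."""
    return source_bytes / max_reduction, source_bytes / min_reduction

smallest, largest = index_size_range(1 * PETABYTE)
print(f"{smallest / TERABYTE:.0f} TB to {largest / TERABYTE:.0f} TB")
```

Under these assumptions, a petabyte of source data would shrink to an index of roughly 1 TB to 10 TB.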
The secret sauce lies in how the company indexes the data. A combination of machine learning techniques, such as principal component analysis (PCA), clustering, and classification algorithms, along with graph link analysis and a “nearest neighbor” approach, helps to find associations in the data.
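DataFission has not published its algorithm, so the following is only a generic, textbook illustration of one of the techniques named above, PCA, and not the company's implementation. It uses power iteration on the covariance matrix to find the direction of greatest variance in a small made-up data set. All function names and sample values are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def first_principal_component(rows, iterations=100, seed=0):
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    cov = covariance(center(rows))
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in cov]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        product = [sum(c * v for c, v in zip(cov_row, vec)) for cov_row in cov]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    return vec

def project(rows, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in center(rows)]

# Made-up 2-D "feature vectors" that all lie on the line y = 2x.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component = first_principal_component(samples)
coords = project(samples, component)
print(component, coords)
```

Because the sample points lie on a single line, one component captures all of their variance, so each two-number vector can be replaced by one coordinate without losing information. Real data is never this tidy, but the same idea is commonly used to shrink high-dimensional vectors before searching them.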
While the company isn’t sharing all the details about how its algorithm was built, Dr. Trease provided some general info on how it works.
“We generate a high-dimensional signature, a high-dimensional feature vector, that quantifies the information content of the data that we read through,” he says. “We’re not looking for features like dogs or cats or buildings or cars. We’re quantifying the information content related to the data that we read. That’s what we index and put in a database. Then if you pull out a cell phone and take a picture of the dog, we convert that to one of these high-dimensional signatures, and then we compare that to what’s in the database and we find the best matches.”
The software doesn’t have a notion of what is a “dog,” Dr. Trease continues, nor does it look to extract features that are associated with a dog. It’s only through the feature vector that the essential “dogness” is expressed and made available for searching and retrieving. This process works the same whether the source data is a cat video on YouTube, a picture of Timmy hugging Lassie, underwater audio of supertankers arriving at Long Beach Harbor, or C++ source code from a piece of malware.
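The index-then-match flow Dr. Trease describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DataFission's method: the `signature` function below is a simple byte-value histogram chosen only because it works on any kind of bytes, and cosine similarity is one common way to compare vectors. All names and sample data are hypothetical.

```python
import math

def signature(data, bins=16):
    """Stand-in feature vector: normalized histogram of byte values."""
    counts = [0] * bins
    for byte in data:
        counts[byte * bins // 256] += 1
    total = len(data) or 1
    return [count / total for count in counts]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_matches(query, index, k=3):
    """Return the k indexed items whose signatures are closest to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Index three made-up byte blobs; only their signatures are stored.
index = {
    "low_bytes": signature(bytes(range(0, 64)) * 4),
    "high_bytes": signature(bytes(range(192, 256)) * 4),
    "mixed": signature(bytes(range(256))),
}

# A new, shorter blob is converted to a signature and compared to the index.
query = signature(bytes(range(0, 64)))
results = best_matches(query, index, k=2)
print(results)
```

Note that the code never asks what the bytes represent. The query matches `low_bytes` best purely because their signatures point in the same direction, which mirrors the "the code doesn't care" idea on a very small scale.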
DUSE enables any and all pieces of unstructured data to be stored in the same repository, enabling patterns to be detected across different data types. “It’s really good at connecting the dots,” Dr. Trease tells Datanami. “You can discover a certain number of relationships with textual data. But if you combine that with text, audio, and video sensor data into one big search space, then you can cross-connect across all these and it will tell you an awful lot more about connections.”
DataFission calls the information generated by DUSE’s indexing algorithm “Galaxy Plots” because, when visualized, they appear like the nighttime sky (see the image at the top of this story for an example).
“If we index a billion images, we’d end up with a billion points in this search space, and we can look at that search space it has structure to it, and the structure is fantastic,” says Dr. Trease, whose wife, Lyn Trease, owns DataFission. “There’s all kinds of these points and clusters and strands that connect things. It makes a little less sense to humans, because we don’t see things like that. But to the code, it makes perfect sense.”
HPC Gov’t Roots
The patented technological underpinnings in the DataFission product trace their roots back to the 1990s, when the company’s founders focused on developing a system to automatically identify tumors and other anomalies hidden across large numbers of MRIs and other biomedical imagery.
“Then we ran into intel analysts who were watching walls of video screens,” Dr. Trease says. “They’d watch 16 screens at once for hours and we said ‘Why are you doing this? Let’s just build a tool to help you.’”
After getting some funding from a high performance computing (HPC) group at the Department of Energy, the technology was adopted by the intelligence community to chew through hours of video footage collected by Predator drones and other UAVs (unmanned aerial vehicles) tracking the movement of suspected terrorists in the Middle East.
The technology helps America’s military and spy agencies solve the “needle in a haystack” types of problems. “They needed ways to be able to say that they had seen a car meet another car at a building, and they needed a way to know how many times that has happened over the last two years,” he says. “To a certain extent, people can remember a little bit. But to get it to that fidelity they needed a tool.”
The Wider Web of Things
DataFission is now seeking to broaden its reach outside of the military and intelligence communities. The timing is good, considering the rising need to analyze real-time flows of unstructured data emanating from the Internet of Things (IoT).
Indexing video is one use case. A potential client is analyzing the viability of using DataFission to index decades’ worth of film to identify product placement opportunities. Another wants to use the technology to count the number of solar panel installations in the United States.
There are many potential uses for the technology. While the military and intelligence community is able to get eyes on 5% of the data, only 1% of publicly available data is ever put in front of a human, Dr. Trease says. DataFission says 88% of this so-called “Digital Universe” is video and still photos, while 10% is audio. The remaining 2% is text, but only a small fraction (about 5%) of that is indexed by Google.
“I don’t want to ding Google. I use it every day. I’m happy they’re out there,” he says. “But they are making a very small part of the Digital Universe searchable.”
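The figures DataFission quotes can be multiplied out to show just how small that searchable slice is. The snippet below is simple arithmetic on the quoted percentages; the variable names are illustrative.

```python
# Shares of the "Digital Universe" as quoted by DataFission.
video_and_photos = 0.88
audio = 0.10
text = 0.02
text_indexed_by_google = 0.05  # fraction of the text share, not of all data

searchable_share = text * text_indexed_by_google
print(f"{searchable_share:.1%} of the Digital Universe")
```

By these numbers, about 0.1% of the Digital Universe, or one part in a thousand, is covered by conventional text search.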
The product could also be used to help cull the large amount of unstructured data flowing into an organization. DUSE could be useful for indexing large flows of unstructured data arriving via Apache Kafka queues. Because it only needs to read the source data once, the source data can be safely discarded, while its core features are safely indexed in an archive, and made available for analysis at a later time.
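The read-once pattern described above might look something like the following Python sketch. It is an illustration only: `fake_stream` is a hypothetical stand-in for messages arriving from a Kafka topic, and `compact_signature` is a toy byte histogram, not DUSE's feature extraction.

```python
def compact_signature(payload, bins=8):
    """Stand-in feature extractor: a normalized histogram of byte values."""
    counts = [0] * bins
    for byte in payload:
        counts[byte * bins // 256] += 1
    total = len(payload) or 1
    return tuple(count / total for count in counts)

def index_stream(records):
    """Read each record once and archive only its signature."""
    archive = {}
    bytes_read = 0
    for key, payload in records:
        archive[key] = compact_signature(payload)
        bytes_read += len(payload)
        # Only the signature is stored; the payload is not kept in the archive.
    return archive, bytes_read

def fake_stream():
    """Hypothetical stand-in for messages arriving from a Kafka topic."""
    yield "clip-1", bytes([10]) * 10_000
    yield "clip-2", bytes([200]) * 20_000

archive, bytes_read = index_stream(fake_stream())
print(bytes_read, "bytes read;", len(archive), "signatures kept")
```

Here 30,000 bytes pass through the indexer, but the archive holds only two eight-number signatures that can be searched later.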
Cloud and On-Prem
Indexing unstructured data is a big job, to be sure, and it takes a lot of horsepower for DataFission to run. It would take two full racks of servers a weekend to index 1PB worth of video, Dr. Trease says. And once the data is indexed, you’d want to hold onto that computing horsepower to ensure searches come back quickly. “It’s pointless to collect this data and not make it searchable,” he says.
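A back-of-the-envelope calculation shows what that claim implies for sustained throughput. The sketch assumes a decimal petabyte (10^15 bytes) and a 48-hour weekend, neither of which is specified by Dr. Trease.

```python
PETABYTE = 10**15            # decimal petabyte, an assumption
WEEKEND_SECONDS = 48 * 3600  # assumes a 48-hour weekend
RACKS = 2

aggregate_bytes_per_s = PETABYTE / WEEKEND_SECONDS
per_rack_bytes_per_s = aggregate_bytes_per_s / RACKS

print(f"aggregate: {aggregate_bytes_per_s / 1e9:.1f} GB/s")
print(f"per rack:  {per_rack_bytes_per_s / 1e9:.1f} GB/s")
```

Under those assumptions, the cluster would need to read and index roughly 5.8 GB of video per second, or about 2.9 GB per second per rack, for the entire weekend.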
DataFission used to rely on exotic HPC setups to provide that computing power, but today it’s relying on modern scale-out architectures consisting of 64-bit Linux clusters. The software itself, which today exists as a Python-based Apache Spark application, can be obtained as a software product or fully configured on a hardware appliance called DataHunter.
The company makes use of GPU and FPGA accelerators whenever it can. “The thing can never be fast enough,” Dr. Trease says. “NVidia keeps producing faster GPUs and Intel is adding Phi accelerators to their X86. They look pretty good. Intel just announced a hybrid X86 FPGA server. That looks like a good way to go.” The company is also investigating running DUSE on an adiabatic quantum computer from D-Wave Systems.
These days the company is working on getting its cloud offering off the ground. There’s still some work to do, particularly around ensuring high levels of fault tolerance and resilience, Dr. Trease says. It could become available as soon as next spring.