MIT Spinout Exploits GPU Memory for Vast Visualization
An MIT research project turned open source project dubbed the Massively Parallel Database (Map-D) is turning heads for its capability to generate visualizations on the fly from billions of data points. The software, an SQL-based, column-oriented database that runs in the memory of GPUs, can deliver interactive analysis of 10TB datasets with millisecond latencies. For this reason, its creator feels comfortable calling it “the fastest database in the world.”
Map-D is the brainchild of Todd Mostak, who created the software while taking a class in database development at MIT. By optimizing the database to run in the memory of off-the-shelf graphics processing units (GPUs), Mostak found that he could create a mini supercomputer cluster that offered an order of magnitude better performance than a database running on regular CPUs.
“Map-D is an in-memory column store coded into the onboard memory of GPUs and CPUs,” Mostak said today during a webinar on Map-D. “It’s really designed from the ground up to maximize whatever hardware it’s using, whether it’s running on Intel CPU or Nvidia GPU. It’s optimized to maximize the throughput, meaning if a GPU has this much memory bandwidth, what we really try to do is make sure we’re hitting that memory bandwidth.”
During the webinar, Mostak and Tom Graham, his fellow co-founder of the startup Map-D, demonstrated the technology’s capability to interactively analyze datasets composed of a billion individual records, constituting more than 1TB of data. The demo included a heat map of Twitter posts made from 2010 to the present. Map-D’s “TweetMap” (which the company also demonstrated at the recent SC 2013 conference) runs on eight Tesla K40 GPUs, each with 12 GB of memory, in a single-node configuration.
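The demo figures lend themselves to some back-of-envelope arithmetic. The sketch below uses the eight GPUs, 12 GB per GPU, and one billion records cited above. Two inputs are assumptions that were not stated in the webinar: the 288 GB/s peak memory bandwidth is Nvidia's published figure for the Tesla K40, and the 8-byte column width is chosen purely for illustration.

```python
# Back-of-envelope budget for the single-node TweetMap demo.
NUM_GPUS = 8                # from the demo
MEM_PER_GPU_GB = 12         # from the demo
RECORDS = 1_000_000_000     # from the demo

PEAK_BW_GB_S = 288          # assumption: Nvidia's published K40 peak bandwidth
BYTES_PER_VALUE = 8         # assumption: one 8-byte value per record in a column

# Aggregate GPU memory and how many bytes per record fit in it.
total_mem_gb = NUM_GPUS * MEM_PER_GPU_GB
bytes_per_record_budget = total_mem_gb * 1e9 / RECORDS

# Best-case time to stream one full column through all eight GPUs at once.
column_gb = RECORDS * BYTES_PER_VALUE / 1e9
aggregate_bw_gb_s = NUM_GPUS * PEAK_BW_GB_S
scan_ms = column_gb / aggregate_bw_gb_s * 1000

print(f"aggregate GPU memory: {total_mem_gb} GB")
print(f"in-memory budget: {bytes_per_record_budget:.0f} bytes per record")
print(f"full scan of one column: about {scan_ms:.1f} ms at peak bandwidth")
```

Under these assumptions, a scan of one full column lands in the single-digit millisecond range, which is consistent with the latencies claimed. Real queries would see less than peak bandwidth, so this is a lower bound rather than a measurement.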
Mostak searched the database of tweets for terms such as “flu.” The results were overlaid on a map of the United States, and then played out over time. Flu-related search hits started in the South (where the flu made its entry during the 2012 flu season) and progressed into the Northeast. He did the same for tweets related to “snow,” and the hits matched the march of storms across the U.S. “It’s not terribly useful,” he admits, “but it demonstrates the power of the system.”
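The kind of query behind such a heat map can be sketched in a few lines: filter tweets by a search term, then count the hits per map cell and time slice. This is only an illustration of the idea, not Map-D's implementation. The function name, the record fields, and the four sample tweets are all made up for the example.

```python
from collections import Counter

def heatmap_counts(tweets, term, cell_deg=1.0):
    """Count tweets containing `term`, keyed by (month, lat_cell, lon_cell)."""
    counts = Counter()
    needle = term.lower()
    for t in tweets:
        if needle in t["text"].lower():
            # Floor-divide coordinates so nearby tweets share one grid cell.
            cell = (
                t["month"],
                int(t["lat"] // cell_deg),
                int(t["lon"] // cell_deg),
            )
            counts[cell] += 1
    return counts

# Hypothetical sample records, invented for illustration only.
sample = [
    {"text": "Down with the flu", "lat": 29.7, "lon": -95.4, "month": "2012-11"},
    {"text": "Flu shot today", "lat": 29.2, "lon": -95.9, "month": "2012-11"},
    {"text": "Snow day!", "lat": 42.3, "lon": -71.1, "month": "2013-01"},
    {"text": "flu season is here", "lat": 42.4, "lon": -71.0, "month": "2013-01"},
]

flu = heatmap_counts(sample, "flu")
for cell, n in sorted(flu.items()):
    print(cell, n)
```

Playing the result "over time" then amounts to drawing the cells for one month after another.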
Graham and Mostak, who previously worked at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), are in the process of developing Map-D into a full-fledged business based on the commercial open source model.
|A Map-D heatmap of political campaign donations|
The company, which was founded in 2013, is currently working with several clients, including NASA, which is seeking a better visualization engine for analyzing historical ice flows; PayPal, which needs a real-time visualization platform to monitor the 3 million-plus data points it generates per second; pharmaceutical giant Novartis, which is looking to speed up interactive pattern matching for drug R&D; Major League Baseball, which is looking for an interactive platform to analyze every pitch made since 1980 for in-game broadcasts and its website; and the U.S. government, which is exploring Map-D for various military applications of GIS and Web mapping platforms.
“We like to say it allows the science of the process to occur at the speed of thought,” Mostak said during the webinar. “If you have a hypothesis, with a normal system you’d have to test the hypothesis by making a query, wait two hours, make coffee, take a nap, come back. But what we’re doing is you can immediately test your hypothesis, iterate, and basically refine that hypothesis, and test again. So you’re doing scientific process at the speed of thought.”
The product has been tested against Nvidia’s GPUs, and the company is currently working with Intel to get it running reliably against the Intel Xeon Phi platform. Map-D is also working on getting the database to run on mobile chips, such as Nvidia Tegra, and on ramping up the scalability. The company is currently working on a four-node cluster with 32 GPUs.
Map-D is hoping to capitalize on the need for real-time, interactive analytics platforms that deliver low latency and allow users to act upon the data as it arrives. This pits the software against the analytics elephant in the room–Hadoop. But Mostak has his own thoughts on how analytics can best be delivered.
|Map-D creator Todd Mostak|
“One thing Hadoop won’t allow you to do is interactive analysis–scanning a billion tweets or scanning political donation records or cell phone records in milliseconds, and being able to visualize it and see patterns and changes,” he says. “A lot of times, when people talk about big data, they think, ‘Oh it has to be petabytes and it has to be running on Hadoop.’ But really what oftentimes big data is, is pushing the limits of what you can do given the size of the data set.”
Map-D also does without indexing. “Basically it’s relying on the raw power of graphics processors to do everything in real time, so you’re not limited to what the person who made the database schema decided to pre-compute or indexed,” Mostak continues. “Map-D doesn’t require indexing. Since it’s doing raw scans, you’re going to get great performance out of the box.”
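What an index-free raw scan over a column store looks like can be shown with a small sketch. It is a sequential Python stand-in, not Map-D code: on a GPU, many threads would check different positions of the columns in parallel. The donation columns, their values, and the `scan_sum` helper are hypothetical.

```python
from array import array

# Column store: each column is its own contiguous sequence, and a "row"
# is simply the same position across columns. Values are invented.
amount = array("d", [250.0, 1000.0, 50.0, 2500.0, 500.0])
year = array("i", [2008, 2012, 2012, 2012, 2008])
state = ["MA", "CA", "MA", "NY", "CA"]

def scan_sum(values, *predicates):
    """Sum `values` over rows where every (column, test) pair passes.

    No index is consulted: every row is checked on every query, so any
    ad hoc combination of filters costs one pass over the columns used.
    """
    total = 0.0
    for i in range(len(values)):
        if all(test(col[i]) for col, test in predicates):
            total += values[i]
    return total

# Two ad hoc questions, neither of which was planned for in a schema.
total_2012 = scan_sum(amount, (year, lambda y: y == 2012))
ma_2012 = scan_sum(
    amount,
    (year, lambda y: y == 2012),
    (state, lambda s: s == "MA"),
)
print(total_2012, ma_2012)
```

The trade-off is that every query touches every row of the columns involved, which is why the approach leans so heavily on memory bandwidth.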
Regarding the performance claims, Mostak and Graham defend calling Map-D the “fastest database in the world.” “While we think that’s a bit of a big claim, we believe we can back that up by showing you that we’ve been working with the world’s fastest technology, namely Nvidia GPUs,” Graham says at the beginning of the webinar.
Says Mostak: “We can easily claim to be the fastest database in the world, because we’re running the most optimized system on the fastest hardware out there, which currently [is] graphics processor units.”
And that performance will only increase with the coming advances in GPU architectures. “We’re working with scientists at MIT and Nvidia to optimize the database, optimize the GPU kernels,” Mostak says. “We have time on our side. The power of GPUs, the memory bandwidth, the parallelism–it’s all getting much, much faster. In fact GPUs are getting more powerful relative to CPUs. In two to three years, I think Map-D will be even better positioned.”
The roadmap calls for further tweaking Map-D to support enterprise SQL functions and to support the database running against datasets in the 100 TB range. The product already sports a JSON API, making it useful for sharing information over the Web. The company is also working on machine learning, neural nets, and SVMs (support vector machines), “which all run really well on GPUs,” Mostak says.