The GPU “Sweet Spot” for Big Data
GPU acceleration is little discussed compared with the hype around in-memory databases and emerging software frameworks, but not for lack of research or use cases.
In fact, according to Sumit Gupta, senior manager in NVIDIA’s Tesla GPU High Performance Computing group, GPUs have definitely extended beyond their traditional base in scientific HPC centers. The Tesla manager shared news about some major enterprise customers who are using GPUs for their large-scale data mining initiatives, and talked briefly about some interesting GPU-boosted social media projects the federal government has been exploring.
New frameworks for tackling the “Three V” elements of big data (volume, variety and velocity), including MapReduce, along with more mature approaches to data mining (R, for example), are not pushing GPU makers into new territory, Gupta suggests.
Taking this “nothing new” approach to the din around big data, Gupta says that what has made GPUs ideal for large-scale analytics in the past hasn’t shifted much. He notes that most data mining applications that rely on classic algorithms for classifying and analyzing data, including SVM (not to mention newer open source projects like Apache Mahout), boast a C kernel, which makes them prime candidates for GPU computing approaches. Putting a real-time spin on these applications requires complex systems and applications, but the base classification methods supporting them (again, like SVM) haven’t changed.
Gupta stresses that GPUs get into the underlying server, and whatever is running on a single node runs faster. This doesn’t change the fundamental approach to solving a problem; it means the algorithm behind the problem can be more complex given the added performance, which he says leads to better accuracy for everything from pattern matching to fine-tuned classification.
While this is a benefit for the end user, there are some kinks that need to be worked out, which Gupta says an internal team dedicated to analytics is solving. These specialists are not only plugging away at end-to-end solutions that create a complete workflow in a large-scale data analytics environment, they’re also trying to address a critical I/O issue. In some cases, Gupta says, “the GPU accelerates a single server so much that better, faster disks and networks are needed, so in essence, it moves the problem.”
Outside of the “over-performance” of the GPU nodes, NVIDIA is watching how the recent popularization of MapReduce is marrying GPU approaches. One can argue that MapReduce has been revolutionary in its ability to pull some complexity out of the developer’s queue by removing the need to try to compute complex problems on a single machine. Instead, MapReduce maps a problem into pieces before bringing it back via a reduction to aggregate the results. It’s not just about the simplification, says Gupta. When coupled with the speedup on GPU nodes, the performance permits far better accuracy with the added algorithmic complexity the system can handle.
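To make the map-then-reduce flow described above concrete, here is a minimal single-process sketch in Python. It is an illustration only, not how Hadoop or any GPU-backed framework is implemented; the word-count task and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: aggregate each key's values into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(chunks):
    # In a real cluster each chunk could be mapped on a different node;
    # here the chunks are simply processed one after another.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data big gpu", "gpu data", "big"]))
```

Because each chunk is mapped independently, the map step is the natural place for per-node acceleration; the reduction only has to combine small partial results.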
The GPU and MapReduce combination has been the subject of some notable research and is finding its way into more popular use in GPU environments, says Gupta. Suleiman Shehu, Azinta Systems founder and an advocate for using GPUs to boost large-scale data mining, agrees that GPUs can revolutionize large-scale data mining by significantly speeding up the processing of data mining algorithms. He points to the K-Means clustering algorithm as a prime example, stating, “the GPU-accelerated version was found to be 200x-400x faster than the popular benchmark program MineBench running on a single core CPU, and 6x-12x faster than a highly optimised CPU-only version running on an 8 core CPU workstation.”
Shehu adds that the volume factor of big data isn’t a hindrance for the GPU. In one example, with a “data set with 1 billion 2-dimensional data points and 1,000 clusters, the GPU-accelerated K-Means algorithm took 26 minutes (using a GTX 280 GPU with 240 cores) whilst the CPU-only version running on a single-core CPU workstation, using MineBench, took close to 6 days (see original paper here)… Substantial additional speed-ups are expected were the tests conducted today on the latest Fermi GPUs with 480 cores and 1 TFLOPS performance.”
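For readers unfamiliar with K-Means, the sketch below shows the basic algorithm on 2-dimensional points in plain Python. It is a CPU-only teaching example, not the GPU implementation Shehu describes; the function names, the random initialization and the fixed iteration cap are all assumptions made for illustration.

```python
import random


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    labels = []
    for px, py in points:
        dists = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        n = len(members)
        new_centroids.append(
            (sum(x for x, _ in members) / n, sum(y for _, y in members) / n)
        )
    return new_centroids


def kmeans(points, k, iterations=20, seed=0):
    """Alternate the two steps until the centroids stop moving or the cap is hit."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    for _ in range(iterations):
        updated = update(points, assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

The assignment step computes each point's distances independently of every other point, which is the kind of work that spreads well across many GPU cores. For scale, 6 days is 8,640 minutes, so the quoted 26-minute run corresponds to a speedup of roughly 330x, consistent with the 200x-400x range cited above.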
High performance algorithms and approaches go beyond MapReduce, however, says Gupta, citing the overall popularity of the open source statistical software environment R. He pointed to the large number of GPU plugins for R, noting that R represents a “sweet spot” for the GPU, not to mention a very sweet spot for enterprise analytics needs.
The big problem with R, as many in the large community around it have noted, is that it comes with no parallelism. As stated in a piece discussing the challenges of R in large-scale environments, the other problem is that most of the R function implementations can only be executed as separate instances on multicore or cluster hardware. With an active user community that Gupta says NVIDIA supports, however, this “sweet spot” is being increasingly exploited with new plugins and capabilities that make R easier to use with a GPU, with all the performance benefits that brings.
According to Gupta, it shouldn’t come as a surprise that statistics-based approaches to data mining like R are good fits for GPU computing. “If you look at statistical analysis, there is a large data set and this is captured in a matrix; really, all they do on that data are operations, math, where they’re multiplying two matrices together.” In essence, this is the same approach that the supercomputing benchmark takes to evaluating the performance of the largest, fastest systems on earth. Linpack, which the HPC community uses to gauge performance, is a benchmark that NVIDIA has been running away with, a fact that bodes well for the R analytics community evaluating GPUs, he says.
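As one illustration of Gupta's point that statistical analysis often reduces to multiplying matrices, the sketch below computes a sample covariance matrix as a matrix product of the centered data with its own transpose. The example is written in plain Python rather than R, and the choice of covariance as the statistic and the helper names are assumptions made for illustration.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product; every output cell is an independent dot product."""
    cols_b = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def covariance(data):
    """Sample covariance of the columns of `data`, expressed as a matrix product."""
    n = len(data)
    means = [sum(col) / n for col in transpose(data)]
    centered = [[v - m for v, m in zip(row, means)] for row in data]
    product = matmul(transpose(centered), centered)
    return [[v / (n - 1) for v in row] for row in product]


if __name__ == "__main__":
    # Three observations of two variables; the second is twice the first.
    print(covariance([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]))
```

Each cell of the product can be computed without reference to any other cell, which is why dense matrix multiplication, the core of Linpack-style workloads, parallelizes so well on GPUs.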
As one can see from the Linpack benchmark ratings (not to mention NVIDIA’s rather remarkable rise in the HPC ranks), performance is the name of NVIDIA’s game. But while HPC is the bread and butter to the Tesla group’s approach, Gupta concurs with other vendors that have strong roots in the high performance computing (supercomputing) community—he too feels the big data hype has the ability to “take HPC mainstream.”
While the lofty world of supercomputing (and until the last couple of years, the GPGPU scene) has often been hidden from mainstream view, the big data buzz and conversations about hyper-acceleration of massive, complex data analytics environments are finding their way into more enterprise settings. “Big data” has popularized the system and software approaches of companies like Walmart and Target, says Gupta.
“What the big data hype has done is helped the industry realize that they can analyze all their data in the same way large companies like Walmart, Target and Amazon do.” He suggests that from business optimization to real-time analysis of social data, the buzz has heaped new attention on problems that the HPC community was solving long before the hype happened. The key to the explosion in interest, however, is that the systematic approaches to mining big data for intelligence and optimization that worked for industry giants have leaked into the everyday, pushing the ecosystem to develop new (or at least tweaked) systems and software that let Joe Business Owner replicate what the big players are doing at the appropriate scale.
While many companies on both the hardware and software side of high performance computing and advanced analytics have climbed aboard the Big Data Express with their outreach efforts, NVIDIA says they’ve been able to settle back and let the ISVs come to them. The company is relying on the solid developer community that leverages GPUs for boosting analytics algorithms to tweak upcoming and existing frameworks to plow through massive data sets faster.
While NVIDIA doesn’t dismiss the importance of the big data buzz, especially among system and chip vendors, Gupta says pushing GPU acceleration is not about hardware in the first place; it’s about solving unique software challenges. Many startups and established software vendors have explored GPUs because of specific problems they’re trying to solve and speed up, he says, pointing to a detailed list of application areas and companies that are looking to GPGPU boosts to tease out peak analytics performance.
“If you look at the list of ISVs that are working with big data and GPUs, you’ll see the attraction of GPUs is wide and deep. Oracle, SAP, and many other smaller companies have built businesses around using GPUs for big data analytics.”
In addition to some key research items for GPUs in big data analytics environments that Gupta cited loosely, he pointed to some interesting use cases. He said that big users are accelerating their big data mining, including the federal government, which hopes to be able to predict future “Arab Spring” events, and an unnamed “major, popular” mobile app company that is handling massive data in near real-time.