September 11, 2012

The GPU "Sweet Spot" for Big Data


Although GPU acceleration is a relatively little-discussed topic compared to the hype around in-memory databases and emerging software frameworks, that is not for lack of research or use cases.

In fact, according to Sumit Gupta, senior manager in NVIDIA’s Tesla GPU High Performance Computing group, GPUs have definitely extended beyond their traditional base in scientific HPC centers. The Tesla manager shared news about some major enterprise customers who are using GPUs for their large-scale data mining initiatives, and talked briefly about some interesting GPU-boosted social media projects the federal government has been exploring.

New frameworks for tackling the "three Vs" of big data (volume, variety, and velocity), including MapReduce, as well as more mature approaches to data mining such as R, are not pushing GPU makers into new territory, Gupta suggests.

At the heart of this "nothing new" response to the din around big data, Gupta says that what has made GPUs ideal for large-scale analytics hasn't shifted much. He notes that most data mining applications that rely on classic algorithms for classifying and analyzing data, including support vector machines (SVMs) and newer open source projects like Apache Mahout, are built on compute-intensive C kernels, which makes them prime candidates for GPU computing. Putting a real-time spin on these applications requires complex systems, but the base classification methods supporting them (again, like SVM) haven't changed.
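To make that concrete, here is a minimal sketch, in Python with NumPy rather than the C kernels Gupta describes, of the computation at the heart of an SVM: evaluating the kernel matrix. The function name and data sizes are illustrative; the point is that the work reduces to dense matrix arithmetic of exactly the kind GPUs accelerate.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.1):
    """Compute the RBF (Gaussian) kernel matrix used by an SVM.

    The pairwise squared distances reduce to dense matrix
    arithmetic, the workload GPUs excel at.
    """
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * (x_i . x_j)
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# 5,000 samples with 64 features: the X @ X.T product alone
# is roughly 3.2 billion floating-point operations.
X = np.random.rand(5_000, 64).astype(np.float32)
K = rbf_kernel_matrix(X)
```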

Gupta stresses that GPUs slot into the underlying server, and whatever runs on a single node simply runs faster. This doesn't change the fundamental approach to solving the problem; rather, the added performance allows the algorithm behind the problem to be more complex, which he says leads to better accuracy for everything from pattern matching to fine-tuned classification.

While this is a benefit for the end user, there are some kinks that need to be worked out, which Gupta says an internal team dedicated to analytics is solving. These specialists are not only plugging away at end-to-end solutions that create a complete workflow in a large-scale data analytics environment, they are also trying to address a critical I/O issue. In some cases, Gupta says, "the GPU accelerates a single server so much that better, faster disks and networks are needed, so in essence, it moves the problem."

Outside of the "over-performance" of the GPU nodes, NVIDIA is watching how the recent popularization of MapReduce pairs with GPU approaches. One can argue that MapReduce has been revolutionary in its ability to pull some complexity out of the developer's queue by removing the need to compute complex problems on a single machine. Instead, MapReduce maps a problem into pieces before bringing the results back together via a reduction step that aggregates them. It's not just about the simplification, says Gupta: when coupled with the speedup on GPU nodes, the performance permits far better accuracy thanks to the added algorithmic complexity the system can handle.
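To make the map/reduce split concrete, here is a toy word-count sketch in Python; the corpus, chunking, and use of a local process pool are all stand-ins for a real framework like Hadoop. Each mapped piece is independent, so each could run on its own (possibly GPU-accelerated) node before the reduction aggregates the partial results.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# Hypothetical corpus, pre-split into chunks the way a
# MapReduce framework would distribute it across nodes.
chunks = [
    "big data meets the gpu",
    "the gpu accelerates big data",
    "data data everywhere",
]

def map_phase(chunk):
    # Map: each worker counts words in its own piece, independently.
    return Counter(chunk.split())

def reduce_phase(a, b):
    # Reduce: merge partial counts into a single aggregate result.
    return a + b

if __name__ == "__main__":
    with Pool() as pool:
        partials = pool.map(map_phase, chunks)
    totals = reduce(reduce_phase, partials)
    print(totals.most_common(3))
```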

The GPU and MapReduce combination has been the subject of some notable research and is finding its way into more popular use in GPU environments, says Gupta. Suleiman Shehu, founder of Azinta Systems and an advocate for using GPUs to boost large-scale data mining, agrees that GPUs can revolutionize large-scale data mining by significantly speeding up the processing of data mining algorithms. He points to the K-Means clustering algorithm as a prime example, stating, "the GPU-accelerated version was found to be 200x-400x faster than the popular benchmark program MineBench running on a single core CPU, and 6x-12x faster than a highly optimised CPU-only version running on an 8 core CPU workstation."

Shehu goes on to observe that the volume factor of big data isn't a hindrance for the GPU, noting that in one example, a "data set with 1 billion 2-dimensional data points and 1,000 clusters, the GPU-accelerated K-Means algorithm took 26 minutes (using a GTX 280 GPU with 240 cores) whilst the CPU-only version running on a single-core CPU workstation, using MineBench, took close to 6 days (see original paper here)…Substantial additional speed-ups are expected were the tests conducted today on the latest Fermi GPUs with 480 cores and 1 TFLOPS performance."
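The work Shehu cites used a CUDA implementation; as a rough illustration of why K-Means gains so much, here is a plain Python/NumPy sketch of the algorithm with illustrative sizes. The hot spot, computing the distance from every point to every center, is dense array math that maps naturally onto thousands of GPU cores.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means; the pairwise distance step dominates the run time."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every center: an (n, k) array
        # built with broadcasting -- the GPU-friendly hot loop.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

points = np.random.rand(100_000, 2).astype(np.float32)
centers, labels = kmeans(points, k=10)
```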

High performance algorithms and approaches go beyond MapReduce, however, says Gupta, pointing to the overall popularity of the open source statistical software environment R. The large number of GPU plugins for R, he notes, makes the language a "sweet spot" for the GPU, not to mention a very sweet spot for enterprise analytics needs.


The big problem with R, as many in the large community around it have noted, is that it comes with no built-in parallelism. As stated in a piece discussing the challenges of R in large-scale environments, the other problem is that most R function implementations can only be executed as separate instances on multicore or cluster hardware. With an active user community that Gupta says NVIDIA supports, however, this "sweet spot" is being increasingly exploited with new plugins and capabilities that make R easier to use with a GPU, with all the attendant performance benefits.
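The plugins Gupta describes generally follow a drop-in pattern: keep the function interface the user already knows and swap in a GPU backend. As a hedged illustration of that pattern, in Python, using CuPy (a NumPy-compatible GPU array library that postdates this article) as a stand-in for R's GPU packages:

```python
import numpy as np

try:
    import cupy as xp  # GPU-backed, NumPy-compatible arrays
    on_gpu = True
except ImportError:
    xp = np            # fall back to the CPU implementation
    on_gpu = False

def correlation_matrix(data):
    # Same call either way; only the backend changes.
    return xp.corrcoef(data.T)

# Illustrative workload: 50,000 observations of 100 variables.
data = xp.asarray(np.random.rand(50_000, 100))
corr = correlation_matrix(data)
print("computed on", "GPU" if on_gpu else "CPU")
```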

According to Gupta, it shouldn't come as a surprise that statistics-based approaches to data mining like R are good fits for GPU computing. "If you look at statistical analysis, there is a large data set and this is captured in a matrix—really, all they do on that data are operations, math, where they're multiplying two matrices together." In essence, this is the same approach the supercomputing community takes to evaluating the performance of the largest, fastest systems on earth. Linpack, the benchmark the HPC community uses to gauge that performance, is one NVIDIA has been running away with, a fact that bodes well for the R analytics community evaluating GPUs, he says.
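Gupta's observation is easy to check: fitting an ordinary least-squares regression, a staple statistical task, reduces almost entirely to matrix products, the same dense linear algebra Linpack stresses. A small sketch in Python/NumPy, with illustrative sizes:

```python
import numpy as np

# Hypothetical design matrix: 100,000 observations, 50 predictors.
n, p = 100_000, 50
X = np.random.rand(n, p)
y = X @ np.random.rand(p) + 0.1 * np.random.randn(n)

# Ordinary least squares via the normal equations.
# Both X.T @ X and X.T @ y are dense matrix products -- the
# Linpack-style workload where GPUs shine.
beta = np.linalg.solve(X.T @ X, X.T @ y)
```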

As one can see from the Linpack benchmark ratings (not to mention NVIDIA's rather remarkable rise in the HPC ranks), performance is the name of NVIDIA's game. But while HPC is the bread and butter of the Tesla group, Gupta concurs with other vendors that have strong roots in the high performance computing (supercomputing) community: he too feels the big data hype has the ability to "take HPC mainstream."

While the lofty world of supercomputing (and, until the last couple of years, the GPGPU scene) has often been hidden from mainstream view, the big data buzz and conversations about hyper-acceleration of massive, complex data analytics environments are finding their way into more enterprise settings. "Big data" has popularized the system and software approaches of companies like Walmart and Target, says Gupta.

"What the big data hype has done is helped the industry realize that they can analyze all their data" in the same way large companies like Walmart, Target, and Amazon do. He suggests that from business optimization to real-time analysis of social data, the buzz has heaped new attention on problems the HPC community was solving long before the hype happened. The key to the explosion in interest, however, is that the systematic approaches to mining big data for intelligence and optimization that worked for the industry giants have leaked into the everyday, pushing the ecosystem to develop new (or at least tweaked) systems and software that let Joe Business Owner replicate what the big players are doing at the appropriate scale.

While many companies on both the hardware and software sides of high performance computing and advanced analytics have climbed aboard the Big Data Express with their outreach efforts, NVIDIA says it has been able to settle back and let the ISVs come to it. The company is relying on the solid developer community that leverages GPUs for boosting analytics algorithms to tweak upcoming and existing frameworks to plow through massive data sets faster.

While he doesn't dismiss the importance of the big data buzz, especially among system and chip vendors, Gupta says pushing GPU acceleration is not about hardware in the first place; it's about solving unique software challenges. Many startups and established software vendors have explored GPUs because of specific problems they're trying to solve and speed up, he says, pointing to a detailed list of application areas and companies that are looking to GPGPU boosts to tease out peak analytics performance.

“If you look at the list of ISVs that are working with big data and GPUs, you’ll see the attraction of GPUs is wide and deep. Oracle, SAP, and many other smaller companies have built businesses around using GPUs for big data analytics.”

In addition to some key research on GPUs in big data analytics environments that Gupta cited loosely, he pointed to some interesting use cases. He said that big users are accelerating their data mining, including the federal government, which hopes to be able to predict future "Arab Spring" events, as well as an unnamed "major, popular" mobile app company that is handling massive data in near real-time.
