Rethinking Hadoop for HPC
Hadoop’s momentum has caught the eye of the high performance computing (HPC) community, which wants to participate in, and benefit from, the fast pace of development. However, the relatively poor performance and high latency of Hadoop applications are a real concern. To address the problem and make Hadoop a better fit for HPC resources, some are exploring how to rewrite certain components of Hadoop in a more HPC-like manner.
Those in the HPC world look at what’s happening in Hadoop and other big data frameworks with a combination of envy and fear. They’re envious because they, too, want to ride the great wave of innovation that is occurring within the Hadoop and big data communities, and partake of the bounty of frameworks that the open source community has developed. HPC programmers who spent decades doing “big data analytics” in academic settings before “big data analytics” became a mainstream thing must feel a certain degree of validation and awe at the incredible momentum Hadoop has garnered.
But HPC people are also fearful of Hadoop and its ilk. They recoil at the overhead built into the software to get Hadoop to run on commodity hardware. They view the commercial Java code with suspicion. Hadoop, after all, came from the commercial sector, whereas the HPC codes they run on supercomputers were written by scientists, for scientists.
“Hadoop looks funny,” writes Glenn Lockwood of the San Diego Supercomputer Center. Lockwood, who runs the MyHadoop service at SDSC, last year came up with a list of reasons why Hadoop is so “weird” to HPC people. The use of TCP, REST, and RPC for inter-processor communication is slow, multi-tenancy is a joke, the schedulers are terrible, and HDFS is “very slow and very obtuse,” Lockwood says.
Lockwood’s views are widespread in the HPC world. But Lockwood and his contemporaries in the field also realize that Hadoop does some things right. While there are obvious differences between the parallel worlds of HPC and big data analytics, finding ways they can co-exist and benefit from each other is the greater goal.
Bridging Hadoop and HPC
One of the groups looking to make Hadoop more palatable to the HPC crowd is the Advanced Information Systems division of General Dynamics, the giant defense contractor based in Virginia. The company recently completed the first round of testing for a methodology that involves ripping out the pieces of code that make big data analytics frameworks slow on distributed commodity clusters, and rewriting them so they can run much faster on parallel HPC systems.
According to William Leinberger, a senior principal engineer and data scientist with General Dynamics AIS, the company was able to boost the performance of the Apache Hama graph analytics engine by a factor of 10 by replacing three Java classes having to do with barrier synchronization, data transfer, and some initialization.
“Our view from HPC is that the application performance should achieve what the hardware is capable of producing,” Leinberger tells Datanami. “The big data people build frameworks like Apache Hama to make it easy for non computer-science people to write analytics against large data sets. But in doing so they’re assuming the underlying system is a commodity pile of hardware that’s prone to failure so they inject in a lot of software overhead.”
The inter-processor communication used in Apache Hama is based on Hadoop’s and involves a lot of “chatter” going back and forth to make sure messages are delivered reliably. In HPC, by contrast, the hardware itself makes that communication reliable. “We removed 1 percent of the code base and wrote it the way you would do in the high performance computing field and got some performance gains,” he says.
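Hama follows the Bulk Synchronous Parallel (BSP) model: workers compute locally, exchange messages, and then all wait at a barrier before any worker starts the next superstep. The sketch below is purely illustrative and not drawn from Hama or the GD AIS code; it uses Python threads to stand in for distributed workers and shows the guarantee a barrier provides, which Hama coordinates across machines and the GD AIS rewrite implements with MPI.

```python
import threading

# Illustrative BSP supersteps: every worker must reach the barrier before
# any worker advances. Threads here stand in for distributed workers.
NUM_WORKERS = 4
NUM_SUPERSTEPS = 3

barrier = threading.Barrier(NUM_WORKERS)
log = []                      # (worker_id, superstep) completion records
log_lock = threading.Lock()

def worker(worker_id: int) -> None:
    for superstep in range(NUM_SUPERSTEPS):
        # ... local compute and message exchange would happen here ...
        with log_lock:
            log.append((worker_id, superstep))
        barrier.wait()        # no one advances until everyone arrives

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because of the barrier, the log is grouped by superstep: every
# superstep-0 entry precedes every superstep-1 entry, and so on.
steps = [s for (_, s) in log]
assert steps == sorted(steps)
```

The performance question the article raises is entirely about how that `barrier.wait()` line is realized: acknowledged TCP round trips on commodity clusters, versus a hardware-assisted collective such as MPI’s `MPI_Barrier` on an HPC interconnect.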
The code GD AIS rewrote in MPI came from Hadoop. “It’s used in MapReduce and other frameworks that the big data open source community put out,” Leinberger says. “So the methodology we have, the approach we have, is applicable to a larger range of big data problems…The bigger project is how do we get rid of the software overhead of these frameworks.”
Parallel Big Data
General Dynamics AIS serves some of the largest government customers and is a key partner in helping the U.S. intelligence community get the most out of its HPC resources and supercomputers. Much of the work is classified, obviously, but Leinberger shared some general observations about how those groups go about solving problems.
The intelligence community faces many of the same challenges as enterprise organizations. Namely, its analysts want to use the latest software to do big data analytics, and they want to do so without unnecessarily burdening IT resources. The time it takes to generate an insight is also critical–perhaps even more so when it bears on the intelligence community’s ability to complete a mission.
“They have the same problem that big data people are trying to solve,” Leinberger says. “The users are not hardcore HPC programmers. They’re mathematicians and analysts and other types. They don’t want to sit down and spend a year writing code to solve one little problem. They want to spend a few hours getting something up and running to see if it’s something they want to do, and later it gets hardened.”
Iteration is key to GD AIS’s clients. They want the ability to try something new, fail, and then move on to something else. In Leinberger’s view, combining the capabilities of big data analytics frameworks with the speed of HPC resources will make them more productive.
Big Data’s Path to HPC Productivity
Leinberger’s group started out with Apache Hama, but they expect to move on to other big data analytics frameworks, perhaps Apache Spark or Apache Accumulo. Ensuring the highest levels of performance when running these frameworks on HPC resources will help lower one of the barriers to entry.
But there are other challenges to productivity that will not be solved by simply re-writing Hadoop’s communication system in MPI, supporting GPFS in Hadoop (as IBM is doing), or implementing an Infiniband communication layer in Hadoop, which Professor Dhableswar K. (DK) Panda at Ohio State is having success with.
Those are good starts, but they don’t go far enough to address another set of problems, which has to do with co-mixing big data analytic deployments on HPC resources. “HPC people don’t want…to stand up and dedicate a big data platform and leave it there,” Leinberger says. “They don’t want to stand up MapReduce on their big Cray and not be able to use the Cray for anything else.”
The HPC world is moving toward the concept of dynamic deployments, which will enable big data analytic frameworks to be quickly configured on a supercomputer, run, and then torn down to make way for something else. This model will not be confined to a single supercomputer, however; it will involve multiple systems using multiple frameworks to get the answer.
“That’s the way we think of this problem for our customers,” Leinberger says. “They want to stand up this analytic pipeline, where this framework is going to run the first part of the [workload], then the next step computationally will be on a different framework, a different machine. …The smart data movement between those frameworks is a big part of it. It’s not just a dynamic deployment of framework on one system, but how do you do that on the enterprise level and get them all coordinated, especially with the data movement and data staging. That’s where we’re going this year.”
Big data analytic frameworks like Hadoop have evolved along different paths than traditional HPC technologies up to this point. But from here on out, it would appear that they will be much more closely aligned, and that’s good for any users of these technologies, in government and in the commercial enterprise.