Making Hadoop Relevant to HPC
Despite its proven ability to affordably process large amounts of data, Apache Hadoop and its MapReduce framework are being taken seriously only at a subset of U.S. supercomputing facilities and only by a subset of professionals within the HPC community.
That in a nutshell is the contention of Glenn K. Lockwood of the San Diego Supercomputer Center, who wonders why Hadoop remains at the fringe of high-performance computing while looking for ways to move it into the HPC mainstream.
“Hadoop is in a very weird place within HPC,” Lockwood asserts. One reason, he said in a recent blog post, is that “unlike virtually every other technology that has found successful adoption within research computing, Hadoop was not designed by HPC” users.
Other observers like HPCWire Editor-in-Chief Nicole Hemsoth are more sanguine about Hadoop’s role in the data-intensive segments of scientific computing applications. Hemsoth recently noted that MapReduce is slowly being adapted to an HPC environment and that potential data-intensive applications for Hadoop include deploying it across different parallel file systems and for handling scheduling.
Still, Lockwood–who attended Datanami’s inaugural Leverage Big Data event last week and participated in a panel discussion titled Hadoop: Real-World Use Versus Hype,”–warns that it continues to suffer from a not-invented-here bias. Hadoop, developed by Yahoo, and MapReduce, which was developed by Google, were created as services. Hence, Lockwood notes, “Hadoop is very much an interloper in the world of supercomputing.”
There are other barriers to HPC adoption of Hadoop, Lockwood insists.
Among them is the fact that it is written in Java. This open-source approach to programming runs counter to the foundations of HPC that prefer optimizing code for specific hardware. However, specific research fields like genome sequencing could give a boost to cheap, open-source approaches to handling huge data sets.
Another unresolved challenge, Lockwood argues, is that Hadoop “reinvents a lot of functionality that has existed in HPC for decades, and it does so very poorly.” For example, he said a single Hadoop cluster could support only three concurrent jobs simultaneously. Beyond that, performance suffers.
Moreover, Lockwood maintains that Hadoop does not support scalable network topologies like multidimensional meshes. Add to that, the Hadoop Distributed File System (HDFS) “is very slow and very obtuse” when compared with common HPC parallel file systems like Lustre and the General Parallel File System.
Lockwood also asserts that Hadoop’s evolution has been “backwards” in that it “entered HPC as a solution to a problem which, by and large, did not yet exist.” The result, he adds, is “a graveyard of software, documentation, and ideas that are frozen in time and rapidly losing relevance as Hadoop moves on.”
Despite a growing list of challenges, Lockwood said a number of steps must be taken to transform Hadoop into a core HPC technology. Among them are re-implementing MapReduce in an HPC-friendly way. Another would be reversing Hadoop’s evolutionary path by incorporating into it HPC technologies.
“This will allow HPC to continuously fold in new innovations being developed in Hadoop’s traditional competencies–data warehousing and analytics–as they become relevant to scientific problems,” Lockwood concludes.