May 27, 2014

Making Hadoop Relevant to HPC

George Leopold

Despite their proven ability to process large amounts of data affordably, Apache Hadoop and its MapReduce framework are taken seriously at only a subset of U.S. supercomputing facilities, and by only a subset of professionals within the HPC community.

That, in a nutshell, is the contention of Glenn K. Lockwood of the San Diego Supercomputer Center, who wonders why Hadoop remains at the fringe of high-performance computing and is looking for ways to move it into the HPC mainstream.

“Hadoop is in a very weird place within HPC,” Lockwood asserts. One reason, he said in a recent blog post, is that “unlike virtually every other technology that has found successful adoption within research computing, Hadoop was not designed by HPC” users.

Other observers, like HPCwire Editor-in-Chief Nicole Hemsoth, are more sanguine about Hadoop’s role in the data-intensive segments of scientific computing. Hemsoth recently noted that MapReduce is slowly being adapted to HPC environments and that potential data-intensive applications for Hadoop include deploying it across different parallel file systems and using it to handle scheduling.

Still, Lockwood, who attended Datanami’s inaugural Leverage Big Data event last week and participated in a panel discussion titled “Hadoop: Real-World Use Versus Hype,” warns that Hadoop continues to suffer from a not-invented-here bias. Hadoop, developed by Yahoo, and MapReduce, developed by Google, were created as services. Hence, Lockwood notes, “Hadoop is very much an interloper in the world of supercomputing.”

There are other barriers to HPC adoption of Hadoop, Lockwood insists.

Among them is the fact that Hadoop is written in Java. Java’s portable, hardware-independent design runs counter to the HPC practice of optimizing code for specific hardware. However, specific research fields like genome sequencing could give a boost to cheap, open-source approaches to handling huge data sets.

Another unresolved challenge, Lockwood argues, is that Hadoop “reinvents a lot of functionality that has existed in HPC for decades, and it does so very poorly.” For example, he said a single Hadoop cluster can support only three concurrent jobs. Beyond that, performance suffers.

Moreover, Lockwood maintains that Hadoop does not support scalable network topologies like multidimensional meshes. In addition, the Hadoop Distributed File System (HDFS) “is very slow and very obtuse” compared with common HPC parallel file systems like Lustre and the General Parallel File System (GPFS).

Lockwood also asserts that Hadoop’s evolution has been “backwards” in that it “entered HPC as a solution to a problem which, by and large, did not yet exist.” The result, he adds, is “a graveyard of software, documentation, and ideas that are frozen in time and rapidly losing relevance as Hadoop moves on.”

Despite this list of challenges, Lockwood outlined a number of steps that must be taken to transform Hadoop into a core HPC technology. One is re-implementing MapReduce in an HPC-friendly way. Another is reversing Hadoop’s evolutionary path by incorporating HPC technologies into it.

“This will allow HPC to continuously fold in new innovations being developed in Hadoop’s traditional competencies–data warehousing and analytics–as they become relevant to scientific problems,” Lockwood concludes.
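For readers less familiar with the programming model Lockwood wants to see re-implemented, the map, shuffle and reduce phases can be sketched in a few lines of plain Python. This is a single-process illustration of the general pattern only. It is not Hadoop's API, nor any HPC implementation Lockwood describes, and all function names here are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over the input records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["Hadoop meets HPC", "HPC meets MapReduce"])
print(counts)
```

In a real framework the map and reduce calls would run in parallel across many nodes, and the shuffle would move data over the network. How that data movement is implemented is where an HPC-oriented design would most plausibly differ.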


