September 05, 2012

MapReduce Makes Further Inroads in Academia


Most conversations about Hadoop and MapReduce tend to filter in from enterprise quarters, but if the recent uptick in scholarly articles extolling their benefits for scientific and technical computing applications is any indication, the research world might have found its next open source darling.

Of course, it’s not just about making use of the approach. For many researchers, it’s about expanding, refining and tweaking the tool to make it suitable for a new, heavy-hitting class of applications. As a result, research to improve MapReduce’s functionality and efficiency is flourishing, which could eventually provide some useful trickle-down technology for business users as well.

As one case among an increasing number, researchers Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil are working to extend the capabilities of MapReduce. Their work tackles one of the more complex issues in running MapReduce on high performance computing hardware: job scheduling.

The team recently proposed a new algorithm, called MapReduce Job Adaptor, that would enhance MapReduce’s work rate and job scheduling. Neves et al presented their algorithm in a recent paper.

According to Neves et al, MapReduce sets the tone for big data analysis. “MapReduce,” the paper notes, “has become a de facto standard for large-scale data analysis. Moreover, it has also attracted the attention of the HPC community due to its simplicity, efficiency and highly scalable parallel model.”

Further, they note that researchers are using MapReduce more frequently because of the success that massive websites such as Facebook and Google have enjoyed with it. “The MapReduce model is in increasing adoption by several researchers, including the ones that used to rely on HPC solutions. Much of this enthusiasm is due to the highly visible cases where MR has been successfully used by companies like Google, Yahoo, and Facebook.”

That said, Neves et al note that several efficiency problems with MapReduce remain to be worked out. They believe existing HPC clusters can be put to use executing MapReduce jobs, but the two worlds submit work differently. “Users and computing laboratory administrators may benefit from using already existing HPC clusters to execute MapReduce jobs. While MapReduce implementations provide a straightforward job submission process which involves the whole cluster, HPC users submit their jobs to a Resource Management System and need to specify the number of nodes and amount of time that should be allocated [to] complete the job execution.”

The adaptor works to translate MapReduce jobs into a form that can be read and executed in an HPC cluster, utilizing the cluster’s Resource Management System. As the paper explains, “Instead of always using the maximum amount of nodes and time to execute the MR job, the adaptor allocates a cluster partition which minimizes the turnaround time of the job. It does that by interacting with the RMS to get free areas (slots) in the job requests queue. Using a profile of the MR job, it estimates the job completion time for each free slot and selects the one that yields the minimum turnaround time.”
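The selection step described in that passage can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the `Slot` and `JobProfile` types, the `tasks_per_node` parameter, and the wave-based runtime estimate are all assumptions made for the example, and the paper’s actual profile and estimation model may differ.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Slot:
    """A free area in the RMS queue (hypothetical representation)."""
    start: float     # minutes from now until the slot becomes available
    nodes: int       # number of free nodes in the slot
    duration: float  # minutes the nodes stay free


@dataclass(frozen=True)
class JobProfile:
    """A simple MR job profile (hypothetical fields)."""
    map_tasks: int
    reduce_tasks: int
    map_minutes: float     # average time of one map task
    reduce_minutes: float  # average time of one reduce task


def estimate_runtime(profile: JobProfile, nodes: int, tasks_per_node: int = 2) -> float:
    """Assumed model: tasks run in waves, one wave per full set of task slots."""
    capacity = nodes * tasks_per_node
    map_waves = ceil(profile.map_tasks / capacity)
    reduce_waves = ceil(profile.reduce_tasks / capacity)
    return map_waves * profile.map_minutes + reduce_waves * profile.reduce_minutes


def pick_slot(profile: JobProfile, slots: List[Slot]) -> Optional[Tuple[Slot, float]]:
    """Return the slot with the minimum turnaround (wait + estimated runtime)."""
    best: Optional[Tuple[Slot, float]] = None
    for slot in slots:
        runtime = estimate_runtime(profile, slot.nodes)
        if runtime > slot.duration:
            continue  # the job would not finish inside this free area
        turnaround = slot.start + runtime
        if best is None or turnaround < best[1]:
            best = (slot, turnaround)
    return best


profile = JobProfile(map_tasks=100, reduce_tasks=10, map_minutes=1.0, reduce_minutes=2.0)
slots = [
    Slot(start=0, nodes=2, duration=1000),  # available now, but small
    Slot(start=10, nodes=10, duration=60),  # short wait, more nodes
    Slot(start=0, nodes=50, duration=2),    # large, but too short for the job
]
chosen, turnaround = pick_slot(profile, slots)
print(chosen, turnaround)
```

In this toy input, waiting ten minutes for the ten-node slot gives a shorter turnaround than starting immediately on two nodes, which is the trade-off the adaptor is meant to weigh on the user’s behalf.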

One benefit is that the job time estimate required to submit jobs to the RMS shifts from the user to the automated adaptor. The idea here is that humans are often poor at estimating job times. The adaptor may not be perfect either, but because it works from a profile of the job, its estimates could improve in a more systematic, cluster-friendly fashion.

Neves et al tested their algorithm, with promising results. Using data obtained from Facebook, they created nine bins representing the types of MapReduce jobs Facebook performs. The bins range from small but high-frequency jobs (bin one contains jobs that require only one map task, yet those jobs make up 39% of Facebook’s total) to less common, huge jobs (an estimated 2,400 map tasks, accounting for 3% of the total).

According to the paper, the Job Adaptor improved performance for each bin, sometimes dramatically. While what they called the “naïve” algorithm took about 500 minutes to execute jobs in bins 1-5 (anywhere from one to a hundred map tasks coupled with the occasional reduce task), turnaround with the adaptor ranged from one to two minutes for the first couple of bins to about 300 minutes for the fifth. Bin eight, which held jobs that required on average 800 map tasks and 180 reduce tasks, showed the narrowest gap, at only about 50 minutes (roughly 1,800 minutes versus 1,750), but the adaptor still reduced turnaround.

It will be interesting to see whether this newly published algorithm gains a foothold in the HPC community. Either way, an algorithm that increases MapReduce’s efficiency is never a bad thing, and further research into the core functionality of MapReduce for scientific applications could trickle down to the enterprise over time to improve the speed and clarity of decisions.
