October 01, 2012

Researchers Target Storage, MapReduce Interactions


Improving Hadoop’s efficiency is important to its continued growth. With that in mind, researchers from the University of Illinois at Urbana-Champaign and Yahoo studied how MapReduce jobs access files in two Hadoop clusters.

The researchers say that, to their knowledge, this is the first examination of how MapReduce workloads interact with the storage layer.

To get the lay of the storage land, they monitored two Hadoop clusters over six months: the 4100+ node PROD cluster, whose jobs run at regular intervals (daily, weekly, etc.), and the 1900+ node R&D cluster, which is used for research and development, including testing jobs for future use on PROD.

Determining file popularity, according to the study, is remarkably difficult. The number of files changes constantly as files are added and deleted. As a result, dividing the accesses to a particular file or group of files by the number of files in the namespace produces numbers of little value, because the denominator keeps shifting.

Instead, they flipped the question, asking how many files are rarely accessed. According to the report, “80 − 90% of the files are accessed no more than 10 times during the full 6-month period.” So many files are unpopular because a large number of files are accessed and then deleted; the report terms these ‘short-lived files’.
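To make the flipped question concrete, here is a minimal sketch of how such a figure could be computed from an access trace. It is not the researchers' method; the function name, the input shapes, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def fraction_rarely_accessed(all_files, accessed_paths, max_accesses=10):
    """Return the fraction of files accessed at most `max_accesses` times.

    all_files      -- iterable of every file path seen in the period (hypothetical input)
    accessed_paths -- iterable with one entry per access event (hypothetical input)
    """
    counts = Counter(accessed_paths)
    # Include files that appear only in the access trace.
    files = set(all_files) | set(counts)
    if not files:
        return 0.0
    # Counter returns 0 for files that were never accessed.
    rare = sum(1 for f in files if counts[f] <= max_accesses)
    return rare / len(files)

# Toy data, invented for this example: one hot file, four cold ones.
all_files = ["/a", "/b", "/c", "/d", "/e"]
accesses = ["/a"] * 25 + ["/b"] * 3 + ["/c"]
print(f"{fraction_rarely_accessed(all_files, accesses):.0%}")  # prints 80%
```

Because the count is taken over every file that existed at some point in the period, the result does not depend on the size of the namespace at any single moment.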

The report also found that the vast majority of accesses targeted relatively young files. “What percentage of accesses target files that are at most one week old? The answer, is surprisingly close for both clusters: 90.31% (PROD) and 86.87% (R&D). To provide some perspective, a media server study found that the first five weeks of a file’s existence account for 70 − 80% of their accesses.”
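The age-at-access question can be sketched the same way. The snippet below is an illustration under stated assumptions, not the report's analysis code: it assumes a trace of (path, access time) pairs and a hypothetical mapping from each path to its creation time.

```python
from datetime import datetime, timedelta

def share_of_young_accesses(accesses, created_at, max_age=timedelta(weeks=1)):
    """Return the fraction of accesses that hit files no older than `max_age`.

    accesses   -- iterable of (path, access_time) pairs (hypothetical input)
    created_at -- dict mapping path to creation time (hypothetical input)
    """
    total = young = 0
    for path, when in accesses:
        if path not in created_at:
            continue  # skip files whose creation time is unknown
        total += 1
        if when - created_at[path] <= max_age:
            young += 1
    return young / total if total else 0.0

# Toy data, invented for this example.
created = {"/x": datetime(2012, 1, 1), "/y": datetime(2012, 1, 1)}
trace = [
    ("/x", datetime(2012, 1, 2)),   # 1 day old
    ("/x", datetime(2012, 1, 3)),   # 2 days old
    ("/x", datetime(2012, 1, 5)),   # 4 days old
    ("/y", datetime(2012, 1, 20)),  # 19 days old
]
print(share_of_young_accesses(trace, created))  # prints 0.75
```

Here the unit of measurement is the access rather than the file, which matches how the quoted question is phrased.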

The closeness is surprising because the R&D cluster might be expected to run its test jobs on older, less relevant data.

A remarkable share of jobs (29-30%) used files that were less than two minutes old. However, the report suspects that this is related to MapReduce job duration. “During the same 6-month period, 34.75% − 57.46% (PROD and R&D) of the successful jobs had a total running time of 1 minute or less (including the time waiting on the scheduler queue).”

Either way, this heavy reliance on young files suggests that future designs should prioritize fast access to newer files. “Their high file churn and skewed access towards young files, among others, should be further studied and modeled to enable designers of next generation file systems to optimize their designs to best meet the requirements of these emerging workloads.”

