October 1, 2012

Researchers Target Storage, MapReduce Interactions

Ian Armas Foster

Increasing Hadoop’s efficiency is an important part of sustaining its growth. To that end, researchers from the University of Illinois at Urbana-Champaign and Yahoo conducted a unique study of how MapReduce accesses files across Hadoop clusters.

The team behind the research says that, to their knowledge, this effort represents the first examination of how MapReduce workloads interact with the storage layer.

To get the lay of the storage land, they monitored two Hadoop clusters over six months: the 4100+ node PROD cluster, whose production jobs run at regular intervals (daily, weekly, etc.), and the 1900+ node R&D cluster, which is devoted to research and development: testing jobs before they are deployed on PROD.

Determining popularity, according to the study, is remarkably difficult. The number of files changes constantly as files are added and deleted, so dividing a file’s (or a group of files’) access count by the number of files in the namespace produces figures of little statistical value.

Instead, they flipped the question, asking how many files are rarely queried. According to the report, “80–90% of the files are accessed no more than 10 times during the full 6-month period.” So many files are unpopular largely because a great many are accessed a few times and then deleted, termed ‘short-lived files’ by the report.
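To make that flipped analysis concrete, here is a minimal sketch (not from the study itself) that tallies per-file open counts from HDFS-audit-style log lines and reports the fraction of accessed files opened no more than ten times. The log format, field names, and paths are illustrative assumptions; files that are never accessed at all would additionally need a namespace listing to be counted.

```python
# A minimal sketch of the "flipped" popularity analysis: rather than
# normalizing access counts by a constantly changing namespace size,
# count how many files are rarely opened at all. The log lines below
# mimic HDFS audit-log entries and are purely illustrative.
from collections import Counter

audit_lines = [
    "2012-03-01 12:00:01 cmd=open src=/data/logs/part-0001",
    "2012-03-01 12:00:05 cmd=open src=/data/logs/part-0001",
    "2012-03-01 12:01:10 cmd=open src=/tmp/job42/intermediate",
    "2012-03-01 12:02:00 cmd=delete src=/tmp/job42/intermediate",
]

opens = Counter()
for line in audit_lines:
    # Fields after the timestamp look like "cmd=open", "src=/path".
    fields = dict(f.split("=", 1) for f in line.split()[2:])
    if fields.get("cmd") == "open":
        opens[fields["src"]] += 1

THRESHOLD = 10  # the report's cutoff for "rarely accessed"
rare = sum(1 for count in opens.values() if count <= THRESHOLD)
print(f"{rare / len(opens):.0%} of accessed files were opened "
      f"no more than {THRESHOLD} times")
```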

The report also found that the vast majority of accesses targeted relatively young files. “What percentage of accesses target files that are at most one week old? The answer is surprisingly close for both clusters: 90.31% (PROD) and 86.87% (R&D). To provide some perspective, a media server study found that the first five weeks of a file’s existence account for 70–80% of their accesses.”

That closeness is surprising because the R&D cluster might be expected to run its test jobs on older, less relevant data.
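A similar back-of-the-envelope sketch, again hypothetical rather than the study’s own code, shows how age at access can be measured: record each file’s creation time, then compute the file’s age at every subsequent open and take the fraction under one week.

```python
# A sketch of the age-at-access measurement: for each read, how old was
# the file when it was opened? The timestamps, paths, and simple event
# format are illustrative; real HDFS audit logs carry more fields.
from datetime import datetime, timedelta

events = [  # (timestamp, command, path) -- hypothetical data
    ("2012-03-01 09:00:00", "create", "/data/daily/2012-03-01"),
    ("2012-03-01 09:30:00", "open",   "/data/daily/2012-03-01"),
    ("2012-03-12 10:00:00", "open",   "/data/daily/2012-03-01"),
]

created = {}  # path -> creation time
ages = []     # file age at each access
for ts, cmd, path in events:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    if cmd == "create":
        created[path] = t
    elif cmd == "open" and path in created:
        ages.append(t - created[path])

week = timedelta(days=7)
young = sum(1 for age in ages if age <= week)
print(f"{young / len(ages):.2%} of accesses hit files at most one week old")
```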

A remarkable share of jobs (29–30%) used files that were less than two minutes old. The report suspects, however, that this has to do with typical MapReduce job durations: “During the same 6-month period, 34.75%–57.46% (PROD and R&D) of the successful jobs had a total running time of 1 minute or less (including the time waiting on the scheduler queue).”
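Note that the quoted statistic measures total running time from submission, so scheduler queue wait counts toward the one-minute cutoff. A small sketch with hypothetical job records illustrates the computation.

```python
# A sketch of the job-duration statistic quoted above. Total running
# time is measured from submission to completion, so time spent waiting
# in the scheduler queue counts toward the one-minute cutoff.
# The job records below are hypothetical.
from datetime import datetime

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

jobs = [  # (submit time, finish time, succeeded?)
    ("2012-03-01 08:00:00", "2012-03-01 08:00:45", True),
    ("2012-03-01 08:05:00", "2012-03-01 08:09:30", True),
    ("2012-03-01 08:10:00", "2012-03-01 08:10:20", False),
]

durations = [(parse(end) - parse(start)).total_seconds()
             for start, end, ok in jobs if ok]
short = sum(1 for d in durations if d <= 60)
print(f"{short / len(durations):.0%} of successful jobs finished "
      f"within 1 minute of submission")
```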

Either way, this heavy reliance on young files suggests that future storage designs should emphasize fast access to newer data. “Their high file churn and skewed access towards young files, among others, should be further studied and modeled to enable designers of next generation file systems to optimize their designs to best meet the requirements of these emerging workloads.”
