August 14, 2012

Cloudera CTO Reflects on Hadoop Underpinnings


“All of this data felt like an ocean and all these technologies only offered a small straw. We don’t want to use a straw, we want to jump in and swim in the ocean,” remarked Cloudera’s Co-Founder and Chief Technology Officer Amr Awadallah.

The Hadoop distribution company’s CTO discussed Cloudera’s beginnings at the Cloud 2012 conference in Honolulu, Hawaii. He tied those beginnings to the challenges of big data, and described both the improvements Hadoop has made over the past several years and the improvements Cloudera itself has made to Hadoop with the recent release of CDH4.

Awadallah likened Hadoop to a simple computer operating system, noting that an operating system serves two purposes: to store files and to run applications on top of those files. According to Awadallah, Hadoop does the same, but on a much larger scale. Further, Hadoop offers what Awadallah calls ‘scalability of humans.’ By that he means that developers who write queries for two or three nodes can run those exact queries on hundreds of nodes.
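The ‘scalability of humans’ idea can be illustrated with a minimal Python sketch. It is not Hadoop code: the function names `map_phase` and `run_job`, and the `num_nodes` parameter, are hypothetical stand-ins, and the "nodes" are simulated in a single process. The point is only that the job logic is written once and the node count is a setting, not a rewrite.

```python
from collections import Counter
from functools import reduce


def map_phase(lines):
    """Count words in one partition (the work a single simulated node does)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_job(lines, num_nodes):
    """Split the input across num_nodes simulated nodes, then merge partial counts."""
    # Round-robin partitioning; nodes beyond the input size get an empty slice.
    partitions = [lines[i::num_nodes] for i in range(num_nodes)]
    partials = [map_phase(p) for p in partitions]
    # The merge step is the same no matter how many partial results arrive.
    return reduce(lambda a, b: a + b, partials, Counter())


data = ["big data big cluster", "data in the ocean", "swim in the ocean"]
small = run_job(data, num_nodes=2)    # developer-sized run
large = run_job(data, num_nodes=100)  # same job, far more simulated nodes
print(small == large)
```

The developer touches only `map_phase`; how the input is partitioned and merged is the framework's concern, which is why the same query can move from a couple of nodes to hundreds unchanged.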

According to Awadallah, there were four ‘headaches’ they were looking to alleviate: rapidly growing data, the many forms data takes, data’s dynamic schemas, and the increased cultural need to “leverage data as an asset.” When he looked at the high performance computing landscape in 2006 while working for Yahoo, he realized that none of its offerings fit his needs. He thought the answers were in the cloud, and that this was the dawning of the “cloud era.”

Alas, “big data,” Awadallah says, “does not lend itself well to external clouds today. It might change in the future but there are too many logistic boundaries.” So the company shifted its focus to the superior Hadoop technology but kept the now-recognizable “Cloudera” moniker.

Hadoop is also very good at coping with system failures. Awadallah says it has “so much smarts in it that predicts and routes against failure of hardware, failure of network, so that only one or two system admins are needed for every thousand nodes.” Combine the ability to run local queries on a system of a thousand nodes with self-servicing and self-diagnosing capabilities that need only one or two people for maintenance, and a large computing cluster does indeed scale to a handful of humans.
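What “routing against failure” might look like can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hadoop's actual mechanism: it assumes each block of data is held on several nodes, and the names `read_block`, `fetch`, and `NodeDown` are hypothetical. A reader simply tries the next copy when a node does not answer, so no administrator has to step in.

```python
class NodeDown(Exception):
    """Raised by the (hypothetical) fetch function when a node is unreachable."""


def read_block(block_id, replica_nodes, fetch):
    """Return (data, skipped_nodes), reading from the first replica that responds."""
    skipped = []
    for node in replica_nodes:
        try:
            return fetch(node, block_id), skipped
        except NodeDown:
            # Route around the failed node and try the next copy.
            skipped.append(node)
    raise NodeDown(f"all replicas of {block_id} unavailable: {skipped}")


# Simulated cluster state: node-1 has failed, the other copies are healthy.
down = {"node-1"}


def fetch(node, block_id):
    if node in down:
        raise NodeDown(node)
    return f"{block_id}@{node}"


data, skipped = read_block("blk_0001", ["node-1", "node-2", "node-3"], fetch)
print(data, skipped)
```

Because the retry happens inside the read path, a single dead machine costs a little latency rather than a support call, which is consistent with the small admin-to-node ratio Awadallah describes.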

Awadallah noted the removal of the NameNode as a single point of failure when discussing the recent improvements his company has made to Hadoop with CDH4. Those improvements include greater scalability (breaking the 4,000-node barrier and reaching up to tens of thousands of nodes), faster performance (up to 100% faster in latency-intensive jobs), and higher availability. Specifically, Awadallah mentioned HBase replication, which will help head off catastrophic data loss in the (less likely) event of a system failure.

While Awadallah noted that challenges remain for virtual infrastructure like Hadoop, namely that it relies on an inefficient centralized structure, he is as optimistic about the future of Cloudera and Hadoop as he is pleased with Hadoop’s progress so far.
