
August 17, 2012

Marching Hadoop to Windows


Bringing Hadoop to Windows and the two-year effort behind Hadoop 2.0 were two of the more exciting topics raised by Hortonworks cofounder and CTO Eric Baldeschwieler in a talk before a panel at the Cloud 2012 Conference in Honolulu.

The panel, which was also attended by Baldeschwieler’s Cloudera counterpart Amr Awadallah, focused on insights into the big data world, a subject Baldeschwieler tackled almost entirely through Hadoop. The eighteen-minute discussion also featured a brief history of Hadoop’s rise to prominence, improvements still to be made to Hadoop, and a few tips for enterprising researchers wishing to contribute to it.

“Bringing Hadoop to Windows,” says Baldeschwieler, “turns out to be a very exciting initiative because there are a huge number of users in Windows operating system.” In particular, Excel is popular among business analysts, who would like to see it integrated with the data stored in Hadoop. That will not be possible until Hadoop is brought to Windows later this year, as Baldeschwieler notes, a move that will also considerably expand Hadoop’s reach.

However, that announcement pales in comparison to the possibilities provided by the impending Hadoop 2.0. “Hadoop 2.0 is a pretty major re-write of Hadoop that’s been in the works for two years. It’s now in usable alpha form…The real focus in Hadoop 2.0 is scale and opening it up for more innovation.” Baldeschwieler notes that Hadoop’s rise has been the result of what he calls “a happy accident,” in which his Yahoo team developed it for a specific use case: classifying, sorting, and indexing each of the URLs within Yahoo’s scope.

Other Yahoo teams then requested use of the Hadoop nodes and found success with them, leading to a much more significant investment from Yahoo. “Yahoo took this (Hadoop) prototype and then built an internal service that now runs on 42,000 computers with roughly 200 petabytes of raw storage involved and it took about 300 person-years of investment and open source software to make this thing work.” From there, folks like Baldeschwieler and Awadallah went off and formed companies such as Hortonworks and Cloudera to build further on Hadoop.

While Hadoop’s rise makes for a fun success story, its status as somewhat of a happy accident has led to some inefficiencies and limitations, such that an entirely new version was necessary to continue its growth. “The existing Hadoop 1.0 base runs on about 4,000 computers whereas the target design is about 10,000 and that takes Moore’s law forward a few years. Our current target computer has about 12 TB of disk, the new one would have 36.”

Hadoop 2.0 is about more than improving scale, however. Baldeschwieler would like to see programmers and data scientists able to work with more than MapReduce, in essence making Hadoop more ‘pluggable.’ He would also like to see new varieties of files introduced to Hadoop through version 2.0.

Making 2.0 more pluggable may also solve another problem businesses are having with Hadoop. Baldeschwieler mentioned that every Fortune 500 company has Hadoop running in some form, but many businesses are slow to make full use of it. Greater pluggability will not help the businesses that hear of Hadoop, want to get into big data, and end up buying several nodes toward that goal without much thought.

It will, however, assist those with competent technology departments that have analytics tools but are unable to integrate them with Hadoop for whatever reason. “We need to make sure that there’s the right APIs for everyone who’s building data products to plug into Hadoop in various ways.”

Finally, someone has to do the research that carries Hadoop into its second version. Baldeschwieler notes that while the Hadoop community welcomes good ideas and contributions, newcomers should first build a reputation in the community by doing interesting research with Hadoop before trying to add to it.
