August 17, 2012

Marching Hadoop to Windows

Datanami Staff

Bringing Hadoop to Windows and the two-year development of Hadoop 2.0 were two of the more exciting developments raised by Hortonworks co-founder and CTO Eric Baldeschwieler during a panel at the Cloud 2012 Conference in Honolulu.

The panel, which also featured Baldeschwieler’s Cloudera counterpart, Amr Awadallah, focused on insights into the big data world, a subject Baldeschwieler tackled almost entirely through Hadoop. The eighteen-minute discussion also included a brief history of Hadoop’s rise to prominence, improvements still to be made, and a few tips for enterprising researchers wishing to contribute to the project.

“Bringing Hadoop to Windows,” says Baldeschwieler, “turns out to be a very exciting initiative because there are a huge number of users in Windows operating system.” In particular, the Excel spreadsheet program is popular among business analysts, who would like to see it integrated with data stored in Hadoop. As Baldeschwieler notes, that will not be possible until Hadoop is ported to Windows later this year, a move that will also considerably expand Hadoop’s reach.

However, that announcement pales in comparison to the possibilities opened up by the impending Hadoop 2.0. “Hadoop 2.0 is a pretty major re-write of Hadoop that’s been in the works for two years. It’s now in usable alpha form…The real focus in Hadoop 2.0 is scale and opening it up for more innovation.” Baldeschwieler notes that Hadoop’s rise has been the result of what he calls “a happy accident,” in which it was originally developed by his Yahoo team for a specific use case: classifying, sorting, and indexing each of the URLs under Yahoo’s scope.
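That original use case maps naturally onto the MapReduce model that Hadoop popularized. As a rough illustration of the pattern only (not Yahoo’s actual pipeline, and with made-up sample URLs), a minimal in-memory sketch of the map, shuffle, and reduce phases:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample crawl data; the real job ran over a distributed filesystem.
urls = [
    "http://news.example.com/a",
    "http://news.example.com/b",
    "http://shop.example.org/item1",
]

def map_phase(url):
    # Classify each URL by emitting a (host, url) key-value pair.
    yield urlparse(url).netloc, url

def shuffle(pairs):
    # Group values by key, as the MapReduce framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Produce one sorted index entry per host.
    return key, sorted(values)

pairs = (pair for url in urls for pair in map_phase(url))
index = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# index now maps each host to its sorted list of URLs
```

In a real Hadoop job the map and reduce functions run in parallel across many machines, with the framework handling the shuffle; the logic per record is the same.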

What ended up happening was that other Yahoo teams requested use of the Hadoop nodes and found success with it, leading to a much more significant investment from Yahoo. “Yahoo took this (Hadoop) prototype and then built an internal service that now runs on 42,000 computers with roughly 200 petabytes of raw storage involved and it took about 300 person-years of investment and open source software to make this thing work.” From there, people like Baldeschwieler and Awadallah went on to found companies like Hortonworks and Cloudera to build further on Hadoop.

While Hadoop’s rise makes for a fun success story, its status as somewhat of a happy accident has led to some inefficiencies and limitations, such that an entirely new version was necessary to continue its growth. “The existing Hadoop 1.0 base runs on about 4,000 computers whereas the target design is about 10,000 and that takes Moore’s law forward a few years. Our current target computer has about 12 TB of disk, the new one would have 36.”
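Taken together, those quoted targets imply a large jump in raw capacity per cluster. A quick back-of-envelope check (raw disk only, ignoring replication and overhead):

```python
# Capacity implied by the quoted Hadoop 1.0 and 2.0 design targets.
current_tb = 4_000 * 12    # 4,000 nodes x 12 TB each = 48,000 TB (48 PB)
target_tb = 10_000 * 36    # 10,000 nodes x 36 TB each = 360,000 TB (360 PB)
growth = target_tb / current_tb  # 7.5x raw capacity per cluster
```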

Hadoop 2.0 is about more than improving scale, however. Baldeschwieler would like programmers and data scientists to be able to work with more than MapReduce, in essence making Hadoop more ‘pluggable.’ He would also like to see new varieties of files introduced to Hadoop through version 2.0.
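In Hadoop 2.0, that pluggability comes from separating cluster resource management from the computation model (the work that became YARN), so that engines other than MapReduce can share the same cluster. The shape of the idea can be sketched as follows; this is a purely illustrative design sketch with hypothetical names, not Hadoop’s actual API:

```python
# Illustrative sketch: multiple execution engines plugging into one
# shared resource manager. All class and method names are hypothetical.
class ResourceManager:
    def __init__(self, slots):
        self.slots = slots  # total compute slots available in the cluster

    def run(self, engine, data):
        # Any engine implementing execute(data, slots) can plug in.
        return engine.execute(data, self.slots)

class MapReduceEngine:
    def execute(self, data, slots):
        return f"MapReduce over {len(data)} records on {slots} slots"

class GraphEngine:
    def execute(self, data, slots):
        return f"Graph processing over {len(data)} records on {slots} slots"

rm = ResourceManager(slots=100)
mr_result = rm.run(MapReduceEngine(), range(10))
graph_result = rm.run(GraphEngine(), range(10))
```

The point of the design is that the resource manager never needs to know what kind of computation it is scheduling, which is what lets new processing models be added without rewriting the platform.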

Making 2.0 more pluggable may also address another problem businesses are having with Hadoop. Baldeschwieler mentioned that every Fortune 500 company has Hadoop running in some form, but many are slow to make full use of it. Greater pluggability will not help the businesses that hear of Hadoop, want to get into big data, and end up buying several nodes to that end without much thought.

It will, however, assist those with competent technology departments that have analytics tools but are unable to integrate them with Hadoop for whatever reason. “We need to make sure that there’s the right APIs for everyone who’s building data products to plug into Hadoop in various ways.”

Finally, someone has to do all this research to advance Hadoop into its second version. Baldeschwieler notes that while the Hadoop community welcomes good ideas and contributions, one should build a reputation in the community by doing interesting research with Hadoop before trying to add to it.