HDP 2.0: Rise of the Hadoop Data Lake
Hortonworks became the first Hadoop distributor to ship the new Hadoop version 2 software today when it announced the general availability of Hortonworks Data Platform (HDP) 2.0. The update will enable customers with small Hadoop clusters to upgrade their big data platform into a shared Hadoop service, or a data lake, a Hortonworks executive explains.
The much-awaited introduction of YARN in Hadoop version 2 will make it easier for customers to run batch, interactive, streaming, and real-time workloads on their Hadoop clusters, like one big happy family. YARN acts as a de facto operating system to tame the various Hadoop engines, such as MapReduce, HBase, Hive, and Storm, and to make sure they play well together under the same tent. That is arguably the big news with Hadoop version 2, which has been in beta for the better part of 2013 and was released in final form just last week by the Apache Software Foundation.
With YARN directing activities under the Hadoop big top, organizations can now feel comfortable scaling their clusters to new heights without having to worry about resource contention issues among the various engines, explains Hortonworks vice president of corporate strategy Shaun Connolly.
About 70 percent of Hortonworks customers begin their work on Hadoop with very targeted goals, usually centered around just a handful of data types, Connolly says. These projects start off with Hadoop clusters that are from 10 to 40 nodes in size.
“At a certain point, they’re convinced they have enough of these new Hadoop applications that are driving value that they want to centralize it and operationalize it and that’s where they graduate,” Connolly says in an interview with Datanami. “They’ve grown Hadoop to have enough applications and data at their disposal, and they want to treat Hadoop as a shared data lake, or a Hadoop service, if you will, where that cluster will begin to grow into pretty sizable scale.”
“The power of Hadoop 2 and YARN is in that data lake service, where there are varying applications and workloads hitting that service,” Connolly says. “It’s usually around 100 nodes and above. That’s where they bring in full-blown IT operations. That’s really where the power of YARN shines because now you’re running mixed workloads in the platform.”
YARN will help in two ways, he says. There are the obvious benefits of getting MapReduce, HBase, Storm, and Giraph to play nicely. But there are also straight-up performance improvements hidden away in the Hadoop version 2 code that will provide a sizable speed-up to all workloads.
“Even if you’re just doing classic MapReduce, you get twice the performance, and you’re able to run twice as many jobs as the previous Hadoop 1.x lot,” Connolly says. “Massive clusters won’t have to grow as quickly because you have more headroom and you have faster performance.”
The folks at Hortonworks had a big hand in getting Hadoop version 2 out the door. That’s because many of the managers of the various open source Apache projects that make up Hadoop are employed as engineers at the Palo Alto, California, software company. This gives Hortonworks an edge, and helps ensure that the long trunk of the open source project is ready for prime time at real companies.
“Effectively we’re the ones who drive a lot of the code in those [Apache] projects,” Connolly contends. “We have over 100 engineers, and almost all of them have committer access, to be able to modify and ship code for all these ASF projects. We do all our work in those projects. We patch and stabilize those products, so when those products come up in stable GA release, we have the confidence of knowing we can now package that up into a Hortonworks Data Platform release that’s stable for the enterprise.”
Hortonworks engineers participated in much of the work of finalizing the various Apache projects that make up Hadoop. To that end, HDP 2.0 includes the following components.
• Apache Hadoop 2.2.0 (GA from the community on 10/15)
• Apache HBase 0.96 (GA from the community on 10/18)
• Apache ZooKeeper 3.4.5
• Apache Pig 0.12.0 (GA from the community on 10/??)
• Apache Hive 0.12.0 (GA from the community on 10/16)
• Apache HCatalog 0.12.0 (GA from the community on 10/16)
• Apache Oozie 4.0.0
• Apache Sqoop 1.4.4
• Apache Flume 1.4.0
• Apache Ambari 1.4.1 (GA from the community on 10/??)
• Apache Mahout 0.8.0
In addition to acting as an intermediary between the open source world and enterprise customers, Hortonworks works with other software vendors (like Microsoft, SAP, Teradata, SAS, Splunk, and Talend) to enable their products to work with Hadoop. To that end, you can expect several announcements from Hortonworks at Strata Conference + Hadoop World 2013, being held next week in New York City.
Hortonworks is also unique among the Hadoop distributors in that it’s the only one supporting Windows. With Hadoop version 2, Windows becomes fully supported as an alternative to Linux, without having to do any kludge emulations or workarounds. That’s good news for Hortonworks, which has worked with Microsoft to enable Hadoop to run as a service on its Azure cloud.
Hortonworks’ Windows Hadoop offering has been on the market for six months, and accounts for about 15 percent of its business, Connolly says. “It’s made pretty good inroads,” he says. Hortonworks and Microsoft will be discussing Microsoft’s Reef machine learning technology at the upcoming Strata Hadoop Summit next week.
Going forward, the engineers at Hortonworks are working on Apache Tez and the Stinger project. Tez, which is an Apache incubator project, aims to expand upon the MapReduce paradigm to enable more interactive-like response times. Stinger, meanwhile, aims at speeding up the processing of SQL in Apache Hive. Phase two of Stinger was completed with the introduction of YARN in HDP 2.0. Phase three aims to enable Hive on Apache Tez, among other enhancements, and is “coming soon,” according to Hortonworks.