October 23, 2013

HDP 2.0: Rise of the Hadoop Data Lake

Alex Woodie

Hortonworks became the first Hadoop distributor to ship the new Hadoop version 2 software today when it announced the general availability of Hortonworks Data Platform (HDP) 2.0. The update will enable customers with small Hadoop clusters to upgrade their big data platform into a shared Hadoop service, or a data lake, a Hortonworks executive explains.

The much-awaited introduction of YARN in Hadoop version 2 will make it easier for customers to run batch, interactive, streaming, and real-time workloads on their Hadoop clusters, like one big happy family. YARN acts as a de facto operating system to tame the various Hadoop engines, such as MapReduce, HBase, Hive, and Storm, and to make sure they play well together under the same tent. That is arguably the big news with Hadoop version 2, which has been in beta for the better part of 2013 and was released in final form just last week by the Apache Software Foundation.

With YARN directing activities under the Hadoop big top, organizations can now feel comfortable scaling their clusters to new heights without having to worry about resource contention issues among the various engines, explains Hortonworks vice president of corporate strategy Shaun Connolly.

About 70 percent of Hortonworks customers begin their work on Hadoop with very targeted goals, usually centered around just a handful of data types, Connolly says. These projects start off with Hadoop clusters that are from 10 to 40 nodes in size.

“At a certain point, they’re convinced they have enough of these new Hadoop applications that are driving value that they want to centralize it and operationalize it and that’s where they graduate,” Connolly says in an interview with Datanami. “They’ve grown Hadoop to have enough applications and data at their disposal, and they want to treat Hadoop as a shared data lake, or a Hadoop service, if you will, where that cluster will begin to grow into pretty sizable scale.”

“The power of Hadoop 2 and YARN is in that data lake service, where there are varying applications and workloads hitting that service,” Connolly says. “It’s usually around 100 nodes and above. That’s where they bring in full-blown IT operations. That’s really where the power of YARN shines because now you’re running mixed workloads in the platform.”

YARN will help in two ways, he says. There are the obvious benefits of getting MapReduce, HBase, Storm, and Giraph to play nicely. But there are also straight-up performance improvements hidden away in the Hadoop version 2 code that will provide a sizable speed-up to all workloads.

“Even if you’re just doing classic MapReduce, you get twice the performance, and you’re able to run twice as many jobs as the previous Hadoop 1.x lot,” Connolly says. “Massive clusters won’t have to grow as quickly because you have more headroom and you have faster performance.”

The folks at Hortonworks had a big hand in getting Hadoop version 2 out the door. That’s because many of the managers of the various open source Apache projects that make up Hadoop are employed as engineers at the Palo Alto, California, software company. This gives Hortonworks an edge, and helps ensure that the long trunk of the open source project is ready for prime time at real companies.

“Effectively we’re the ones who drive a lot of the code in those [Apache] projects,” Connolly contends. “We have over 100 engineers, and almost all of them have committer access, to be able to modify and ship code for all these ASF projects. We do all our work in those projects. We patch and stabilize those products, so when those products come up in stable GA release, we have the confidence of knowing we can now package that up into a Hortonworks Data Platform release that’s stable for the enterprise.”

Hortonworks engineers participated in much of the work of finalizing the various Apache projects that make up Hadoop. To that end, HDP 2.0 includes the following components.

• Apache Hadoop 2.2.0 (GA from the community on 10/15)

• Apache HBase 0.96 (GA from the community on 10/18)

• Apache ZooKeeper 3.4.5

• Apache Pig 0.12.0 (GA from the community on 10/??)

• Apache Hive 0.12.0 (GA from the community on 10/16)

• Apache HCatalog 0.12.0 (GA from the community on 10/16)

• Apache Oozie 4.0.0

• Apache Sqoop 1.4.4

• Apache Flume 1.4.0

• Apache Ambari 1.4.1 (GA from the community on 10/??)

• Apache Mahout 0.8.0

In addition to acting as an intermediary between the open source world and enterprise customers, Hortonworks works with other software vendors (like Microsoft, SAP, Teradata, SAS, Splunk, and Talend) to enable their products to work with Hadoop. To that end, you can expect several announcements from Hortonworks at Strata Conference + Hadoop World 2013, being held next week in New York City.

Hortonworks is also unique among the Hadoop distributors in that it’s the only one supporting Windows. With Hadoop version 2, Windows becomes fully supported as an alternative to Linux, without having to do any kludge emulations or workarounds. That’s good news for Hortonworks, which has worked with Microsoft to enable Hadoop to run as a service on its Azure cloud.

Hortonworks’ Windows Hadoop offering has been on the market for six months, and accounts for about 15 percent of its business, Connolly says. “It’s made pretty good inroads,” he says. Hortonworks and Microsoft will be discussing Microsoft’s REEF machine learning technology at Strata Conference + Hadoop World next week.

Going forward, the engineers at Hortonworks are working on Apache Tez and the Stinger project. Tez, which is an Apache incubator project, aims to expand upon the MapReduce paradigm to enable more interactive-like response times. Stinger, meanwhile, aims at speeding up the processing of SQL in Apache Hive. Phase two of Stinger was completed with the introduction of YARN in HDP 2.0. Phase three aims to enable Hive on Apache Tez, among other enhancements, and is “coming soon,” according to Hortonworks.
