December 5, 2017

Hadoop 3.0 Likely to Arrive Before Christmas

It’s looking like big data developers will get an early holiday present as work on Hadoop version 3.0 nears completion. And while Hadoop 3.0 brings compelling new features, including a 50% increase in capacity and upwards of a 4x improvement in scalability, more exciting stuff – like support for Docker, support for GPUs, and an S3-compatible storage API – is slated for versions 3.1 and 3.2 next year.

After years of work, the Apache Hadoop community is now putting the finishing touches on a release candidate for Hadoop 3.0 and, barring any unforeseen occurrences, will deliver it by the middle of December, according to Vinod Kumar Vavilapalli, a committer on the Apache Hadoop project and director of engineering at Hortonworks.

“We can’t set the dates in stone, but it’s looking like we’ll get something out by mid-December,” Vavilapalli told Datanami in an interview last week.

The current plans for Hadoop 3.0 haven’t changed dramatically since the last time we covered the open source project, back in May, when we talked with Andrew Wang, the Hadoop 3 release manager for Apache Hadoop and an engineer at Cloudera.

Here’s a rundown on what’s coming in Hadoop 3.0:

Erasure Coding

Erasure coding is a data protection method that up to this point has mostly been found in object stores. With erasure coding in place, Hadoop 3.0 will no longer have to store three full copies of each piece of data across its clusters. Instead it can use a data striping method that’s similar in some ways to RAID 5 or 6.
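To make the RAID comparison concrete, here is a minimal Python sketch of striping with a single XOR parity cell, in the spirit of RAID 5. It is an illustration only: HDFS erasure coding relies on Reed-Solomon codes, which tolerate more than one lost cell, and the function names here (make_stripe, rebuild) are hypothetical rather than part of any Hadoop API.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data, k):
    """Split data into k equal cells (zero-padded) and append one XOR parity cell."""
    cell = -(-len(data) // k)  # ceiling division
    padded = data.ljust(cell * k, b"\0")
    cells = [padded[i * cell:(i + 1) * cell] for i in range(k)]
    return cells + [xor_blocks(cells)]


def rebuild(stripe, lost):
    """Recover the cell at index `lost` by XOR-ing every surviving cell."""
    survivors = [c for i, c in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)


stripe = make_stripe(b"hadoop erasure coding", 3)
# Pretend the disk holding cell 1 failed; the other cells rebuild it.
assert rebuild(stripe, 1) == stripe[1]
```

With one parity cell the stripe survives the loss of any single cell, data or parity, while storing only one extra cell instead of extra full copies.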

Instead of that 3x hit on storage, the erasure coding method in Hadoop 3.0 will incur an overhead of 1.5x while maintaining the same level of data recoverability from disk failure.
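The arithmetic behind those two figures can be sketched in a few lines of Python. The layout of 6 data cells plus 3 parity cells is an assumption chosen because it yields the 1.5x figure quoted above, and the 900 TB raw capacity is a made-up number for illustration.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data under full replication."""
    return float(copies)


def erasure_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data under erasure coding."""
    return (data_units + parity_units) / data_units


def usable_tb(raw_tb: float, overhead: float) -> float:
    """User data that fits in a given amount of raw disk."""
    return raw_tb / overhead


raw = 900.0  # hypothetical raw cluster capacity, in TB
schemes = [
    ("3x replication", replication_overhead(3)),
    ("6 data + 3 parity (assumed layout)", erasure_overhead(6, 3)),
]
for label, overhead in schemes:
    print(f"{label}: overhead {overhead:.1f}x, usable {usable_tb(raw, overhead):.0f} TB")
# 3x replication: overhead 3.0x, usable 300 TB
# 6 data + 3 parity (assumed layout): overhead 1.5x, usable 600 TB
```

Under these assumptions, the same data set occupies half as much raw disk as it would with three full copies.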

“HDFS has always been a system based on replication,” Vavilapalli said. “Now that organizations have put a lot of data in their big data clusters… it’s time to think back and look at how to efficiently manage the storage and get more out of the Hadoop cluster. That’s where erasure coding comes in.”

YARN Federation

Hadoop 3 will include new features in the YARN resource manager that will open the door to customers running single clusters with tens of thousands of nodes – and possibly even to hundreds of thousands of nodes.

“YARN was originally designed to scale to 10,000 machines,” Vavilapalli said. “Our friends at Microsoft have been contributing this feature called YARN Federation that will scale YARN to multiple tens of thousands of machines.”

Conceptually, YARN Federation works similarly to HDFS Federation, a major Hadoop 2.x delivery that has multiple namenodes working in concert. With YARN Federation, each sub-cell would be responsible for a group of machines while allowing multiple sub-cells to work together to build massive clusters in a single namespace.

YARN Federation should allow Hadoop clusters to scale to 40,000 nodes without too much trouble, and even allow it to scale beyond 100,000 nodes, Vavilapalli says.
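As a rough illustration of the idea, the Python sketch below treats each sub-cell as an independent pool of machines and places work on whichever pool has the most free nodes. Everything here is hypothetical: the SubCluster class, the route function, and the placement policy are stand-ins, not YARN's actual router or its policies. Only the 10,000-node design figure comes from the article.

```python
import math
from dataclasses import dataclass


@dataclass
class SubCluster:
    """One federated sub-cell with its own pool of machines (hypothetical model)."""
    name: str
    total_nodes: int
    busy_nodes: int = 0

    @property
    def free_nodes(self) -> int:
        return self.total_nodes - self.busy_nodes


def sub_clusters_needed(total_nodes: int, nodes_per_sub_cluster: int = 10_000) -> int:
    """How many sub-cells of the given size cover the whole cluster."""
    return math.ceil(total_nodes / nodes_per_sub_cluster)


def route(sub_clusters, nodes_requested: int):
    """Place a request on the sub-cell with the most free nodes.

    Returns the chosen sub-cell's name, or None if no sub-cell can fit it.
    This toy policy is an assumption, not YARN's real behavior.
    """
    best = max(sub_clusters, key=lambda s: s.free_nodes)
    if best.free_nodes < nodes_requested:
        return None
    best.busy_nodes += nodes_requested
    return best.name


# A 40,000-node cluster built from four 10,000-node sub-cells.
count = sub_clusters_needed(40_000)
cells = [SubCluster(f"sub-{i}", 10_000) for i in range(count)]
cells[0].busy_nodes = 9_000
print(count, route(cells, 2_000))
```

The point of the sketch is only that each sub-cell stays within the scale YARN was designed for, while a thin routing layer makes the group look like one large cluster.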

Resource Types

Hadoop 3.0 will bring an extensible new framework to YARN that lets it manage additional resource types beyond memory and CPU. This will provide the basis for supporting GPUs in Hadoop clusters with version 3.1 and FPGAs ostensibly in version 3.2.

It will also allow YARN to control another important resource in a big data cluster that up until now has not been directly controllable: disk.
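One way to picture an extensible resource model is as a dictionary of named quantities rather than a fixed memory-and-CPU pair. The Python sketch below is an assumption-laden illustration: the resource names (memory_mb, vcores, gpu, disk_gb) and the fits and allocate helpers are hypothetical and do not mirror YARN's configuration keys or Java classes.

```python
def fits(request: dict, available: dict) -> bool:
    """True if every requested resource is available in sufficient quantity."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def allocate(request: dict, available: dict) -> dict:
    """Return what remains on the node after granting the request."""
    if not fits(request, available):
        raise ValueError("node cannot satisfy request")
    return {name: amount - request.get(name, 0) for name, amount in available.items()}


# Hypothetical node that advertises two extra resource types beyond memory and CPU.
node = {"memory_mb": 65536, "vcores": 16, "gpu": 2, "disk_gb": 4000}

container = {"memory_mb": 8192, "vcores": 4, "gpu": 1}
node = allocate(container, node)
assert node == {"memory_mb": 57344, "vcores": 12, "gpu": 1, "disk_gb": 4000}

# A request naming a resource the node does not advertise simply does not fit.
assert not fits({"fpga": 1}, node)
```

Because the scheduler logic only compares named quantities, adding a new resource type in this model means adding a key, not changing the matching code.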

Java 8

Apache Hadoop 2 and many extended members of the Hadoop family currently run on version 7 of the Java Development Kit (JDK). With support for JDK7 waning, and JDK8 being the optimal route forward, the folks running the project are making the call to enforce a switch to JDK8 starting with Apache Hadoop 3.0.

The good news is the wider big data community is working together to ensure that other animals on the big data farm are also moving up to JDK8. The releases of Hadoop 3.0, HBase 2.0, Hive 3.0, and Phoenix 3.0 will coincide, more or less, in time and in Java.

“JDK is one of the things that does tie all these communities together, because at the end of the day, you’re running all the software on the same cluster, so running different pieces on different JDKs is going to be a problem,” Vavilapalli said.

After Hadoop 3.0 becomes generally available this month, there will be new releases of HBase, Hive, and Phoenix, he said. “So the first half of next year is when all these things will come together,” he said. “It’s not just about Hadoop 3.0.”

Accelerated Release Cycle

Work is already underway on Hadoop 3.1 and Hadoop 3.2, and the plan calls for delivering these releases three months apart, Vavilapalli said. The Apache Hadoop community will be accelerating its release cycle to get more capabilities into the hands of its users more quickly, he said.

The releases “have been spaced out too much. Every six months you used to get a new release,” he said. “In 3.0 we’ve proposed…the community generally moving to faster release. So the community is moving in that direction.”

Vavilapalli also discussed features slated for Hadoop 3.1 and Hadoop 3.2, which we’ll cover in a future article.
