Hadoop 3.0 Ships, But What Does the Roadmap Reveal?
As promised, the Apache Software Foundation delivered Hadoop version 3.0 before the end of the year. Now the Hadoop community turns its attention to versions 3.1 and 3.2, which are slated to bring even more good stuff during the first half of 2018.
As we told you last week, Hadoop 3.0 brings two big new features that are compelling in their own right. These include support for erasure coding, which should boost storage efficiency by 50% by replacing full data replication with more space-efficient parity encoding; and YARN Federation, which should allow Hadoop clusters to scale up to 40,000 nodes.
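The 50% figure falls out of simple arithmetic: classic HDFS triple replication stores three copies of every block, while a Reed-Solomon scheme along the lines of Hadoop 3's RS(6,3) policy stores six data blocks plus three parity blocks. A quick sketch of that comparison (the specific policy is an assumption chosen for illustration):

```python
# Storage-overhead arithmetic behind the ~50% savings claim.
# Assumes 3x HDFS replication vs a Reed-Solomon (6 data, 3 parity)
# erasure coding policy, similar to Hadoop 3's RS-6-3 option.

def overhead(data_units, stored_units):
    """Extra disk consumed beyond the raw data, as a fraction of the data."""
    return (stored_units - data_units) / data_units

replication_overhead = overhead(1, 3)   # 1 block stored as 3 full copies
ec_overhead = overhead(6, 9)            # 6 data blocks + 3 parity blocks

print(f"replication:    {replication_overhead:.0%} overhead")  # 200%
print(f"erasure coding: {ec_overhead:.0%} overhead")           # 50%

# For 1 TB of raw data: 3 TB on disk with replication,
# 1.5 TB with RS(6,3) -- roughly half the disk footprint.
```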
The delivery of Hadoop 3.0 shows that the open source community is responding to the demands of industry, said Doug Cutting, original co-creator of Apache Hadoop and the chief architect at Cloudera.
“It’s tremendous to see this significant progress, from the raw tool of eleven years ago, to the mature software in today’s release,” he said in a press release. “With this milestone, Hadoop better meets the requirements of its growing role in enterprise data systems.”
But some of the new features in Hadoop 3.0 weren’t designed to bring immediate rewards to users. Instead, they pave the way for the Apache Hadoop community to deliver more compelling features with versions 3.1 and 3.2, according to Hortonworks director of engineering Vinod Kumar Vavilapalli, who’s also a committer on the Apache Hadoop project.
“Hadoop 3.0 is actually a building block, a foundation, for more exciting things to come in 3.1 and 3.2,” he said.
Vavilapalli shared parts of the Hadoop roadmap with Datanami recently. Here are some of the highlights from that conversation.
One of the Hadoop 3.0 features that will pay immediate dividends in version 3.1 is support for resource types in YARN.
With the Hadoop 2.x line, YARN only recognizes two resources: memory and CPU. With resource types delivered in Hadoop 3.0, the community is well positioned to offer support for GPUs in Hadoop clusters with version 3.1.
Having GPUs supported as a first-class resource type in Hadoop will make it easier for customers to run GPU-loving workloads, such as machine learning and deep learning workloads, Vavilapalli said.
“The idea is, instead of asking people to set up separate partitions and separate machines and separate clusters for GPU workloads, YARN itself will have first-class support for GPUs,” Vavilapalli said. “There’s a lot of sharing and multi-tenancy that we’ve already solved with YARN.”
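In the Hadoop 3.0 resource-type model, administrators declare new schedulable resources alongside memory and CPU. A minimal sketch of what that declaration could look like, assuming the `yarn.resource-types` property and `yarn.io/gpu` resource name from the upstream GPU work (treat the exact names as illustrative, not authoritative):

```xml
<!-- resource-types.xml sketch: register GPUs as a schedulable
     resource next to memory and vcores. Property and resource
     names may differ across Hadoop 3.x releases. -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

Once a resource is registered this way, applications can request it in their container asks just as they request memory and vcores today, and the scheduler enforces it cluster-wide.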
Another worthwhile new feature that developers are working to deliver in Hadoop 3.1 is YARN support for Docker containers. According to Vavilapalli, Docker YARN support will deliver two main capabilities. The first is allowing non-big data workloads to run on Hadoop, such as containerized applications.
“In addition to running MapReduce workloads, Spark workloads, Hive workloads on top of YARN, which have been traditional big data workloads,” the Hortonworks engineer said, “if a user has a containerized application that they’ve already built on their laptop, they can go to a YARN cluster and say, now run a hundred copies of it, a thousand copies of it, and YARN will be able to do it.”
Gaining package isolation is the second advantage that Docker on YARN would bring to Hadoop 3.1. One area where this is important is for ensuring that R and Python libraries are present but don’t cause compatibility problems with each other. “With package isolation, you can actually create your own Docker container which has all your R or Python libraries, hand it off to YARN and YARN will exactly run it the way it runs on your laptop,” Vavilapalli said.
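On the cluster side, Docker support hinges on letting NodeManagers choose a container runtime per application. A hedged sketch of the yarn-site.xml change, using the `yarn.nodemanager.runtime.linux.allowed-runtimes` property name from the in-progress Docker-on-YARN work (property names may shift before release):

```xml
<!-- yarn-site.xml sketch: allow NodeManagers to launch containers
     through Docker in addition to the default process runtime.
     Illustrative only; verify names against the 3.1 release docs. -->
<configuration>
  <property>
    <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
    <value>default,docker</value>
  </property>
</configuration>
```

With a Docker runtime enabled, an application ships its R or Python dependencies inside the image, so nothing needs to be preinstalled on the cluster nodes.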
Another major new feature slated for Hadoop 3.1 is something called YARN Services. With this new feature, YARN will have new capabilities to manage long-running services on Hadoop, such as an incoming Kafka data stream or an HBase service.
“The goal of long-running services is you go to YARN, and just like you’re running Spark…and Hive, you can go to YARN and say, ‘Run this long-running workload for me on top of YARN right next to everything else that’s running there,’” Vavilapalli said. “We call this YARN Services. So YARN Services is a big feature we’ve been working on.”
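The YARN Services work describes long-running applications declaratively in a service specification that YARN then keeps alive. A hypothetical sketch of such a spec, with field names based on the in-progress YARN Services API (every name and value here is illustrative):

```json
{
  "name": "hbase-service",
  "components": [
    {
      "name": "regionserver",
      "number_of_containers": 3,
      "launch_command": "bin/start-regionserver.sh",
      "resource": {
        "cpus": 1,
        "memory": "2048"
      }
    }
  ]
}
```

In this model, YARN maintains the requested number of containers for each component, restarting them on failure much as it already reschedules failed batch tasks.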
The Apache Hadoop community plans to use the new resource type support in Hadoop 3.0 to deliver GPU support in version 3.1. And if all goes as planned, Hadoop 3.2 will get support for FPGAs.
FPGAs, or field programmable gate arrays, are a type of specialized processor designed to speed up particular types of workloads. Intel, which acquired FPGA maker Altera in 2015, is investing in FPGA technology to speed up high-performance computing (HPC), AI, data and video analytics, and 5G network processing on Xeon clusters, while IBM has added FPGAs to its Power servers to, among other things, juice the performance and scalability of graph databases.
There are two reasons Hadoop customers may embrace FPGAs, Vavilapalli said. “One is GPUs are expensive. Some users are looking at FPGAs as a cheaper way of getting some of the things that they want,” he said. “And there are use cases that can be done in FPGAs alone. Those are the two use cases that are driving support for FPGAs in YARN.”
The Hadoop community is currently working on FPGA support, Vavilapalli said, and there’s a chance that it could come with Hadoop version 3.1. “It’s not entirely clear whether it will land in 3.1 or 3.2,” he said. “But FPGA support in YARN is coming soon.”
The other big new feature in Hadoop 3.2 is support for a new key value store called Ozone. According to Vavilapalli, Ozone will bring an S3-compatible storage API and will be optimized for storing smaller files that are not a good fit for HDFS.
HDFS was originally designed to store relatively large files in a write-once, read-multiple-times manner, Vavilapalli said. “Over time, we are seeing additional workloads which basically say ‘I want to save this little object.’ Maybe it’s a photo or a small video. These are not your typical Hadoop workloads. So there’s a need for storing a lot of small objects, and people end up using HDFS for that.”
Ozone would be better suited for storing these types of objects. The storage repository will feature a storage API that’s compatible with S3, the Simple Storage Service developed by Amazon Web Services to store zettabytes’ worth of data across millions of servers running in hundreds of data centers around the world.
As AWS has taken off, the S3 API has become a de facto storage standard that’s supported by many software-as-a-service (SaaS) vendors that want to tap into data their customers have stored in S3, including Hadoop and Hadoop-like workloads running in the cloud. “The community is trying to make it look like the S3 API because the S3 API is a standard for all kinds of key value stores,” Vavilapalli said. “It will look like S3, but for on-prem Hadoop clusters.”
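The practical difference is that an S3-style store addresses data by bucket and key rather than by file path and block. A toy, purely illustrative model of that interface (none of these names are Ozone's actual API; they just sketch the key-value contract the S3 API standardizes):

```python
# Toy model of an S3-style object store, to contrast with HDFS's
# write-once file/block model. Names are illustrative, not Ozone's API.
class ObjectStore:
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, bucket):
        self.buckets.setdefault(bucket, {})

    def put_object(self, bucket, key, data: bytes):
        # Small objects (photos, short videos) are stored whole under
        # a key, with no file or block semantics for the client to manage.
        self.buckets[bucket][key] = data

    def get_object(self, bucket, key) -> bytes:
        return self.buckets[bucket][key]

store = ObjectStore()
store.create_bucket("photos")
store.put_object("photos", "2018/cat.jpg", b"fake-jpeg-bytes")
print(store.get_object("photos", "2018/cat.jpg"))
```

Because the interface mimics S3, existing S3 tooling could in principle talk to an on-prem cluster the same way it talks to the AWS cloud, which is the portability argument Vavilapalli makes below.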
While cloud-based Hadoop offerings like Amazon’s EMR and Qubole – and even the cloud offerings of Hadoop distributors Cloudera and Hortonworks – tap into S3, Ozone is designed to work both in the cloud and on premises.
“Cloud is a major force today. But eventually we will have cloud and we’ll have an on-prem footprint,” Vavilapalli said. “To be able to run applications that seamlessly work on-prem as well as cloud, you need something like that on prem. That’s where Ozone will fill a major gap… Having these applications write to a key value API and make it run on prem as well as cloud is a very powerful primitive that we can see people using in the future.”
While the delivery date could change, Hadoop version 3.2 is being penciled in for delivery around the end of the second quarter, Vavilapalli said. The Apache Hadoop community is trying to accelerate the release cadence for the project, and wants to deliver a major upgrade every three months, he said.