Cloudera Prefers Red Hat for Kubernetes, But YARN Not Going Away
When Cloudera ships the on-premises version of its latest Hadoop distribution later this year, it will work with a Kubernetes container orchestration system from Red Hat, the company announced today. But the introduction of Kubernetes in CDP Private Cloud doesn’t mean that YARN will completely disappear, the company says.
Cloudera began transitioning its Hadoop distribution to a new cloud architecture last fall with the introduction of the Cloudera Data Platform (CDP), which combined elements of Hortonworks’ and Cloudera’s older platforms with a public cloud delivery model.
The first iterations of CDP applications, including Cloudera Data Warehouse and Cloudera Machine Learning, sported a Kubernetes container manager in place of the traditional YARN resource scheduler, as well as Amazon Web Services’ S3-based object storage in place of the Hadoop Distributed File System (HDFS). Cloudera followed that up with similar CDP offerings that use object storage on Microsoft Azure, which it released in March; support for Google Cloud is coming next.
At the same time, it released a YARN-based version of CDP for the public cloud. Called Cloudera Data Hub, the service is designed to run traditional MapReduce and Spark applications on AWS and Azure.
The company has previously discussed its transition from traditional Hadoop components like YARN and HDFS to the new cloud architecture, which features Kubernetes and S3 object storage. Such an architectural shift is a big engineering undertaking, and much attention must be paid to making sure that existing Hadoop applications and open source engines in the Hadoop ecosystem continue working.
Hiding the complexity of the underlying technology has been a big goal of Cloudera and its chief product officer, Arun Murthy, especially in light of the difficulty that customers have experienced with Hadoop in the past. With CDP running in the cloud, Cloudera is able to hide much of that underlying complexity.
With the selection of Red Hat’s OpenShift platform, the company has formally stated its preference with regard to its Kubernetes strategy (although that doesn’t rule out the use of other Kubernetes offerings). OpenShift is a solid market contender that will bolster Cloudera’s hybrid data management position, Murthy says.
“Red Hat OpenShift’s position as the market-leading Kubernetes container platform, combined with its 100% open source nature, make it ideal for CDP Private Cloud,” Murthy said in a press release. “CDP Private Cloud, supported by Red Hat OpenShift, creates an enterprise data cloud with a powerful hybrid architecture that separates compute and storage for greater agility, ease of use, and more efficient use of private and public cloud infrastructure.”
Kubernetes has quickly become the de facto standard way to manage big server workloads, both on the cloud and off. The capability to quickly spin up and spin down containers – usually Docker containers – in a virtual manner, without worrying about the underlying hardware, gives Kubernetes a major advantage over the virtual machine approaches of yesteryear.
All of the major public cloud providers support Kubernetes, which was originally developed by Google. By providing the virtualization layer between the application workloads and the underlying x86 hardware, Kubernetes makes it much simpler for customers to scale their workloads up and down, or to move them to different computing resources.
But the introduction of Kubernetes doesn’t spell the end of YARN, which became generally available in 2013 with Apache Hadoop 2. According to Cloudera, YARN will continue to be used to connect big data workloads to underlying compute resources in CDP Data Center edition, as well as the forthcoming CDP Private Cloud offering, which is now slated to ship in the second half of 2020.
“We are using both YARN and OpenShift Kubernetes, for different parts of Private Cloud,” the company tells Datanami. “Resource management in the private cloud will use aspects of YARN that we have built as applied to Kubernetes resource management.”
CDP Data Center is foundational to CDP Private Cloud, and so the two offerings will share the underlying YARN technology to manage workloads. In the past, Cloudera executives have stated that while support for Kubernetes is important, YARN actually offers better control for some types of workloads than Kubernetes, and so it will be retained in the new data platform.
Cloudera is also working on a third project, called YuniKorn, that bridges the gap between the two resource schedulers. YuniKorn “is a lightweight, universal resource scheduler for container orchestrator systems,” the company stated in a July 2019 blog post. “It is created to achieve fine-grained resource sharing for various workloads efficiently on large scale, multi-tenant environments on one hand and dynamically brought up cloud-native environment on the other.”
The other major architectural shift that’s ongoing at Cloudera involves storage. The public cloud versions of CDP use object stores, including S3 on AWS and Azure Data Lake Storage on Azure. While there was some speculation that Cloudera might move away from HDFS with CDP Private Cloud in favor of an S3-compatible object storage system, that doesn’t appear to be the case, at least for the time being.
Cloudera tells us that “CDP Data Center, which runs on-prem, continues to support HDFS and in 7.1, we will be introducing Ozone in beta form. CDP Private Cloud also supports data access via HDFS and Ozone.”
Ozone is an alternative file system that Hortonworks began work on in 2014 to solve HDFS’s small-file problem. According to Cloudera, it will behave like an object store for CDP. “Ozone is a distributed key-value store that can manage both small and large files alike,” Cloudera says in an October 2018 blog post. “While HDFS provides POSIX-like semantics, Ozone looks and behaves like an Object Store.”
Cloudera’s offerings have evolved quite a bit since the early days of Hadoop. As it looks to the future, it’s embracing Kubernetes and S3, modern architectural components that the market has chosen. But at the same time, it’s maintaining connections to the past, which will be reassuring to customers who may not like so much change. Keeping a foot in both camps might be a tough strategy to maintain five years from now, but today it looks like a pragmatic one – assuming the user can be shielded from that old nemesis: technical complexity.