The Key Tech Enabling Cloudera’s New Lakehouse
Cloudera today debuted CDP One, its new software-as-a-service (SaaS) lakehouse offering. For the first time, Cloudera is taking over management of its data platform on behalf of its customers. It’s also Cloudera’s first official foray into the world of data lakehouses, and it’s enabled by support for one key piece of technology.
It’s been nearly three years since Cloudera launched its Cloudera Data Platform (CDP), which marked the company’s transition away from its past as a Hadoop distributor and toward its future as a provider of cloud-based data platforms as a service (PaaS).
As an amalgamation of the Cloudera and Hortonworks Hadoop distributions, CDP bore a lot of resemblance to the Hadoop suites of the past. Data processing engines like Hive, Impala, Spark, and MapReduce were still there. But CDP gave users the option to use newer components that were gaining traction in the public clouds, like Kubernetes instead of YARN for the scheduling component, and S3 instead of HDFS for the storage layer.
With CDP One, Cloudera is now taking the final step of delivering its system as a managed service in the cloud, which will simplify day-to-day management of the platform, according to Cloudera CTO Ram Venkatesh.
“What we had in place for over two years was a PaaS offering, not SaaS,” Venkatesh says. “Cloudera used to operate the control plane, but the actual workloads ran in customers’ account. Now with SaaS, everything is on the Cloudera side of the house and for the customer it’s zero ops, completely managed by Cloudera.”
As far as lakehouse goes, it’s a bit of a branding move on Cloudera’s part. While Cloudera’s competitor, Databricks, popularized the term, it has since been adopted by many other cloud platform providers (including AWS, Google Cloud, and Snowflake) to signify the unification of a data lake and a data warehouse for the purpose of running analytics.
“We’re an open-source company, so we’ll adopt innovation wherever we see it,” Venkatesh tells Datanami regarding the lakehouse concept. “It’s a very good way to frame it in terms that our customers can understand.”
Venkatesh argues that, with the introduction of Apache Hive back in 2012, Cloudera was actually the first vendor with a lakehouse offering. Exabytes of data still sit in lakehouses organized by Hive, which is supported by all of the hyperscalers, he says.
However, at this point in time, the Hive metastore is no longer the ideal logical backing for the modern lakehouse architecture, he says. Other table formats have emerged that overcome the technical limitations of Hive, including Databricks’ own Delta Lake and, more recently, Apache Iceberg.
“The problem was this mapping between a warehouse and a lake was always tightly coupled or biased towards one execution engine,” Venkatesh says. “So when Hive did it, it would work really well for Hive. And Spark, you could sort of do it, if you squinted really hard.
“Now with Spark and Delta Lake it works really well if your whole world is monochromatic Spark,” he continues. “But if you really wanted to interop, what we realized was, there’s a piece in the middle, this glue between the warehouse and the lake, [which] is actually a first-class standalone concept that we’re calling as an open table format.”
The open table format that Cloudera selected is Apache Iceberg. In fact, Cloudera announced support for Iceberg back in June (during Databricks’ annual conference, naturally). Iceberg support is now built into CDP One, giving customers the ability to query their data wherever it sits with whatever query engine they want to use, without having to worry about losing data, which was a common occurrence when the Hive metastore was in charge of the data.
“With Apache Iceberg, this is the first time that this layer is not a slave to one engine,” Venkatesh says. “So on the top end, Iceberg works with Hive, it works with Spark, it works with Impala, it works with Presto. It works with things that we don’t even support.”
On the bottom end, Iceberg lets CDP customers keep their data in whatever on-disk format they want–whether it’s CSV, Parquet, ORC, or Avro–stored on whatever file system they want, whether it’s HDFS, S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (support for ADLS and GCS is forthcoming).
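The idea of a table format that sits between engines and files can be illustrated with a toy model. The sketch below is not Iceberg's API or its on-disk metadata layout; the class names, function names, and bucket paths are all hypothetical. It only assumes the general pattern of a table being a list of immutable snapshots, each naming the data files that make up the table at that point, so any engine that reads the metadata sees the same set of files.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str          # hypothetical location on S3, HDFS, etc.
    file_format: str   # e.g. "parquet", "orc", "avro"
    row_count: int


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: Tuple[DataFile, ...]  # immutable: a snapshot never changes


@dataclass
class Table:
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        # An empty table is represented by snapshot 0 with no files.
        return self.snapshots[-1] if self.snapshots else Snapshot(0, ())

    def snapshot(self, snapshot_id: int) -> Snapshot:
        for s in self.snapshots:
            if s.snapshot_id == snapshot_id:
                return s
        raise KeyError(snapshot_id)

    def append(self, new_files: List[DataFile]) -> Snapshot:
        # A commit adds a new snapshot; older snapshots stay readable.
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        self.snapshots.append(snap)
        return snap


def plan_scan(table: Table, snapshot_id: Optional[int] = None) -> List[str]:
    # Any engine asks the table metadata, not a directory listing,
    # which files to read. Every engine gets the same answer.
    snap = table.current() if snapshot_id is None else table.snapshot(snapshot_id)
    return [f.path for f in snap.files]


def total_rows(table: Table, snapshot_id: Optional[int] = None) -> int:
    snap = table.current() if snapshot_id is None else table.snapshot(snapshot_id)
    return sum(f.row_count for f in snap.files)


events = Table("events")
events.append([DataFile("s3://example-bucket/events/a.parquet", "parquet", 100)])
events.append([DataFile("s3://example-bucket/events/b.orc", "orc", 50)])
```

In this model, two different engines calling `plan_scan(events)` would plan against the identical file list, and a reader pinned to an earlier snapshot ID would still see the table as it was before the second commit.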
Iceberg checks all the boxes that Cloudera could want in an open source software product designed to enable enterprise-scale analytics, Venkatesh says. It’s open source, with a vibrant community around it, and it’s not tied to a single vendor. “So how could we not be in that innovation?” he says.
But Iceberg’s ability to support multiple use cases in a lakehouse pattern–and above all, its seamless support for multiple data engines–is really what sealed the deal for Cloudera to throw its weight behind it and include it as a feature in its Shared Data Experience (SDX) layer.
“We do really well when customers have to run more than one kind of analytic on a data set,” the CTO says. “Typically, if they have a single use case, a single data set, or it’s only SQL, then we may not be the best fit for them. But if they have a lot of data prep, if they have real time and batch data, if they have SQL, if they have some machine learning, if they have some time series analytics, if they have some currency analytics–and this is what large enterprise data platforms look like–they’re combining data in ways that you never thought about when the data was actually originated or sourced.
“When customers are doing this multi-functional analytics, then the seams between these engines become very apparent,” he continues. “Hive, Impala and Spark did not work very cohesively together in the way they were expecting. This was an actual pain point for our customers. Now with Iceberg, they see us embracing this layer to be open.”
The other advantage that Cloudera hopes to exploit going forward is its ability to run on-prem. The Santa Clara, California vendor says its ability to run a lakehouse on-prem, in the public cloud, or via the SaaS delivery method gives it an advantage over competitors that are strictly in the cloud.
“It’s critical,” Venkatesh says. “For our customers, it’s never one size fits all. Even Amazon in their own studies they say cloud is really getting a lot of adoption [and that] by 2025 half of the world’s data is going to be in public cloud. That’s a great story. I love that story. But what about the other half?”
Many customers will not run their lakehouses in the cloud, according to Venkatesh. Whether it’s an issue with scalability, geography, or regulations, there are enterprise accounts that will need to keep their data on-prem.
“We are uniquely positioned with this flexibility, which we think is the one super power Cloudera has,” he says. “We are hybrid when that’s what customers want.”