Google Charts Data Stack Overhaul with Kubernetes
Google today announced that its hosted Spark and Hadoop distribution, Cloud Dataproc, can now run on Kubernetes, at least as an alpha. It may not sound particularly groundbreaking. But when you consider the work that went into replacing YARN with Kubernetes on Cloud Dataproc, as well as Google’s commitment to bringing other components of the former Hadoop stack forward into the Kubernetes realm, you realize what a sizable commitment Google is making to modernize the whole big data stack. But will the community follow?
Up until today, Cloud Dataproc has run like every other Hadoop distribution, whether in-cloud or on-prem. That is, Cloud Dataproc has relied on YARN to be the underlying workload and cluster manager. All of the data processing engines that Google makes available in the service (from Spark and MapReduce to Pig and Flink) have relied on YARN to dole out computing resources and prevent workloads from colliding.
That Google customers are dependent on YARN may be surprising, especially with all the hype surrounding Kubernetes, which nearly everybody seems to agree is the heir apparent to YARN. After all, Kubernetes was developed at Google, and was modeled on the software the Web giant uses to keep workloads on the straight and narrow in its own data centers.
You might be tempted to think that Kubernetes has already infiltrated Google Cloud and that all of the cloud services that Google exposes can run in the blissfully atomic, eternally scalable manner that Kubernetes promises. But you would be mistaken. It wasn’t until early 2019 that Google announced a Kubernetes operator for Spark. That allowed customers to run Spark on Google Cloud Platform clusters that are managed using Kubernetes. But Cloud Dataproc is a different beast.
Like Cloudera’s Hadoop distributions, Cloud Dataproc is assembled from various open source components and bundled together with Apache Bigtop. Customers are free to run just about anything on GCP. But if they want a big data platform that brings centralized security, logging, service level agreements (SLAs), and a nice management console, then they’re going to be looking at a managed service like Cloud Dataproc or other Hadoop offerings, such as Amazon Web Services’ Elastic MapReduce (EMR) or Microsoft Azure’s HDInsight.
Hadoop didn’t die. It just moved to the cloud.
Major Platform Overhaul
Adopting Kubernetes will require a ton of new code to be written and tested at many levels of the big data stack, including the underlying data platform level and the individual-engine level.
But the benefits of moving to Kubernetes will be felt in greater isolation between processing engines, less co-dependency in the Hadoop stack, and greater scalability.
Google is positioned to help at all levels, including with individual open source projects, says James Malone, Google’s senior product manager for Cloud Dataproc and other cloud offerings.
“It’s sort of a two-front battle here that we’re undertaking to make it happen,” Malone says. “It’s a big change for a managed service, but it also entails an astronomical number of changes in the open source, which we’re actively working on as well. A lot of things need to happen inside of each individual open source project. That’s why we’re trying to get the ball rolling.”
Google employs committers to the Apache Spark project and initiated the development of the Kubernetes operator for Spark, which should become generally available with an upcoming version of Spark. Google also today announced a Kubernetes operator for Flink, and is working with the folks behind Presto and Druid to develop Kubernetes operators for those open source projects too.
“One of the reasons we chose those four is they’re less dependent on other projects, so they’re a bit more isolated. They’re a little easier to move to Kubernetes first,” Malone tells Datanami. “Plus they have the most popularity behind them, which is also nice. I think over time, we’ll start looking at, okay, how can we get things like Hive running on Kubernetes as well.”
At the platform level, Google just wrapped up the first iteration of support for Kubernetes in Cloud Dataproc. But it’s not a wholesale change, as Google is committed to supporting a YARN version indefinitely, Malone says.
“We’ll still have our YARN version of Dataproc available for the very long foreseeable future,” he says. “What we expect will happen is a lot of customers will start doing more development and testing on Kubernetes and there will be an inflection point where a lot of new development will occur in Kubernetes.”
In fact, Google is taking care to make sure the YARN and Kubernetes versions of Dataproc are, code-wise and compatibility-wise, as close as possible to enable customers to switch back and forth between the two, Malone says.
A Wider Hybrid World
There’s a third factor at play in Google’s emphasis on getting the big data community to adopt Kubernetes: Anthos, its on-premises runtime. Anthos is based on Kubernetes, and the plan is for a version of Cloud Dataproc to run on Anthos at some point.
“Anthos is a really important component of our long-term strategy for hybrid cloud and multi-cloud,” Malone says. “And Anthos is Kubernetes-based. So to the extent we get Dataproc working and all of the open source components working on Dataproc, it also unlocks the future where Dataproc can run on Anthos, which is very important to us, to offer customers that flexibility of where they use our services.”
The way Malone tells it, Google is committed to helping the community by backing open standards that give customers the ability to move freely. That could reduce lock-in and revenue for Google Cloud, but Malone says the community-wide benefits of reduced dependencies and complexity are well worth it.
“I’m a Spark developer. I take my Spark code and run it on Dataproc or EMR or Cloudera on-prem,” he says. “I’m thinking about what version is all the stuff in the Hadoop stack. What version of operating system am I using? What does the environment look like? How do I tune that cluster to respond to the resource constraint or availability?
“When we move to Kubernetes, it’s much easier because I as a developer can take my code, package it up with whatever dependencies I need, and I stop caring about the other things that honestly aren’t really a value-add to me,” he continues. “It really gives a lot more flexibility to the customer.”