January 30, 2019

Google Brings Kubernetes Operator for Spark to GCP

Those looking to run Apache Spark on clusters managed with Kubernetes will be interested in the new Spark operator for Kubernetes unveiled by Google today. The software, which is in beta, will be supported on the Google Cloud Platform.

Kubernetes is emerging as a de facto standard for scheduling containerized workloads, both in the cloud and on premises. The software, which was developed by Google and released as open source, removes much of the technical difficulty of managing and scaling microservices-based applications across underlying hardware resources.

Apache Spark, meanwhile, is the go-to framework that data scientists and data engineers use to develop and run data-intensive distributed applications, including ETL, machine learning, stream processing, and SQL analytics (among other use cases). The in-memory technology was originally developed at UC Berkeley’s AMPLab to replace MapReduce in the Hadoop stack, but its versatility has bolstered its usage outside of Hadoop.

In a post on the Google Cloud Blog today, Google product manager Palak Bhatia and software engineer Yinan Li describe the rationale for the Spark Operator and how it’s being implemented at a technical level.

“Traditionally, large-scale data processing workloads—Spark jobs included—run on dedicated software stacks such as Yarn or Mesos,” Bhatia and Li write in the post. “With the rising popularity of microservices and containers, organizations have demonstrated a need for first-class support for data processing and machine learning workloads in Kubernetes.”

Kubernetes integration with Spark has been a focus of developers working on the Apache Spark project. The first delivery of that integration arrived with Spark 2.3.0 in early 2018, and the Kubernetes scheduler backend was improved in Spark 2.4.0 last November.
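With that native integration, a Spark job can be submitted directly against a Kubernetes API server using spark-submit. A minimal sketch (the API server address and container image below are placeholders, not values from the article):

```shell
# Submit the bundled SparkPi example to a Kubernetes cluster (Spark 2.4.x).
# Replace the master URL and container image with your own.
spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=example/spark:2.4.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```

The `local://` scheme tells Spark the application jar is already present inside the container image rather than on the submitting machine.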

Google’s new Spark Operator relies upon this native Kubernetes integration to run, monitor, and manage the lifecycle of Spark applications within a Kubernetes cluster on GCP. While technically a beta, the company says the Spark Operator is “ready for use for large scale data transformation, analytics, and machine learning” on GCP. The operator supports Spark 2.4.0, which adds support for running PySpark and SparkR applications on Kubernetes.

Bhatia and Li provide more details:

“Specifically, this operator is a Kubernetes custom controller that uses custom resources for declarative specification of Spark applications,” they write. “The controller offers fine-grained lifecycle management of Spark applications, including support for automatic restart using a configurable restart policy, and for running cron-based, scheduled applications.”
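Concretely, such a declarative specification takes the form of a SparkApplication custom resource. The sketch below illustrates the general shape of one, including a configurable restart policy; field names follow the operator’s open source custom resource definition, and the image, jar path, and retry values shown are illustrative placeholders:

```yaml
# Illustrative SparkApplication manifest for the Kubernetes Operator for Spark.
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "example/spark:2.4.0"          # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:                         # the configurable restart policy
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
  driver:
    cores: 1
    memory: "512m"
  executor:
    cores: 1
    instances: 2
    memory: "512m"
```

The controller watches for resources of this kind and translates them into running Spark driver and executor pods, restarting the application according to the policy above.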

The operator allows users to “create a declarative specification that describes your Spark applications and use native Kubernetes tooling such as kubectl to manage your applications,” the Google employees continue. “As a result, you now have a common control plane for managing different kinds of workloads on Kubernetes, simplifying management and improving your cluster’s resource utilization.”
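Because the application is just another Kubernetes resource, day-to-day management happens through standard kubectl commands once the operator is installed. A sketch, assuming a manifest saved as spark-pi.yaml (a hypothetical filename) describing an application named spark-pi:

```shell
# Create the application from its declarative spec...
kubectl apply -f spark-pi.yaml

# ...then list, inspect, and eventually delete it like any other resource.
kubectl get sparkapplications
kubectl describe sparkapplication spark-pi
kubectl delete sparkapplication spark-pi
```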

The new operator integrates with other GCP products and services, including Stackdriver for logging and monitoring, Cloud Storage for data, and BigQuery for analytics. The software ships with a custom Dockerfile that supports using Cloud Storage for input or output of an application, as well as the Prometheus JMX exporter for monitoring, the Google employees add.

The Spark Operator is already in use on GCP, according to Google. A Slack channel dedicated to the operator has attracted more than 170 members, and a GitHub repository hosts the code that developers share and distribute for the project.

Google already has plans to bolster the Spark Operator, including support for different Spark versions (there are incompatibilities between Kubernetes operators used for Spark 2.4 and Spark 2.3.x); the addition of priority queues and priority-based scheduling; Kerberos authentication; and improvements to the “sparkctl” command-line tool.

The Spark Operator is available now in the GCP Marketplace.

Related Items:

Kubernetes Is a Prime Catalyst in AI and Big Data’s Evolution

Is Hadoop Officially Dead?

Top 3 New Features in Apache Spark 2.3

Datanami