Cloudera Teams with Google to Run Dataflow on Spark
Cloudera and Google today announced that they’re working together to get Dataflow–the big data pipeline model Google publicly launched last June–to run on Apache Spark, thereby giving customers more freedom to run their big data applications wherever they see fit.
Google Cloud Dataflow is a managed service for creating data pipelines that ingest, transform, and analyze massive amounts of data, in either batch or streaming modes, using the same SDK and API. The service is based on internal Google technologies like FlumeJava and MillWheel, and was introduced by Google last year as a successor to MapReduce that’s well suited to massive ETL jobs.
Dataflow uses the concept of “runners” to determine where a given Dataflow program runs. It started off with two: the Google Cloud Dataflow runner, which executes the program on the Google Cloud Platform, and a Direct Pipeline runner, which executes the program on the developer’s local machine.
With today’s announcement, Google is now supporting a third runner, for Spark. The software, which was developed by Cloudera Labs, allows the same Dataflow pipeline that a developer ran on a local PC or the Google Cloud Platform to run on a Spark cluster, either on-premises or in the cloud.
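The runner idea can be illustrated with a minimal, self-contained Python sketch. This is not the Dataflow SDK or the Cloudera Spark runner: the names Pipeline, LocalRunner, and PartitionedRunner are hypothetical, and the "cluster" is simulated by splitting a list into chunks. It only shows the separation the article describes, in which a pipeline is defined once and handed unchanged to different execution back ends.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Pipeline:
    """Hypothetical back-end-agnostic pipeline: an ordered list of element-wise steps."""
    steps: List[Callable[[Iterable], Iterable]] = field(default_factory=list)

    def map(self, fn):
        # Record the step; nothing executes until a runner is chosen.
        self.steps.append(lambda items: (fn(x) for x in items))
        return self

    def filter(self, pred):
        self.steps.append(lambda items: (x for x in items if pred(x)))
        return self


class LocalRunner:
    """Stand-in for a local runner: executes every step in-process, in order."""

    def run(self, pipeline, data):
        items = data
        for step in pipeline.steps:
            items = step(items)
        return list(items)


class PartitionedRunner:
    """Stand-in for a cluster runner: splits the input and processes each chunk separately."""

    def __init__(self, partitions):
        self.partitions = partitions

    def run(self, pipeline, data):
        # Round-robin split; a real engine would distribute chunks across machines.
        chunks = [data[i::self.partitions] for i in range(self.partitions)]
        local = LocalRunner()
        results = []
        for chunk in chunks:
            results.extend(local.run(pipeline, chunk))
        return results


# The pipeline is defined once, with no reference to where it will run.
pipeline = Pipeline().map(str.lower).filter(lambda name: len(name) > 3)
data = ["Spark", "Tez", "Flink", "Storm", "MapReduce"]

local_result = LocalRunner().run(pipeline, data)
partitioned_result = PartitionedRunner(2).run(pipeline, data)

print(local_result)
print(sorted(partitioned_result) == sorted(local_result))
```

Both runners produce the same set of results from the same pipeline definition; only the output order differs, because the partitioned runner processes chunks independently. The sketch is limited to element-wise steps. Grouping, windowing, and streaming, which a real runner has to handle, are left out.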
Supporting Dataflow on Spark is all about giving customers freedom of choice when it comes to the back-end platform used for big data processing. It’s about not locking them into a specific architecture and sparing them from having to rewrite algorithms when they move to a different platform, according to Google Cloud Platform product manager William Vambenepe.
“Big data processing can take place in many contexts,” Vambenepe writes in the Google Cloud Platform blog. “Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost.
“The best deployment option,” he continues, “often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs–from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.”
At this point, the Spark runner for Dataflow supports only batch pipelines. Cloudera is working on extending Spark Streaming to support all the windowing functionality provided by Dataflow, says Josh Wills, Cloudera’s senior director of data science, in a separate blog post today. The runner supports Apache Spark version 1.2, which Cloudera supports in CDH 5.3.
“We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem,” Vambenepe says. “We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises.”
You can access the early prototype of the Spark Dataflow runner at github.com/cloudera/spark-dataflow.