January 20, 2015

Cloudera Teams with Google to Run Dataflow on Spark

Alex Woodie

Cloudera and Google today announced that they’re working together to get Dataflow–the big data pipeline model Google publicly launched last June–to run on Apache Spark, thereby giving customers more freedom to run their big data applications wherever they see fit.

Google Cloud Dataflow is a managed service for creating data pipelines that ingest, transform, and analyze massive amounts of data, in either batch or streaming modes, using the same SDK and API. The service is based on internal Google technologies like FlumeJava and MillWheel, and was introduced by Google last year as a successor to MapReduce that’s well-suited for massive ETL jobs.

Dataflow uses the concept of “runners” to determine where a given Dataflow program runs. It started off with two: the Google Cloud Dataflow runner, which executes the program on the Google Cloud Platform, and the Direct Pipeline runner, which executes it on the developer’s local machine.

With today’s announcement, Google is now supporting a third runner, for Spark. The software, which was developed by Cloudera Labs, allows the same Dataflow pipeline that a developer ran on a local PC or the Google Cloud Platform to run on a Spark cluster, either on-premises or in the cloud.
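To make the runner idea concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK or Spark: `Pipeline`, `LocalRunner`, `PartitionedRunner`, `map_step`, `filter_step`, and `build_pipeline` are all hypothetical names invented for illustration. The point it tries to show is only the separation described above: the pipeline is defined once, and the choice of runner decides where and how it executes.

```python
from typing import Callable, Iterable, List

# A step takes an iterable of elements and returns an iterable of elements.
Step = Callable[[Iterable], Iterable]


def map_step(fn: Callable) -> Step:
    """Hypothetical helper: apply fn to every element."""
    return lambda items: (fn(x) for x in items)


def filter_step(pred: Callable) -> Step:
    """Hypothetical helper: keep only elements for which pred is true."""
    return lambda items: (x for x in items if pred(x))


class Pipeline:
    """Hypothetical engine-independent pipeline: just an ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def apply(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self


class LocalRunner:
    """Hypothetical runner that executes every step in-process, in order."""

    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        for step in pipeline.steps:
            data = step(data)
        return list(data)


class PartitionedRunner:
    """Hypothetical stand-in for a cluster engine: splits the input into
    contiguous partitions and runs the full pipeline on each one."""

    def __init__(self, partitions: int = 2) -> None:
        self.partitions = partitions

    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        items = list(data)
        # Ceiling division so every element lands in some partition.
        size = max(1, -(-len(items) // self.partitions))
        results: list = []
        for start in range(0, len(items), size):
            chunk: Iterable = items[start:start + size]
            for step in pipeline.steps:
                chunk = step(chunk)
            results.extend(chunk)
        return results


def build_pipeline() -> Pipeline:
    """Parse strings to ints, keep the even ones, and square them."""
    return (
        Pipeline()
        .apply(map_step(int))
        .apply(filter_step(lambda n: n % 2 == 0))
        .apply(map_step(lambda n: n * n))
    )


if __name__ == "__main__":
    pipeline = build_pipeline()
    records = ["1", "2", "3", "4", "5", "6"]
    # The same pipeline object is handed to two different runners.
    print(LocalRunner().run(pipeline, records))
    print(PartitionedRunner(partitions=3).run(pipeline, records))
```

Because every step in this sketch works element by element, both runners produce the same result for the same input. A real engine also has to handle grouping, shuffles, and windowing, which this toy deliberately leaves out.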

Supporting Dataflow on Spark is all about giving customers freedom of choice when it comes to the back-end platform used for big data processing. It’s about not locking them into a specific architecture and avoiding having to rewrite algorithms, according to Google Cloud Platform product manager William Vambenepe.

“Big data processing can take place in many contexts,” Vambenepe writes in the Google Cloud Platform blog. “Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost.

“The best deployment option,” he continues, “often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs–from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.”

At this point, the Spark runner for Dataflow only supports batch pipelines. Cloudera is working on extending Spark Streaming to support all the windowing functionality provided by Dataflow, says Josh Wills, Cloudera’s senior director of data science, in a separate blog post today. The runner supports Apache Spark version 1.2, which Cloudera supports in CDH 5.3.

“We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem,” Vambenepe says. “We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises.”

You can access the early prototype of the Spark Dataflow runner at github.com/cloudera/spark-dataflow.

Applications: Data Mining

Technologies: Frameworks, Middleware

Sectors: Financial Services, Retail

Vendors: Cloudera, Google

Tags: apache spark, big data pipeline, Google Cloud Dataflow
