August 12, 2015

Google Releases Dataflow, Announces Partners

Google is taking the wraps off its Dataflow hosted cloud service while announcing a batch of partnerships and third-party developers as part of an effort to reduce the operational hurdles associated with traditional data analytics systems.

In announcing general availability of Dataflow, its big data pipeline model launched in June 2014, Google revealed four new Dataflow service integrators: ClearStory, Salesforce, SpringML and Tamr. It also announced software development kit (SDK) runners from DataArtisans and Cloudera. Cloudera announced in January that it would team with Google to run Dataflow on Apache Spark.

Google Cloud Dataflow is a managed service for creating data pipelines that ingest, transform and analyze massive amounts of data in either batch or streaming modes, using the same SDK and API. The service is based on internal Google technologies such as FlumeJava and MillWheel, and was introduced by Google last year as a successor to MapReduce.
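To make the "same SDK and API" idea concrete, the short sketch below shows a single transform applied unchanged to a bounded (batch) source and to an unbounded (streaming) source. It is plain Python written for illustration only and does not use the Dataflow SDK; the names count_words, batch_source and stream_source are hypothetical.

```python
import itertools
from collections import Counter
from typing import Iterable, Iterator, Tuple


def count_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Hypothetical transform: emit (word, running_count) for each word seen.

    It only iterates over its input, so it does not care whether the
    input is a finite list (batch) or an endless generator (streaming).
    """
    counts: Counter = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
            yield word, counts[word]


# Batch mode: a bounded, in-memory source (assumed sample data).
batch_source = ["a b", "a"]
batch_result = dict(count_words(batch_source))


# Streaming mode: an unbounded source that never ends on its own.
def stream_source() -> Iterator[str]:
    for line in itertools.cycle(["a b", "a"]):
        yield line


# Take only the first three emitted updates from the endless stream.
stream_result = dict(itertools.islice(count_words(stream_source()), 3))

print(batch_result)
print(stream_result)
```

The transform itself is written once; only the source changes between the two modes. A production pipeline would also need windowing and triggering to decide when results from an unbounded source are emitted, which this sketch leaves out.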

Dataflow is designed “to remove the complexity of developing separate systems for batch and streaming data sources by providing a unified programming model,” the company said in an Aug. 12 blog post announcing general availability of Dataflow. It also unveiled a real-time messaging service called Pub/Sub used to connect services to each other along with other Google APIs.

Dataflow also aims to reduce operational overhead related to large-scale cluster management and optimization, Google said.

Google’s Cloud Dataflow architecture combines batch and stream processing.

Tamr, Cambridge, Mass., said its data preparation integration with Dataflow is designed to allow analysts to independently gather and format new datasets drawn from on-premises and cloud infrastructure. The company said its tool would use Dataflow to help analysts publish massive datasets.

The combination is designed to simplify “how people access and use crucial data and distributed computing assets in the enterprise,” Andy Palmer, Tamr co-founder and CEO, added in a statement.

Salesforce said it would combine Dataflow with its Wave Analytics platform, which is designed to sift through massive amounts of customer data. Salesforce said the platform, running on the cloud service, would allow users to analyze data using a variety of devices.

Dataflow integrator SpringML said its library of offerings would use Dataflow for large data processing, ETL and visualization for Salesforce’s Wave Analytics.

ClearStory’s data-blending capabilities would complement Dataflow by incorporating automated semantic profiling of diverse data. That approach is used to simplify visualization for analysts and business users.

Meanwhile, Google said the Dataflow programming model will now run on Cloudera’s Spark distribution, allowing the same Dataflow program to execute on a Spark cluster in the cloud or on-premises.

Along with Apache Spark, Google said it would continue to support an alternative runner for streaming data via Apache Flink, which is billed by DataArtisans as a scalable stream-processing engine.

Google further claimed that cloud-based Dataflow runs up to three times faster than Hadoop when evaluating MapReduce-based data pipelines while also boosting resource utilization.
