April 16, 2015

Google Cloud Dataflow Now Open for Business

Google today formally took the wraps off Cloud Dataflow, the hosted offering designed to allow developers with average Java and Python skills to build sophisticated analytic “pipelines” that process huge amounts of data.

Google introduced Cloud Dataflow about a year ago as a next-gen platform for building systems that can ingest, transform, normalize, and analyze huge amounts of data, well into the exabyte range, Google executives said. The software is built on the infrastructure and technology that power Google’s own applications, including FlumeJava and MillWheel.

The idea behind Dataflow is simple: By concealing the underlying complexity of a big data setup behind straightforward SDKs and APIs, and by offloading the infrastructure elements to the Google Cloud, Google aims to put big data analytics within the reach of mere mortals rather than confining it to the realm of data scientist superheroes with million-dollar pedigrees.

“Up until now, big data has not been accessible to your average developer or your average analyst in an enterprise because it’s just too hard,” says Tom Kershaw, Google’s director of product management. “Big data is not easy to do. Spinning up Hadoop cluster and programming MapReduces and all of those things are something that only data scientists have been able to do.”

That’s held the market back, Kershaw says. “It’s also created this have-have not divide within the developer community around big data,” he tells Datanami. “So what we wanted to do–and what we’re focused on heavily right now–is making big data accessible to your average developer and your average systems analyst at enterprises. If you’re a developer and you have Java or Python skills, you should be able to write a big data application without having to spend five years studying Hadoop.”

Equipped with the Cloud Dataflow SDK, a developer should be able to create what is essentially a big ETL job that lives in the cloud. That job can tap into various data sources, automatically apply data transformations, and then place the results back into Google’s File System, or make them accessible to queries via a SQL engine such as Impala or, more likely, Google’s own BigQuery. The Dataflow programming model is portable, Google adds, “so customers can also run their pipelines on any Spark or Flink cluster if they choose.”
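The extract, transform, and load stages described above can be sketched in a few lines of plain Python. This is only an illustration of the pipeline idea, not the Cloud Dataflow SDK: the function names, the `device,latency_ms` record format, and the in-memory list standing in for a storage sink are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def read_source(lines: Iterable[str]) -> Iterator[Dict]:
    """Extract: parse raw 'device,latency_ms' lines (a made-up format)."""
    for line in lines:
        device, latency = line.strip().split(",")
        yield {"device": device, "latency_ms": float(latency)}


def normalize(records: Iterable[Dict]) -> Iterator[Dict]:
    """Transform: lowercase device names and convert milliseconds to seconds."""
    for record in records:
        yield {
            "device": record["device"].lower(),
            "latency_s": record["latency_ms"] / 1000,
        }


def load(records: Iterable[Dict], sink: List[Dict]) -> List[Dict]:
    """Load: append results to a sink (a list standing in for real storage)."""
    for record in records:
        sink.append(record)
    return sink


def run_pipeline(lines: Iterable[str]) -> List[Dict]:
    """Chain the three stages; records stream through one at a time."""
    return load(normalize(read_source(lines)), [])


raw = ["Phone-A,120", "TABLET-B,250"]
result = run_pipeline(raw)
print(result)
```

Because each stage is a generator, records flow through the chain one at a time, so the same stages could in principle consume a finite file or an unbounded feed.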

In addition to unveiling Dataflow as a beta (which is just about the same as a “GA” product at most other software vendors), Google also announced that it has added streaming analytic capabilities to Dataflow, enabling users to work with historical and real-time data in one place. For streaming data, Cloud Dataflow supports a publish-and-subscribe (PubSub) model, or Kafka if users prefer.

“We think the democratization of streaming is the big news here,” Kershaw says. “Up until now historical data and batch analyst have been separated. What we want to do is make streaming easy and flexible and normal…. We think that’s really going to change the way people approach [big data analytics] because they’ll start thinking about data as it occurs, co-existing with data that’s historical.”

Google also made some news regarding its flagship big data analytics engine, BigQuery. For the first time, BigQuery customers can control where their data sits. This is especially important for European customers, who face regulations regarding where their own clients’ data is stored.

BigQuery is getting lots of traction among customers of all sizes, Kershaw says. “It uses traditional Google technology on the back end, but the front-end is SQL,” he says. That means you can use it to run SQL statements, or simply plug a SQL-speaking BI tool, such as Tableau or Qlik, into it.

“BigQuery is a unique product that’s done entirely well in the market,” Kershaw says. “Large-scale customers use it to run device analytics across millions of devices. But it’s also used by small developers who want to look at log files and be able to optimize their infrastructure.”

Google Cloud Dataflow is now available in beta. Pricing starts at $0.01 per Google Compute Engine Unit (GCEU) per hour, while streaming workloads cost $0.015 per GCEU per hour. For more information, see cloud.google.com/dataflow.
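As a rough illustration of what those rates mean, the sketch below multiplies the quoted hourly rate by the number of GCEUs and the hours a job runs. It assumes the $0.01 starting rate applies to non-streaming (batch) jobs and that billing is a simple product of the three figures; the job sizes are hypothetical, and actual billing may include other charges.

```python
# Hourly rates quoted in the article, in USD per GCEU per hour.
BATCH_RATE = 0.01       # assumed to be the non-streaming starting rate
STREAMING_RATE = 0.015


def estimate_cost(gceus: int, hours: float, streaming: bool = False) -> float:
    """Estimate job cost in USD as rate x GCEUs x hours (a simplification)."""
    rate = STREAMING_RATE if streaming else BATCH_RATE
    return round(rate * gceus * hours, 2)


# Hypothetical job: 100 GCEUs running for 10 hours.
batch_cost = estimate_cost(100, 10)
streaming_cost = estimate_cost(100, 10, streaming=True)
print(f"batch: ${batch_cost:.2f}, streaming: ${streaming_cost:.2f}")
```

Under these assumptions, the streaming job costs 50 percent more than the same job run in batch mode.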

Related Items:

Cloudera Teams with Google to Run Dataflow on Spark

Google Re-Imagines MapReduce, Launches DataFlow

Google Bypasses HDFS with New Cloud Storage Option
