February 24, 2016

Google Releases Cloud Processor For Hadoop, Spark


Google took the wraps off of its managed Apache Hadoop and Spark service this week, saying its cloud data processing platform is intended to reduce the cost and ease management of processing big datasets.

Cloud Dataproc, which moved from beta testing to general availability on Monday (Feb. 22), is designed to quickly spin up clusters that can be resized from three to 300 nodes, according to Google (Nasdaq: GOOG, GOOGL). The automation tool is intended to shift users’ focus from data processing toward data analysis. Background cluster processing is the best way to achieve that balance, the search giant argues.

Google Cloud Dataproc entered beta testing last fall. Trial customers created clusters ranging in size from three to “thousands” of virtual CPUs, the company said. Dataproc clusters can be spun up as needed. Integration with Google Cloud makes Dataproc clusters independent of its storage platform.

Google said it has also integrated its BigQuery and Cloud Bigtable capabilities with Dataproc. Dataproc can also be used in conjunction with Google Cloud Dataflow for real-time batch and stream processing.

Several features were added to the processing platform during beta testing, including data “property tuning,” virtual machine metadata and tagging, and cluster versioning. This week’s release also included support for custom machine types, the company added. (Dataproc clusters are built on Google Compute Engine instances. Machine types define the virtualized hardware resources available to an instance.)

Reducing cost and complexity related to data processing are two key goals for Dataproc. “Using Spark and Hadoop should not break the bank [and] you should pay for what you actually use,” the cloud vendor stressed in a blog post. Hence, it is pricing Cloud Dataproc at 1 cent per virtual CPU in a cluster per hour.
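To make the quoted rate concrete, the short Python sketch below estimates the Dataproc fee for a cluster run. It is an illustration only: the function name and the example cluster (10 nodes, 4 virtual CPUs each, 2.5 hours) are hypothetical, and the calculation covers only the 1-cent rate quoted above. Charges for the underlying Compute Engine instances and any billing-increment rules are not modeled.

```python
from decimal import Decimal

# Rate quoted in the article: 1 cent per virtual CPU in a cluster per hour.
DATAPROC_RATE_PER_VCPU_HOUR = Decimal("0.01")


def dataproc_fee(nodes: int, vcpus_per_node: int, hours: Decimal) -> Decimal:
    """Return the estimated Dataproc fee in US dollars for one cluster run.

    Hypothetical helper for illustration; it is not part of any Google API.
    """
    if nodes < 0 or vcpus_per_node < 0 or hours < 0:
        raise ValueError("inputs must be non-negative")
    total_vcpus = nodes * vcpus_per_node
    return DATAPROC_RATE_PER_VCPU_HOUR * total_vcpus * hours


# Hypothetical example: 10 nodes with 4 vCPUs each, running for 2.5 hours.
fee = dataproc_fee(nodes=10, vcpus_per_node=4, hours=Decimal("2.5"))
print(f"Estimated Dataproc fee: ${fee:.2f}")
```

Under these assumptions, 40 virtual CPUs running for 2.5 hours come to 100 vCPU-hours, or $1.00 at the quoted rate.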

Another goal is to speed up data processing using Hadoop and Spark. Google claimed Dataproc clusters start and stop operations in 90 seconds or less. Hence, users spend more time analyzing data than waiting on clusters.

Meanwhile, the cluster versioning feature provides access to stable versions of Spark and Hadoop, Google added.

This week’s release also includes image version 1.0.0 to support Hadoop 2.7.2, Spark 1.6.0, Hive 1.2.1 and Pig 0.15.0 releases. Google stressed that the provision of updated and native versions of Hadoop, Hive, Pig and Spark eliminates the need for new tools or APIs. It also means existing projects and ETL pipelines can be moved to its data processing platform without redevelopment.

Google also announced Dataproc support from third-party tool vendors and service partners. Tool partners include Arimo, Attunity, Looker, WANdisco and Zoomdata. New service partners include Moser, Pythian and Tectonic.

The release adds momentum to the enterprise shift toward Spark, which brings with it management challenges related to resource constraints and data silos. Hence, Google stressed that Cloud Dataproc is designed to increase availability while automating cluster administration.
