June 25, 2014

Google Re-Imagines MapReduce, Launches DataFlow

Alex Woodie

It’s well known in the industry that more than 10 years ago Google invented MapReduce, the technology at the heart of first-generation Hadoop. It’s less well known that Google moved away from MapReduce several years ago. Today at its Google I/O 2014 conference, the Web giant unveiled a possible successor to MapReduce called Dataflow, which it’s selling through its hosted cloud service.

Google Cloud Dataflow is a managed service for creating data pipelines that ingest, transform, and analyze massive amounts of data, up into the exabyte range. The same work done in the Dataflow SDK can be used for either batch or streaming analytics, Google says. Dataflow is based internal Google technologies like Flume and MillWheel, and can be thought of as a successor to MapReduce that’s especially well-suited for massive ETL jobs.

Urs Hölzle, Google’s senior vice president of technical infrastructure and a Google Fellow, described Dataflow during a keynote address at today’s Google I/O conference in San Francisco.

“We’ve done analytics at scale for a while and we’ve learned a few things. For one we don’t really use MapReduce anymore. It’s great for simple jobs but it gets too cumbersome as you build pipelines, and really everything is an analytics pipeline these days,” Hölzle says.

“So what we needed was a new analytic system that scales to exabytes and makes it really easy to write pipelines, that optimizes these pipelines for you and that lets you use the same code for both batch and streaming analytics. Today we’re annoying just that with Cloud Dataflow.”

The new service lets developers construct an analytics workflow and then send it off to the Google Cloud for execution. “Cloud Dataflow is an SDK and a managed service for building big and fast parallelized data analysis pipelines,” Google engineer Eric Schmidt (not the CEO) said during a demo. “You write a program; there’s a logical set of data transformations to specify your analysis. You then submit the pipeline to the data flow service, and it handles all optimization deployment of VMs, scheduling, and monitoring for you.”

What’s more, the parallelization is optimized for the developer, who doesn’t have to worry about the whole analytics process as much as they would using MapReduce. “Dataflow provides powerful built-in primitives for doing MapRedue-like and continual computation operations,” Schmidt says. “It handles all the aging out of data, shuffling etc. I don’t have to worry about that.”

During the demo, Schmidt showed an application that analyzed the sentiment of soccer fans during World Cup matches, expressed via Twitter. Here’s how it works: Dataflow grabs the tweets, converts the JSON into objects, calls Google Translate if it’s not in English, and then scores the sentiment expressed in those words by calling AlchemyAPI. (By the way, in addition to extracting entities from language, AlchemyAPI has done some fascinating work with image classification using GPUs; check out our write-up on the company from May).

“Two lines of code to create sliding windows and average all the tweets over that window,” Schmidt says. “We’re reducing a massive collection of data down to one record per minute per team.”

Google isn’t in the software business per se, so you can’t go out and buy the software for your Hadoop cluster. You can, however, move your analytics to the cloud. In fact, if you want to get your hands on Dataflow, that is what you’ll have to do.

A lot has changed since Google created MapReduce a decade ago to help solve its data classification problem for Web search. Twitter and Facebook weren’t even invented yet. Much of the big data sets that organizations like Google want to analyze now come from social media, and developers are moving quickly to come up with alternatives to MapReduce.

“Information’s is being generated at an incredible rate and of course we want you to be able to analyze that information without worrying about scalability,” Hölzle says. “And today even when using MapReduce, which we invented over a decade ago, it’s still cumbersome to write and maintain analytics pipelines. And if you want streaming analytics, you’re out of luck. And if most instance, once you have more than a few petabytes, they kid of break down….I hope you understand now why we stopped using MapReduce years ago.”

One technology that’s emerged as a potential alternative to MapReduce is Apache Spark, which touts the same sort of interactive and real-time analytic capabilities as DataFlow, but from an open source API, as opposed to a proprietary API like Google’s. The 2014 Spark Summit starts Monday at the St. Francis Hotel in San Francisco. But today it was Google’s show down the road at the Moscone Center.

Can Google Harness Big Data to Ward Off Death?

Thinking in 10x and Other Google Directives

Applications: Complex Event Processing, Data Mining

Technologies: Cloud

Sectors: Retail

Vendors: google, Google Cloud

Tags: google, Hadoop, mapreduce

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Google Re-Imagines MapReduce, Launches DataFlow

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 23, 2024

April 22, 2024

April 19, 2024

April 18, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Top 6 Strategies for Reducing Data Warehouse Costs

Building an Operational Data Warehouse for Real-time Analytics

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

AI & Big Data Expo North America 2024

AI Hardware & Edge AI Summit Europe

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Google Re-Imagines MapReduce, Launches DataFlow

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 23, 2024

April 22, 2024

April 19, 2024

April 18, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link