October 18, 2012

How Google’s Dremel Makes Quick Work of Massive Data


The ability to process more data and the ability to process data faster are usually mutually exclusive. According to Armando Fox, professor of computer science at the University of California, Berkeley, “the more you do one, the more you have to give up on the other.”

Hadoop, an open-source batch-processing platform built around MapReduce, is one of the main vehicles organizations are driving in the big data race.

However, Mike Olson, CEO of Cloudera, an important Hadoop-based vendor, is looking past Hadoop and toward today’s research projects. One of them is Dremel, possibly Google’s next big innovation, which combines the scale of Hadoop with the speed that the business intelligence world increasingly demands.

“People have done Big Data systems before,” Fox said, “but before Dremel, no one had really done a system that was that big and that fast.”

The key to Dremel’s speed, according to the Google paper detailing the project, is its columnar storage. Per Google, “Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduce CPU cost due to cheaper compression.”

Even in MapReduce, Google suggests that a columnar approach to storage would be more efficient and not very difficult to operate. Their sample “column-striping algorithm” splits records into columns in just 22 lines of code. According to the paper, striping the files into columns cuts running time by almost an order of magnitude for jobs run over 3,000 nodes.

Google equates that to going from running in hours to minutes. Meanwhile, those same jobs are completed another order of magnitude faster when using Dremel (going from minutes to seconds).
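As a rough illustration of why striping helps, the sketch below stripes a few flat records into per-field columns and counts how many values a single-field aggregate has to touch in each layout. It is a minimal sketch, not the paper’s column-striping algorithm: the records, the field names, and the helper functions are all made up for this example, and real storage would also involve compression and disk I/O.

```python
# Illustrative only: flat records with hypothetical field names.
records = [
    {"url": "http://a.example", "country": "us", "bytes": 120},
    {"url": "http://b.example", "country": "de", "bytes": 300},
    {"url": "http://c.example", "country": "us", "bytes": 80},
]


def stripe(rows):
    """Turn a list of row dicts into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def values_read_row_layout(rows):
    """A row layout has to read every field of every row."""
    return sum(len(row) for row in rows)


def values_read_column_layout(columns, wanted_fields):
    """A column layout reads only the columns the query names."""
    return sum(len(columns[field]) for field in wanted_fields)


columns = stripe(records)

# Aggregate over a single field, as an analytic query typically would.
total_bytes = sum(columns["bytes"])

row_reads = values_read_row_layout(records)
column_reads = values_read_column_layout(columns, ["bytes"])
```

In this toy case the column layout touches 3 values where the row layout touches 9. The gap widens as records gain fields that a given query never looks at.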

Of course, columnar storage isn’t the only thing driving Dremel’s speed; otherwise columnar MapReduce and Dremel would be equivalent. Google also points to the language in which queries are written, a high-level SQL-based language that does not have to be translated into MapReduce form. Per the paper, “In contrast to layers such as Pig and Hive, it executes queries natively without translating them into MR jobs.”

Further, and possibly just as important, Dremel borrows its architecture from that of large-scale distributed search engines (which Google may know a thing or two about).

It should be noted that Google intends Dremel as a complement to, not a replacement for, MapReduce and Hadoop. According to the paper, Dremel is frequently used to analyze MapReduce results or serve as a test run for large-scale computations. “Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time.” As noted before, Dremel experimentally surpassed MapReduce by orders of magnitude.

One of Dremel’s advantages is also a potential drawback. Whenever parallel processing takes place across many nodes, in this case from one thousand to four thousand, there will inevitably be nodes that fall behind or fail entirely. Google calls these “stragglers,” and they can significantly increase query response time, from under a minute to several minutes. However, this problem can be avoided if reading the vast majority (99%) of the data, rather than the entire set, is acceptable.

Per the paper, “If trading speed against accuracy is acceptable, a query can be terminated much earlier and yet see most of the data... The bulk of a web-scale dataset can be scanned fast. Getting to the last few percent within tight time bounds is hard.”
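The trade-off can be sketched with a toy model in which every tablet is scanned in parallel and the query returns as soon as a chosen percentage of tablets has finished. This is only an illustration under stated assumptions: the function name, the parallel-scan model, and the timings are invented for this example and are not measurements or code from the paper.

```python
def time_to_percent(tablet_seconds, percent):
    """Return when `percent` of tablets are done, assuming all scan in parallel."""
    if not tablet_seconds or not 0 < percent <= 100:
        raise ValueError("need at least one tablet and 0 < percent <= 100")
    ordered = sorted(tablet_seconds)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    needed = (percent * len(ordered) + 99) // 100
    return ordered[needed - 1]


# Made-up timings: 99 tablets finish in 2 seconds, one straggler takes 180.
tablet_seconds = [2.0] * 99 + [180.0]

full_scan = time_to_percent(tablet_seconds, 100)   # waits for the straggler
early_stop = time_to_percent(tablet_seconds, 99)   # returns without it
```

With these invented numbers, waiting for every tablet takes as long as the single straggler, while stopping at 99% returns as soon as the ordinary tablets are done.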

It was always unlikely that a system could analyze a large amount of data quickly without some sacrifice. But in the long run, a small hit in accuracy may be a small price to pay if Dremel can deliver on the scale and velocity fronts.

Discussion

There is 1 discussion item posted.

Nice
Submitted by makhojaye on Oct 20, 2012 @ 2:55 PM EDT


Nice. Thanks for sharing. It would be more helpful, if you can post some references and links of the papers related to Dremel Performance Benchmark results.
http://muhammadkhojaye.blogspot.com
