October 18, 2012

How Google’s Dremel Makes Quick Work of Massive Data

Ian Armas Foster

The ability to process more data and the ability to process data faster are usually mutually exclusive. According to Armando Fox, professor of computer science at the University of California at Berkeley, “the more you do one, the more you have to give up on the other.”

Hadoop, an open-source, batch processing platform that runs on MapReduce, is one of the main vehicles organizations are driving in the big data race.

However, Mike Olson, CEO of Cloudera, an important Hadoop-based vendor, is looking past Hadoop and toward today’s research projects. Those include one named Dremel, possibly Google’s next big innovation, which combines the scale of Hadoop with the speed that the business intelligence world increasingly demands.

“People have done Big Data systems before,” Fox said, “but before Dremel, no one had really done a system that was that big and that fast.”

The key to Dremel’s speed, according to the Google paper detailing the project, is its columnar storage. Per Google, “Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduce CPU cost due to cheaper compression.”

Even in MapReduce, Google suggests that a columnar approach to storage would be more efficient and not very difficult to operate. Their sample “column-striping algorithm” splits records into columns over just 22 lines of code. According to the paper, striping the files into columns represents a time reduction of almost an order of magnitude for jobs run over 3000 nodes.
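To make the idea of column striping concrete, here is a minimal sketch in Python. It is not the algorithm from Google's paper, which handles nested records; it assumes flat records, and the field names and the helpers `stripe` and `column_sum` are hypothetical. It shows why a query that touches one field only has to read that field's column.

```python
def stripe(records):
    """Split row-oriented records into one list per field (column).

    Simplified illustration for flat records only; missing fields
    are stored as None so every column keeps the same length.
    """
    fields = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)
    return {name: [record.get(name) for record in records] for name in fields}


def column_sum(columns, name):
    """Aggregate a single column without touching any other field."""
    return sum(value for value in columns[name] if value is not None)


# Hypothetical sample data.
records = [
    {"url": "a.example", "country": "us", "bytes": 120},
    {"url": "b.example", "country": "de", "bytes": 300},
    {"url": "c.example", "country": "us"},
]

columns = stripe(records)
total_bytes = column_sum(columns, "bytes")
```

In a real columnar store each column would live in its own region on disk, so the sum above would skip the `url` and `country` data entirely, and values of the same type stored together tend to compress better.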

Google equates that to going from running in hours to minutes. Meanwhile, those same jobs are completed another order of magnitude faster when using Dremel (going from minutes to seconds).

Of course, columnar storage isn’t the only thing driving Dremel’s speed; otherwise columnar MapReduce and Dremel would be equivalent. Google also points to the query language, a high-level SQL-based language whose queries do not have to be translated into MapReduce form. According to the paper, “In contrast to layers such as Pig and Hive, it executes queries natively without translating them into MR jobs.”

Further, and possibly just as important, Dremel borrows its architecture from that of large-scale distributed search engines (which Google may know a thing or two about).
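The article does not spell out that architecture. The Dremel paper describes it as a multi-level serving tree, in which a root server fans a query out to lower levels and combines the partial results that come back. A rough sketch of that pattern, assuming a simple count query and a hypothetical tuple-based tree layout, might look like this:

```python
def count_matching(node, predicate):
    """Recursively count matching rows in a toy serving tree.

    A node is a (kind, payload) tuple: a "leaf" holds rows to scan,
    a "node" holds child nodes whose partial counts are summed.
    This layout is an assumption made for illustration.
    """
    kind, payload = node
    if kind == "leaf":
        return sum(1 for row in payload if predicate(row))
    return sum(count_matching(child, predicate) for child in payload)


# Hypothetical two-level tree with four leaf servers.
tree = ("node", [
    ("node", [("leaf", [3, 8, 15]), ("leaf", [22, 4])]),
    ("node", [("leaf", [9, 30]), ("leaf", [1])]),
])

matches = count_matching(tree, lambda value: value > 8)
```

Here the recursion runs in one process. In a distributed system each leaf would scan its own slice of the data in parallel, and only small partial counts would travel back up the tree.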

It should be noted that Google intends Dremel as a complement to, not a replacement for, MapReduce and Hadoop. According to the paper, Dremel is frequently used to analyze MapReduce results or serve as a test run for large-scale computations. “Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time.” As noted before, Dremel experimentally surpassed MapReduce by orders of magnitude.

One of Dremel’s advantages is also a potential drawback. Whenever parallel processing takes place across many nodes, in this case from one thousand to four thousand, some nodes will inevitably fall behind or fail entirely. Google calls these “stragglers,” and they can increase the query response time from under a minute to several minutes. However, the problem can be largely avoided if reading the vast majority (99%) of the data, rather than the entire set, is acceptable.

Per the paper, “If trading speed against accuracy is acceptable, a query can be terminated much earlier and yet see most of the data… The bulk of a web-scale dataset can be scanned fast. Getting to the last few percent within tight time bounds is hard.”
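The trade-off can be illustrated with a small simulation. The sketch below is not Dremel's implementation; the helper `time_to_percent` and the timings are invented for illustration. It computes how long a query takes if it may return once a given percentage of its data slices have been scanned.

```python
def time_to_percent(finish_times, percent):
    """Return the time at which `percent` of slices have finished.

    `finish_times` holds one completion time (in seconds) per slice;
    `percent` is an integer from 1 to 100.
    """
    if not finish_times or not 1 <= percent <= 100:
        raise ValueError("need at least one slice and 1 <= percent <= 100")
    ordered = sorted(finish_times)
    # Ceiling division in integers avoids floating-point rounding.
    needed = (percent * len(ordered) + 99) // 100
    return ordered[needed - 1]


# Hypothetical timings: 99 slices finish in under 5 seconds,
# while a single straggler takes 3 minutes.
finish_times = [1.0 + 0.04 * i for i in range(99)] + [180.0]

full_scan = time_to_percent(finish_times, 100)
partial_scan = time_to_percent(finish_times, 99)
```

With these made-up numbers, waiting for every slice takes 180 seconds because of the one straggler, while stopping at 99% returns in under 5 seconds.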

It was always unlikely that a system could analyze a large amount of data quickly without some sacrifice. But in the long run, a small hit in accuracy may be a small price to pay if Dremel can deliver on the scale and velocity fronts.
