The ability to process more data and the ability to process data faster are usually mutually exclusive. According to Armando Fox, professor of computer science at the University of California at Berkeley, “the more you do one, the more you have to give up on the other.”
Hadoop, an open-source batch-processing platform built around MapReduce, is one of the main vehicles organizations are driving in the big data race.
However, Mike Olson, CEO of Cloudera, an important Hadoop-based vendor, is looking past Hadoop and toward today’s research projects. One of them is Dremel, possibly Google’s next big innovation, which combines the scale of Hadoop with the speed that the business intelligence world increasingly demands.
“People have done Big Data systems before,” Fox said, “but before Dremel, no one had really done a system that was that big and that fast.”
The key to Dremel’s speed, according to the Google paper detailing the project, is its columnar storage. Per Google, “Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduce CPU cost due to cheaper compression.”
Even in MapReduce, Google suggests that a columnar approach to storage would be more efficient and not very difficult to operate. The paper’s sample “column-striping algorithm” splits records into columns in just 22 lines of code. According to the paper, striping the files into columns cuts running time by almost an order of magnitude for jobs run over 3000 nodes.
Google equates that to going from running in hours to minutes. Meanwhile, those same jobs are completed another order of magnitude faster when using Dremel (going from minutes to seconds).
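To make the columnar idea concrete, here is a minimal sketch in Python. It is not the paper’s column-striping algorithm, which handles nested records; it assumes flat records, and the function name stripe and the sample fields (url, country, bytes) are invented for illustration. It shows why a query that touches one field only has to read that field’s column.

```python
from collections import defaultdict


def stripe(records):
    """Turn a list of flat dict records into one list of values per field.

    Hypothetical helper: missing fields are stored as None so every
    column keeps one entry per record.
    """
    columns = defaultdict(list)
    # Collect every field name that appears in any record.
    fields = sorted({name for record in records for name in record})
    for record in records:
        for name in fields:
            columns[name].append(record.get(name))
    return dict(columns)


# Row-oriented input: each record carries all of its fields together.
records = [
    {"url": "a.example", "country": "us", "bytes": 120},
    {"url": "b.example", "country": "de", "bytes": 300},
    {"url": "c.example", "country": "us", "bytes": 80},
]

columns = stripe(records)

# An aggregate over "bytes" reads one column and skips "url" and "country".
total_bytes = sum(columns["bytes"])
```

In a real system each column would sit in its own region of storage, so the skipped fields are never read from disk, and values of the same type stored side by side tend to compress well.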
Of course, columnar storage isn’t the only thing driving Dremel’s speed; otherwise columnar MapReduce and Dremel would be equivalent. Google also points to the language in which queries can be made, a high-level SQL-based language that does not have to be translated into MapReduce form. “In contrast to layers such as Pig and Hive, it executes queries natively without translating them into MR jobs.”
Further, and possibly just as important, Dremel borrows its architecture from that of large-scale distributed search engines (which Google may know a thing or two about).
It should be noted that Google intends Dremel as a complement to, not a replacement for, MapReduce and Hadoop. According to the paper, Dremel is frequently used to analyze MapReduce results or serve as a test run for large-scale computations. “Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time.” As noted before, Dremel experimentally surpassed MapReduce by orders of magnitude.
One of Dremel’s advantages is also a potential drawback. Whenever parallel processing takes place across many nodes, in this case between one thousand and four thousand, some nodes will inevitably fall behind or fail entirely. Google calls these “stragglers,” and they can stretch query response time from under a minute to several minutes. However, the problem can be sidestepped if reading the vast majority of the data (99%) is acceptable in place of the entire set.
Per the paper, “If trading speed against accuracy is acceptable, a query can be terminated much earlier and yet see most of the data... The bulk of a web-scale dataset can be scanned fast. Getting to the last few percent within tight time bounds is hard.”
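The trade-off can be illustrated with a small simulation in Python. This is a sketch under stated assumptions, not Dremel’s actual mechanism: the function scan_until, the shard layout, and all of the timings and row counts are invented for illustration. It models a query that returns as soon as a chosen percentage of shards have reported, and leaves the stragglers behind.

```python
def scan_until(shards, min_percent=99):
    """Simulate returning once min_percent of shards have finished.

    shards: list of (finish_seconds, row_count) pairs, one per shard.
    Returns (elapsed_seconds, rows_seen, fraction_of_rows_seen).
    All names and numbers here are hypothetical.
    """
    if not shards:
        return 0.0, 0, 1.0
    # Ceiling division in integers avoids floating-point rounding surprises.
    needed = (min_percent * len(shards) + 99) // 100
    # Shards report in order of finish time; keep only the first `needed`.
    finished = sorted(shards)[:needed]
    elapsed = finished[-1][0] if finished else 0.0
    rows_seen = sum(rows for _, rows in finished)
    total_rows = sum(rows for _, rows in shards)
    fraction = rows_seen / total_rows if total_rows else 1.0
    return elapsed, rows_seen, fraction


# 99 well-behaved shards finish in 2 seconds; one straggler takes 3 minutes.
shards = [(2.0, 1000)] * 99 + [(180.0, 1000)]

full_scan = scan_until(shards, min_percent=100)
partial_scan = scan_until(shards, min_percent=99)
```

With these made-up numbers, the full scan waits 180 seconds for the single straggler, while the 99% scan returns after 2 seconds having seen 99,000 of the 100,000 rows.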
A system that can analyze a large amount of data quickly was never likely to come without sacrifices. But in the long run, a small hit in accuracy may be a small price to pay if Dremel can deliver on the scale and velocity fronts.