May 22, 2012

The Sound and The NoSQL Fury

The signal-to-noise ratio in the NoSQL world is so poor that it's hard to figure out what's going on, or even who has something new. For all the talk of performance in the NoSQL world, much of what is genuinely new and exciting isn't about performance at all.

Take, for example, MongoDB, which has a really great data model, and MapReduce, which offers a very handy scripting language.  These are genuine and probably long-lasting contributions.  Their innovation is in finding new languages for interacting with data.  That, literally, is what the "NoSQL" in NoSQL is about.

The confusion comes, for me, when we get to the performance side of the equation.  Let me say at the outset that I'm focusing on very large data, which I define as data much larger than available RAM.  In that regime, you can get a pretty good estimate of performance by counting how many disk seeks a workload will induce.  That's not the whole story, but it's a good start for discussion, since a disk seek is orders of magnitude slower than other operations.
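As a rough, hypothetical illustration of that way of thinking (the seek time and cache-hit rate below are assumed for the example, not measured anywhere):

```python
# Back-of-the-envelope model: when data is much larger than RAM, workload
# time is dominated by disk seeks. All numbers here are illustrative
# assumptions, not measurements.

SEEK_MS = 10.0  # cost of one random disk seek, in milliseconds (assumed)

def workload_seconds(operations, cache_hit_rate):
    """Estimate runtime if every cache miss costs exactly one seek."""
    misses = operations * (1.0 - cache_hit_rate)
    return misses * SEEK_MS / 1000.0

# 1M operations with 90% of target pages cached: ~1000 seconds of pure seeking.
print(workload_seconds(1_000_000, 0.90))
```

Crude as it is, this kind of seek counting predicts the random-versus-sequential behavior discussed next.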

In this context, there is no a priori reason that replacing SQL will do anything for your performance.  Berkeley DB is quite a nice implementation of a B-tree, and you can integrate it with MySQL or run it stand-alone.

In one case it's SQL; in the other, it's NoSQL.  If you insert a key-value pair (which in the SQL world we call a row) into a larger-than-memory index, you'll avoid a disk seek in both cases if the target leaf is cached, and otherwise you won't.  So a random-insertion workload will cause lots of seeks, and a sequential-insertion workload will cause very few.  The interface language, whether it speaks a SQL or NoSQL protocol, simply doesn't come into it.
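The random-versus-sequential gap can be seen in a toy simulation. The leaf count, cache size, and fanout below are invented for illustration; only the contrast between the two access patterns matters:

```python
import random

# Toy model of an index whose leaves mostly don't fit in RAM. Every access
# to an uncached leaf costs one seek; the cache is a simple FIFO.

NUM_LEAVES = 1_000_000   # leaves in the larger-than-memory index (assumed)
CACHE_LEAVES = 10_000    # leaves that fit in RAM (assumed)
FANOUT = 100             # key-value pairs per leaf (assumed)

def count_seeks(leaf_sequence):
    """Count disk seeks under a FIFO cache of recently touched leaves."""
    cache, order, seeks = set(), [], 0
    for leaf in leaf_sequence:
        if leaf not in cache:
            seeks += 1
            cache.add(leaf)
            order.append(leaf)
            if len(cache) > CACHE_LEAVES:
                cache.discard(order.pop(0))
    return seeks

n = 100_000
# Sequential inserts land on consecutive leaves, each absorbing FANOUT rows:
print(count_seeks(i // FANOUT for i in range(n)))                    # few seeks
# Random inserts scatter across the whole index, mostly missing the cache:
print(count_seeks(random.randrange(NUM_LEAVES) for _ in range(n)))   # ~one seek per insert
```

Nothing in the model knows or cares whether the insert arrived via a SQL statement or a NoSQL put.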

So what about those performance claims? Unfortunately, buzzy technologies often get strung together, with unintended consequences.  For example, I was once asked what I thought about combining Hadoop with SSDs.  Two great buzzwords that go great together?  Hadoop makes it easy to scan through all the data.  It's perfect for those situations where no index is going to help -- say, if you are in full exploration mode (e.g., what are the top five most popular values in the second field?) or if most queries really do touch every row … I mean, key-value pair.  As such, its I/O characteristics are limited by bandwidth rather than by I/Os per second (IOPS).  SSDs, on the other hand, are devices that deliver inexpensive IOPS and very expensive bandwidth.  So, although it's great to try out combinations of the latest and greatest, it can also be costly, and buzzwords are no substitute for analysis.
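A back-of-the-envelope way to see the mismatch. Both devices below are hypothetical, with made-up prices and specs; the point is the shape of the ratio, not the numbers:

```python
# Scan-heavy workloads like Hadoop's care about dollars per unit of
# bandwidth; seek-heavy workloads care about dollars per IOPS. All prices
# and device specs here are assumptions for illustration.

hdd = {"price_usd": 100.0, "mb_per_s": 150.0, "iops": 150.0}
ssd = {"price_usd": 600.0, "mb_per_s": 300.0, "iops": 50_000.0}

for name, dev in (("HDD", hdd), ("SSD", ssd)):
    per_bandwidth = dev["price_usd"] / dev["mb_per_s"]
    per_kiops = dev["price_usd"] / (dev["iops"] / 1000.0)
    print(f"{name}: ${per_bandwidth:.2f} per MB/s, ${per_kiops:.2f} per 1000 IOPS")
```

With these assumed figures, the SSD's IOPS are dramatically cheaper but its bandwidth is several times more expensive, which is exactly the wrong trade for a bandwidth-bound scanner.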

Notice, too, that it's important to carve out which part of the performance pie a specialty solution addresses.  Hadoop, like MapReduce, is good for looking at all the data.  If you have a query load that benefits from an index, however, such systems will be crushingly slow compared to an indexed system.  Back in the early days at Google, I used the precursor to MapReduce, which we called the Ripper, since we used it to rip through all the web pages in a crawl repository.  And what was the Ripper used for the most at Google?  To build indexes!  The overwhelming majority of Google's workload benefits from indexing, and I suspect that the overwhelming majority of all workloads do too.
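The arithmetic behind "crushingly slow" is simple: a scan touches every block once, while a B-tree lookup touches only a root-to-leaf path. The table size and block fanout below are assumed values chosen for illustration:

```python
import math

# Full scan vs. indexed lookup, counted in block I/Os. N and B are
# illustrative assumptions, not any particular system's parameters.

N = 10**9    # rows in the table (assumed)
B = 100      # rows per disk block (assumed)

scan_ios = N // B                        # every block read once: 10,000,000
lookup_ios = math.ceil(math.log(N, B))   # B-tree height, ~5 block reads

print(scan_ios, "I/Os per scan vs", lookup_ios, "per indexed lookup")
```

A six-orders-of-magnitude gap per selective query is why a scan-only system loses badly on index-friendly workloads, even though it wins when every query really must touch everything.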

So how do you get performance gains in a general-purpose system?  You can get rid of features that slow you down, move to faster hardware, or have a genuine algorithmic breakthrough.  Algorithmic breakthroughs that have an impact on practice are very rare.  Log-structured Merge Trees (LSMs), used by Cassandra, speed up writes.  But LSMs have slow queries.  Cassandra uses Bloom filters to speed up point queries, but range queries remain slow.  Fractal Tree indexes are as fast as LSMs for insertions and as fast as B-trees for queries, so they dominate both in terms of performance.
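The shape of that tradeoff can be sketched with the textbook external-memory bounds. Treat these as big-O shapes with constants dropped (the exact bounds vary by LSM variant), evaluated at assumed parameters, not as a benchmark of any product:

```python
import math

# Per-operation block-I/O bounds, big-O with constants dropped. Variants
# differ in details; N and B are illustrative assumptions.

N = 10**9    # items in the index (assumed)
B = 1024     # items per block (assumed)

btree_insert = math.log(N, B)                     # one root-to-leaf path
btree_query = math.log(N, B)
lsm_insert = math.log2(N) / B                     # amortized over large merges
lsm_query = math.log2(N)                          # roughly one probe per level
fractal_insert = math.log(N, B) / math.sqrt(B)    # buffered, batched flushes
fractal_query = math.log(N, B)                    # B-tree-like search path

for name, ins, qry in [("B-tree", btree_insert, btree_query),
                       ("LSM", lsm_insert, lsm_query),
                       ("Fractal tree", fractal_insert, fractal_query)]:
    print(f"{name:12s} insert ~{ins:.3f}  query ~{qry:.2f} I/Os")
```

Under these assumptions the write-optimized structures insert for a small fraction of an I/O each, but only the Fractal Tree keeps the B-tree's short query path, which is the dominance claim in the paragraph above.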

The greatest confusion around NoSQL performance has to do with removing standard features of a SQL database.  For example, each transaction in an ACID-compliant system requires a write to disk.  That gives you durability, which is the D in ACID.  There are optimizations that speed up transactions, but it’s even faster not to guarantee durability at all!  If you can save up all the changes to your system until you have enough of them, then you can amortize the cost of writing them to disk and get much faster throughput.
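The amortization arithmetic is worth making explicit. The 10 ms cost of a durable write is an assumed figure for illustration:

```python
# Group-commit arithmetic: one durable disk write per *batch* of changes
# instead of one per transaction. The sync cost is assumed for illustration.

SYNC_MS = 10.0   # cost of one durable write (seek + flush), assumed

def throughput_tps(batch_size):
    """Transactions per second when each batch shares one durable write."""
    return batch_size * 1000.0 / SYNC_MS

print(throughput_tps(1))     # sync after every transaction: 100.0 tps
print(throughput_tps(100))   # batch 100 transactions:       10000.0 tps
```

Batching buys a 100x throughput gain here, but note the catch developed next: until the batch is written, those transactions are not durable.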

If you happen to have a system that can tolerate data loss after a crash, then not paying for durability is a great deal.  But if you need durability, using a database that doesn't guarantee it is a bad idea.  That shifts the burden of durability from the database to the application, which makes the application more expensive to build, and chances are the aggregate system performance will be worse than if the database handled durability itself.  After all, databases didn't always have nice features like ACID; ACID got added because the physical-storage layer was the right place to implement such guarantees.

If I'm right, when the NoSQL kerfuffle settles down, there will be NoSQL users who benefit from one of the new data models or scripting languages, maybe even running on top of a SQL database in some cases, and there will be users who don't need all the features and guarantees of a standard database.  In other words, I'm predicting that NoSQL will come to be seen as a set of great specialty solutions, and that some of its best ideas will be incorporated into general-purpose databases.  Once that happens, people will wonder what all the fuss about API languages was about.

About the Author

Dr. Martín Farach-Colton is an expert in algorithms and information retrieval. He was an early employee at Google, where he worked on improving the performance of the Web Crawl and where he developed the prototype for the AdSense system. He is a founder and CTO of Tokutek, a company commercializing Fractal Tree index technology. Dr. Farach-Colton is also a Professor of Computer Science at Rutgers University.



