Language Flags

Translation Disclaimer

HPCwire HPC in the Cloud Digital Manufacturing Report Green Computing Report
ISC'13

May 22, 2012

The Sound and The NoSQL Fury


The signal-to-noise ratio in the NoSQL world has made it hard to figure out what’s going on, or even who has something new. For all the talk of performance in the NoSQL world, much of the most exciting part of what’s new is really not about performance at all.  

Take for example, MongoDB, which has a really great data model and MapReduce has a very handy scripting language.  These are genuine and probably long-lasting contributions.  Their innovation is all about finding a new language to use for interacting with data.  They are about NoSQL.

The confusion comes, for me, when we get to the performance side of the equation.  Let me just say at the outset that I’m focusing on very large data, which I define to be data that is much larger than available RAM.  In such cases, you can get a pretty good idea of performance if you think about how many disk seeks a workload will induce.  That’s not the whole story, but it’s a good start for discussion since disk seeks are so much slower than other operations.

In this context, there is no a priori reason that replacing SQL is going to do anything to your performance.  Berkeley DB is quite a nice implementation of a B-tree, and you can integrate it with MySQL or run it stand-alone.  

In one case, it’s SQL and the other case it’s NoSQL.  If you insert a key value pair (which in the SQL world we call a row) into a larger-than-memory index, you’ll avoid a disk seek in both cases if the target leaf is cached, and otherwise you won’t.  So a random-insertion workload will cause lots of seeks and a sequential-insertion workload will cause very few.  The interface language, whether it is a SQL or NoSQL protocol, simply doesn’t come in to it.

So what about those performance claims? Unfortunately, buzzy technologies often get strung together, often with unintended consequences.  For example, I was once asked what I thought about combining Hadoop with SSDs.  Two great buzz words that go great together?  Hadoop makes it easy to scan through all the data.  It’s perfect for those situations where no index is going to help -- say if your are in full exploration mode (e.g. What are the top five most popular values in the second field?) or if most queries really do touch every row … I mean key, value pair.  As such, its I/O characteristics are going to be limited by bandwidth, rather than I/Os per second (IOPS).  SSDs, on the other hand, are devices that deliver inexpensive IOPS and very expensive bandwidth.  So, although it’s great to try out combinations of the latest and greatest, it can also be costly and buzzwords are no substitute for analysis.

And notice that it’s important to carve out which parts of the performance pie a specialty solution is going to work on.  Hadoop, like MapReduce, is good for looking at all the data.  If you have a query load that benefits from an index, however, such systems will be crushingly slow compared to an indexed system.  Back in the early days at Google, I used the precursor to MapReduce, which we called the Ripper, since we used to it rip through all the web pages in a crawl repository.  And what was the Ripper used for the most at Google?  To build indexes!  The overwhelming majority of Google’s workload benefits from indexing, and I suspect that the overwhelming majority of all workloads do too.

So how do you get performance gains in a general-purpose system?  You can get rid of features that slow you down, go to faster hardware, or have a genuine algorithmic breakthrough.  Algorithmic breakthroughs that have an impact on practice are very rare.  Log-structured Merge Trees (LSM) are used by Cassandra to speed up writes.  But LSMs have slow queries.  Cassandra uses Bloom Filters to speed up point queries, but range queries remain slow.  Fractal Tree Indexes are as fast as LSMs for insertions and as fast as B-trees for queries, so they dominate both in terms of performance.

The greatest confusion around NoSQL performance has to do with removing standard features of a SQL database.  For example, each transaction in an ACID-compliant system requires a write to disk.  That gives you durability, which is the D in ACID.  There are optimizations that speed up transactions, but it’s even faster not to guarantee durability at all!  If you can save up all the changes to your system until you have enough of them, then you can amortize the cost of writing them to disk and get much faster throughput.

If you happen to have a system that can tolerate data loss after a system crash, then not paying for durability is a great deal.  But if you need durability, using a database that doesn’t guarantee durability is a bad idea.  That would shift burden from the database to the application.  This makes it more expensive to code the application, and chances are, the aggregate system performance will be worse than if the database were to handle the durability.  After all, databases didn’t always have nice features like ACID.  ACID got added to databases because the physical-storage layer was the right place to implement such features.

If I’m right, when the NoSQL kerfuffle settles down, there will be NoSQL users who benefit from one of the new data models or scripting languages, maybe even running on top of a SQL database in some cases, and there will be users who don’t need all the features and guarantees of a standard database.  In other words, I’m predicting that NoSQL will be seen as a set of  great specialty solutions and that some of the best ideas will be incorporated into general purpose databases.  Once that happens, people will wonder what all the fuss about API languages was about.


About the Author

Dr. Martín Farach-Colton is an expert in algorithmic and information retrieval. He was an early employee at Google, where he worked on improving performance of the Web Crawl and where he developed the prototype for the AdSense system. He is a founder and CTO of Tokutek, a company that is commercializing Fractal Tree index technology. Dr. Farach-Colton is also a Professor of Computer Science at Rutgers University.

 

Share Options


Subscribe

» Subscribe to our weekly e-newsletter


Discussion

There are 0 discussion items posted.

 
Cray CS300-LC

Sponsored Links

Sponsored Whitepapers

Best Practices in Big Data Storage - Sponsored by Cleversafe, Cray, DDN, NetApp, & Panasas

05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas

From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.

Download this Whitepaper...

Big Data, Big Brains – Sponsored By NetApp

04/22/2013 | NetApp

Big data has proven to be one of the most promising yet challenging technologies for both government and industry. But, before IT leaders can harness the full potential of big data, there are key issues to address surrounding infrastructure, storage, personnel, and training.
MeriTalk surveyed 17 visionary big data leaders to find out what they see as the big data challenges and opportunities as well as how government can best leverage big data. Download the “Big Data, Big Brains Report”.

Download this Whitepaper...

View the White Paper Library

Sponsored Multimedia

SGI President and CEO, Jorge Titinger, on Big Data

SGI President and CEO, Jorge Titinger, talks about SGI's history and leadership in HPC and how that has converged into Big Data Solutions.

View Multimedia

Cray CS300-AC Cluster Supercomputer Air Cooling Technology Video

The Cray CS300-AC cluster supercomputer offers energy efficient, air-cooled design based on modular, industry-standard platforms featuring the latest processor and network technologies and a wide range of datacenter cooling requirements.

View Multimedia

More Multimedia



Job Bank

Datanami Conferences Ad

Featured Events

May 22-23, 2013
Business Intelligence Innovation Summit
Chicago, IL
United States

June 4-4, 2013
The Economist's Information Forum
San Francisco, CA
United States

June 10-13, 2013
Cloud & Big Data Expo
New York City, NY
United States

June 19-20, 2013
GigaOM Structure
San Francisco, CA
United States

June 26-27, 2013
2013 Hadoop Summit
San Jose, CA
United States

June 26-27, 2013
Big Data World Congress
London
United Kingdom

» View/Search Events

» Post an Event