September 6, 2013

AeroSpike Says Secret to NoSQL Speed is Simplicity

Alex Woodie

AeroSpike made news with its NoSQL database earlier this year when it was able to maintain between 300,000 and 400,000 transactions per second on a benchmark test, about 10 times better than Apache Cassandra and MongoDB. The secret to this speed, the company says, is simplicity.

Okay, there’s a little more to it than mere simplicity. Even the folks at AeroSpike would give you that. In the final analysis, if AeroSpike’s claims are true, it is engineering and design acumen that will get credit for the scalability and performance that AeroSpike demonstrated on the Thumbtack Technology benchmark.

AeroSpike is a NoSQL database designed for high-throughput, big data applications, particularly those Internet applications that place a high value on real-time processing, as opposed to the massive after-the-fact analysis that Hadoop does so well.

The defining characteristic of this key-value, row-based, NoSQL data store is its hybrid in-memory architecture. AeroSpike indexes are stored in DRAM for fast access, while the data itself is stored on Flash-based solid state disks (SSDs) for fast reads and writes. (The company gives customers the option of storing data on traditional hard disks, but most choose the SSDs.)
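To make the hybrid layout concrete, here is a minimal sketch of the general idea: a small in-memory index that maps each key to the location of its value in a file on storage. The class name `HybridStore` and its methods are hypothetical and chosen for illustration; this is a toy under simple assumptions, not a description of how AeroSpike is actually implemented.

```python
import os
import tempfile


class HybridStore:
    """Toy key-value store: index held in memory, values in an append-only file."""

    def __init__(self, path):
        self.index = {}                # key -> (offset, length), kept in RAM
        self.data = open(path, "a+b")  # values live on storage (SSD in the article)

    def put(self, key, value: bytes):
        # Append the value to the end of the file and remember where it landed.
        self.data.seek(0, os.SEEK_END)
        offset = self.data.tell()
        self.data.write(value)
        self.data.flush()
        self.index[key] = (offset, len(value))

    def get(self, key):
        # One in-memory lookup, then one small read from storage.
        offset, length = self.index[key]
        self.data.seek(offset)
        return self.data.read(length)

    def close(self):
        self.data.close()
```

In this sketch a read costs one dictionary lookup plus one positioned read, which is cheap on a device with fast random access. Overwriting a key simply appends a new copy and repoints the index; reclaiming the stale bytes is left out.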

The company says it built its database specifically to run on systems equipped with Flash SSDs. This includes using small block reads and large block writes, and parallelizing the work across multiple SSDs for throughput. It also added expiration and eviction functions that ensure continued performance, and warm restart capabilities that provide high availability across commodity x86 clusters.

All this design and engineering was put into place with a single focus in mind: to drive the highest possible read and write performance and scalability out of the cheapest SSD-based servers around.

Simply using a Flash SSD as a replacement for spinning disk does not lead to the highest levels of performance, AeroSpike founder and CTO Brian Bulkowski explained in a recent webinar. Instead, to get the most bang for the Flash SSD buck, changes must be made in the database itself.

“The reason is that old database, old relational systems, are optimized to avoid rotational disk seeks. They’re optimized to create streaming patterns, to group like kinds of data with like data,” he said.

Using Flash SSDs with traditional relational database technology yields a performance gain of three to four times, he said. That may sound like a lot, but considering that Flash SSDs cost on the order of a hundred times more than spinning disk, the math does not add up on a purely per-gigabyte basis.

“However if you really work to optimize your database architecture and your technology stack to use the random access capabilities of SSDs, to properly code for multithreaded storage access, you find that databases can be built, like AeroSpike, which are 10 to 100 times faster, instead of just three to four times faster,” Bulkowski said.
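Bulkowski's argument can be put as back-of-the-envelope arithmetic. The sketch below divides each quoted speedup by the quoted cost multiple to get a rough performance-per-dollar figure relative to spinning disk. The function `perf_per_cost` is a hypothetical helper, and treating "on the order of a hundred times" as exactly 100 is an assumption made purely for illustration.

```python
def perf_per_cost(speedup, cost_multiple):
    """Performance per storage dollar, relative to a spinning-disk baseline of 1.0."""
    return speedup / cost_multiple


# Assumption: "on the order of a hundred times" is taken as exactly 100.
SSD_COST_MULTIPLE = 100

# (low, high) speedups quoted in the article.
scenarios = {
    "relational database on SSD": (3, 4),
    "SSD-optimized database": (10, 100),
}

for name, (low, high) in scenarios.items():
    worst = perf_per_cost(low, SSD_COST_MULTIPLE)
    best = perf_per_cost(high, SSD_COST_MULTIPLE)
    print(f"{name}: {worst:.2f}x to {best:.2f}x of disk's performance per dollar")
```

Under these assumptions, only the upper end of the 10-to-100-times range brings Flash level with spinning disk on this crude measure, which is why a three-to-four-times gain falls well short.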

In a recent blog post titled “Simplicity–The Secret to Scaling,” AeroSpike showed how this approach to designing database software can lead to simplicity on the hardware side. Instead of assembling massive clusters of x86 servers, AeroSpike enables customers to get strong performance on relatively modest setups.

The company compared how an SSD-based AeroSpike setup would fare against in-memory setups using Memcached, a distributed memory caching system, and Redis, an in-memory key-value data store, in two different scenarios. The first comparison involved a 1TB database processing 50,000 transactions per second (TPS). AeroSpike was able to handle the load with just three servers, versus 14 servers using Memcached or Redis.

The second comparison (a video ad serving platform) had much bigger requirements: a 5TB database processing 500,000 TPS. The hybrid SSD-DRAM setup running the AeroSpike database was able to handle the load with just 14 servers, at a total cost of $322,000. The DRAM-heavy NoSQL alternative required 186 servers at a cost of $5.6 million.
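The figures in the second comparison are easier to weigh when broken down per server and per transaction. The short calculation below uses only the numbers quoted above; the dictionary layout and the `summarize` helper are illustrative choices, not anything taken from AeroSpike's blog post.

```python
TARGET_TPS = 500_000  # required throughput in the second scenario

# Server counts and total costs as quoted in the article.
setups = {
    "AeroSpike (SSD + DRAM)": {"servers": 14, "cost": 322_000},
    "DRAM-only NoSQL": {"servers": 186, "cost": 5_600_000},
}


def summarize(setup):
    """Derive per-server and per-transaction figures from the totals."""
    return {
        "cost_per_server": setup["cost"] / setup["servers"],
        "cost_per_tps": setup["cost"] / TARGET_TPS,
        "tps_per_server": TARGET_TPS / setup["servers"],
    }


summaries = {name: summarize(s) for name, s in setups.items()}
for name, s in summaries.items():
    print(f"{name}: ${s['cost_per_server']:,.0f} per server, "
          f"${s['cost_per_tps']:.2f} per TPS, "
          f"{s['tps_per_server']:,.0f} TPS per server")

cost_ratio = setups["DRAM-only NoSQL"]["cost"] / setups["AeroSpike (SSD + DRAM)"]["cost"]
server_ratio = setups["DRAM-only NoSQL"]["servers"] / setups["AeroSpike (SSD + DRAM)"]["servers"]
print(f"DRAM-only setup: {cost_ratio:.1f}x the cost, {server_ratio:.1f}x the servers")
```

On these numbers the DRAM-only cluster costs roughly 17 times as much and needs roughly 13 times as many servers, so most of the gap comes from server count rather than from the price of each machine.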

“The goal was to compare using the ‘sweet spot’ of DRAM, vs the ‘sweet spot’ of Flash,” Bulkowski wrote in the blog. “For DRAM right now, buying servers with about 196GB seemed the best choice in terms of total cost per byte. As you buy larger servers, the density is higher…but the cost per byte goes up.”
