February 13, 2012

Big Data and The SSD Mystique

Josh Goldstein

SSDs, and more generally non-volatile solid-state technology (typified today by flash), are a hot topic in big data, not to mention in the data center at large.

The attention is justified, as there hasn’t been such a transformative technology on the horizon for several years.  With several established and startup companies developing competing architectures and solutions based around SSDs, is there room for another startup?  Aren’t we well covered between PCIe flash cards, SSDs, flash caching solutions, flash tiers in storage arrays, and all-flash application accelerators?

The answer is no: we are not well covered.  There is still a massive gap in the market waiting to be filled.  The challenges that need to be addressed are twofold.  First, true scalability with flash media has not yet been achieved.  Second, flash has not yet been used in innovative ways to improve the storage experience beyond performance.

Let’s dive into the first challenge.  The solutions on the market today are a big improvement over disk-based storage.  After all, moving from tens of thousands of IOPS to hundreds of thousands of IOPS is an order-of-magnitude improvement (not to mention the commensurate drop in latency).

This is very exciting for today’s end-users.  But history shows us that when new computing technology emerges, it doesn’t take long for people to figure out how to use the performance gains, and then to be left wanting even more.  Think about it.  Will you be forever happy with the servers and networking you have today?  Or do you want them to improve year after year to keep up with new and perhaps unpredictable demands?  The same is true for storage.  Today’s order-of-magnitude improvement will be quickly absorbed and deemed insufficient tomorrow.

The current generation of systems-level SSD products has not been engineered with this in mind.  The landscape consists exclusively of “scale-up” architectures that require forklift upgrades whenever performance limits are reached.  SSDs have such immense performance potential that the scale-up model cannot be sustained.

Storage processing at the array controller level inevitably becomes a bottleneck.  The long-term model for success is a “scale-out” design, where individual building blocks are clustered together in a common system, allowing capacity and performance to be added dynamically, and if well designed, without limit.  With scale-out architectures, today’s need for hundreds of thousands of IOPS can be met while still providing for tomorrow’s need for millions (and someday billions) of IOPS.
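A back-of-the-envelope model makes the controller bottleneck concrete.  The sketch below is illustrative only: the function names, the per-SSD figure, and the per-controller limit are assumed numbers chosen for the example, not measurements of any product, and the scale-out function assumes ideal linear scaling with no cluster overhead.

```python
def scale_up_iops(num_ssds, iops_per_ssd, controller_limit):
    """Deliverable IOPS of one array: the drives' raw total, capped by the controller."""
    return min(num_ssds * iops_per_ssd, controller_limit)


def scale_out_iops(num_nodes, ssds_per_node, iops_per_ssd, controller_limit):
    """Deliverable IOPS of a cluster of identical building blocks.

    Assumes ideal linear scaling: each node contributes its own
    controller-capped IOPS and clustering adds no overhead.
    """
    per_node = scale_up_iops(ssds_per_node, iops_per_ssd, controller_limit)
    return num_nodes * per_node


if __name__ == "__main__":
    # Hypothetical figures: 50,000 IOPS per SSD, 300,000 IOPS per controller.
    print(scale_up_iops(8, 50_000, 300_000))      # 300000 (controller-bound)
    print(scale_up_iops(16, 50_000, 300_000))     # 300000 (more drives, no gain)
    print(scale_out_iops(4, 8, 50_000, 300_000))  # 1200000 (four building blocks)
```

In this toy model, doubling the drives behind a single controller adds nothing once the controller is saturated, while adding building blocks raises the ceiling itself.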

The second opportunity for startups in the SSD arena arises in the software stack.  The current crop of SSD products is made up of adaptations of designs that originated years ago in hard disk-based arrays.  There are valid gains to be had by replacing HDDs with SSDs, but doing so doesn’t truly unlock the potential that SSDs have to offer.  End-users quickly discover that their array now reaches its performance potential with fewer drives populated, but the performance limits themselves have not changed.  In order to deliver the full potential of SSDs, an entirely new software architecture must be developed.  Only then can the performance of arrays full of SSDs be delivered.

Performance is only one aspect to consider.  Every capability in the storage software stack must also be reexamined.  Storage architects and administrators have learned over decades that there are certain “truths” in how storage systems behave, in their capabilities, and in their limitations. 

These “truths” are rooted in software architectures designed for hard drives, and will not change simply by substituting SSDs.  The next big leap in storage system capabilities will come from thinking creatively about not just the performance potential of SSDs, but also how their unique properties can be exploited.  Everything from data protection to efficiency to ease-of-use to array-based copy services can be dramatically improved.

Big data is an area that can truly benefit from storage innovation based on SSDs.  Many organizations are currently constrained in the types of analytics they’re able to perform because it is impossible or uneconomical to perform the queries using today’s storage technology – even when accelerated by SSDs (they hit the array controller limits and then cannot scale out).  Or worse, the queries take too long to complete and the results are out of date before they can be used.

We have talked to retailers, telcos, financial institutions, and government entities that need to process data in real-time, in increasing volumes, and with more complex query structures in order to detect fraud, price products, or analyze quickly changing trends.  What they envision simply cannot be achieved on today’s technology, but if given the tools, entirely new use cases open up.  It may be hard to believe that somebody could make productive use of millions of IOPS, but the latent desire is there waiting to be unleashed.  Application developers and IT architects will quickly adopt these new storage and data processing technologies when they are brought to market.

Another key factor to consider with SSDs is whether to pursue a server-centric or storage-centric approach.  The server-centric approach typically involves PCIe-based flash cards installed in hosts and is analogous to direct-attached storage (DAS).  This is a great approach when the data sets are small enough to fit completely on the PCIe card and when expandability, data protection, disaster recovery, high availability, and the need to share the data set are not concerns.

Storage-centric designs place SSDs in a shared array where the resource can be accessed by multiple hosts, and in more advanced designs, is also well protected from failures and highly available.  The advantages and drawbacks are essentially the same as DAS vs. SAN with traditional disk arrays.  In the end, there is no one correct model and the application environment, performance and availability profile, and data set sizes and growth rates must all be considered.

The level of virtualization in a data center is a key determinant both of the need to use SSD technology and of whether to use a server-centric or storage-centric model.  In general, the more heavily virtualized the environment and the more CPUs and cores per host, the more attractive SSDs and a storage-centric design become.

Virtualization running on multi-core CPUs creates highly random workloads as seen by storage devices, even if individual guest operating systems and applications are reading and writing sequentially.  For big data applications where multiple hosts need to process common data sets, a storage-centric SSD design may be the only viable choice.
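The randomizing effect of virtualization can be seen in a small simulation.  This is a simplified sketch under stated assumptions: each guest reads a contiguous run of blocks from its own region of the device, and the host interleaves the guests’ requests in strict round-robin order.  Real hypervisors schedule I/O in more complex ways, and the function names and block addresses here are made up for illustration.

```python
def guest_stream(start_block, length):
    """Block addresses read by one guest: a purely sequential run."""
    return list(range(start_block, start_block + length))


def interleave(streams):
    """Merge guest streams round-robin, one request per guest per turn.

    Assumption: strict round-robin; zip() stops at the shortest stream.
    """
    merged = []
    for turn in zip(*streams):
        merged.extend(turn)
    return merged


def sequential_fraction(blocks):
    """Fraction of requests that address the block right after the previous one."""
    if len(blocks) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1)
    return hits / (len(blocks) - 1)


if __name__ == "__main__":
    # Four hypothetical guests, each reading 100 blocks from its own region.
    guests = [guest_stream(start, 100) for start in (0, 1000, 2000, 3000)]
    print(sequential_fraction(guests[0]))          # 1.0 as seen inside one guest
    print(sequential_fraction(interleave(guests))) # 0.0 as seen by the device
```

Every guest in the model is perfectly sequential, yet the device never sees two adjacent blocks in a row.  That is the pattern that penalizes hard drives with seeks and that flash handles comparatively well.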

The future is very bright for SSDs, and the best of what this technology has to offer is still to come.  Fast forward a few years and we can be virtually guaranteed that applications will exist that we couldn’t have imagined today, made possible by the performance potential of SSDs and by the architectures innovative startups are creating right now to unleash that potential in the data center.

About the Author

Josh Goldstein is Vice President of Marketing and Product Management at XtremIO, a provider of 100% flash scale-out enterprise SAN storage arrays.  XtremIO is currently in customer trials.  For more information, contact info@xtremio.com.