June 6, 2012

Object Storage and the Unstructured Data Boom

Datanami Staff

Storage vendors are scrambling like mad to wrap their current offerings around a big data message, a feat that isn't exactly incredible given the scale of data, but one that gets trickier when addressing the complexity and speed of working with that data.

One of the more outspoken voices in big data storage is Tom Leyden, who runs the trade show and blog route touting the advantages of object storage for big data.

Leyden directs alliances and marketing at Belgian object storage company Amplidata, following a career at other startups that helped pioneer cloud computing technologies (including a stint at Q-Layer, which was among the first to build an IaaS platform, a feat that landed the company in acquisition territory when Sun spotted it).

The object storage guru is heading on the road for next week’s Cloud Expo event in New York City (June 11-14) where he is set to talk about how object storage is tackling demands in unstructured data-heavy industries including entertainment, “big science” and retail.

His assertion is that unstructured data is becoming the most prevalent type of data at large organizations and that companies facing the challenges of the parade of unstructured petabytes are looking beyond the mighty file system.

In advance of his talk about why he believes object storage presents a more flexible and scalable solution, we caught up with him to dig deeper into why he thinks object storage will house the next generation of massive unstructured data.

First, please explain your assertion that object storage is a more flexible, more scalable solution than traditional storage technologies, and one that can often be provided at far lower cost.

Object storage is the way Facebook, Google, and Amazon store data and scale massively. Reports say that Amazon is storing 600 billion objects. The average enterprise may not require this scalability today, but with data storage requirements projected to grow by a factor of 30 over the next decade, and with 80% of that being large data files such as office documents, movies, music, and pictures, many organizations will soon be coping with similar demands.

We have documented a 50 to 70% reduction in storage capacity at a major online network that moved to an object storage system[1] – reducing the typical requirement of 250% overhead with redundant copies placed in the cloud down to only 60% overhead. The same system also delivered a 50% reduction in storage footprint compared to mirrored RAID. At the same time, this system met an exceptionally stringent availability policy, and with power consumption at less than 3.5 watts per TB of data, this particular object storage system also allowed the IT department to save on energy and data cooling costs.
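The overhead figures above can be sanity-checked with simple arithmetic. The sketch below compares full replication against an erasure-coding policy; the 10-data/6-parity split is an assumption chosen only because it happens to yield the 60% overhead mentioned, not a policy stated in the article.

```python
# Illustrative arithmetic only: overhead = raw capacity consumed beyond
# one logical copy of the data, expressed as a fraction of the data size.

def replication_overhead(copies: int) -> float:
    """n full copies cost (n - 1) extra copies of raw capacity."""
    return copies - 1.0

def erasure_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Erasure coding stores data + parity chunks; only parity is overhead."""
    return parity_chunks / data_chunks

# 3.5 full copies ~ the "250% overhead" replication scenario cited above.
print(f"{replication_overhead(3):.0%}")      # 3 replicas -> 200% overhead
print(f"{erasure_overhead(10, 6):.0%}")      # 10+6 coding -> 60% overhead
```

Note that the erasure-coded layout can survive the loss of any 6 of the 16 chunks, whereas 3 replicas survive only 2 complete losses, which is where the "better protection at lower overhead" claim comes from.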

Not surprisingly, just about every major data storage vendor has adopted some kind of object storage implementation into a packaged storage system offering, particularly for use in cloud-facing applications or when scalability is imperative.  That said, the devil is always in the details – and that’s where enterprise IT managers have to read beyond the buzzwords.

1) The above results reflect an object storage system that is well suited to the user's needs – in this case, for Big Unstructured Data applications. The user employed a data protection algorithm – erasure coding – that drastically reduces the overhead required to provide high availability and can be used with very power-efficient hardware. One could say that those are the only true object stores, as RAID does not allow you to create one single scalable storage pool (you would be putting RAID systems together with some layer on top).

The implementation also provides the customer with the flexibility of using various availability policies, which is often critical.

2) Achieving strong results from object storage also presupposes that the object stores themselves were properly designed. At a major vendor’s user conference recently,  for example, a product manager told me the company is working on building an object layer on top of an existing NAS line. In my view, that doesn’t make sense. True object storage is designed as one uniformly scalable storage pool – one namespace – that can be deployed as small as a few hundred terabytes or as big as hundreds of petabytes. It’s hard to deploy a zettabyte-scale system just to demonstrate it is possible, but that should be the ultimate objective. 

We noticed that during your talk you’re going to hit on three main points. We wondered if you could provide more details about these and their impact on unstructured big data-driven organizations?

1)      Applications access the data directly through a REST API, eliminating common file system limits, and rendering the file system obsolete.

The first file systems were not designed for petabytes of data – bytes and kilobytes were a lot of data back then, and gigabytes probably sounded pretty sci-fi. File systems played a very important role in the evolution of the computer industry, mainly by enabling the use of directories. Today, however, most companies' directories are not that organized anymore because we have too much data. But that doesn't matter anymore, because there are so many applications out there that can do this for us. Take Google Docs, Picasa and iTunes, for example. Each lets you store and share your documents, photos or music, and organize them in collections. And no matter how you organize your stuff, Docs or Picasa or iTunes will find it for you. There is hardly any role left for the file system.

For businesses the situation is similar. Applications in the cloud are increasingly popular, so a lot of business data is already stored in a public or private object store. Many business applications still need a file system interface – for now, that is. If the current data growth continues, a lot of file systems will hit their scalability limits. And here object storage will play a very important role, as object storage platforms (at least the good ones) have been designed to scale out big.

Most times, it just makes more sense to have the application talk directly to the storage, which is what object storage does. The REST interface – a protocol that allows applications to read and write data without a file system in between, in a very straightforward way – makes it all very simple. And fast. And economically feasible.
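The access pattern being described – flat keys, simple verbs, no directory tree to traverse – can be sketched with a toy in-memory store. This is a hypothetical illustration of the model, not any real product's API; real object stores expose these verbs over HTTP (PUT, GET, HEAD), as S3-style services do.

```python
# Toy object store illustrating the REST-style model: a flat namespace
# of keys mapped to (data, metadata), with no directory hierarchy.

class ObjectStore:
    def __init__(self):
        self._objects = {}  # flat namespace: key -> (bytes, metadata dict)

    def put(self, key: str, data: bytes, **metadata):
        """Analogue of HTTP PUT: store an object under a flat key."""
        self._objects[key] = (data, metadata)

    def get(self, key: str) -> bytes:
        """Analogue of HTTP GET: fetch the object body."""
        return self._objects[key][0]

    def head(self, key: str) -> dict:
        """Analogue of HTTP HEAD: fetch only the metadata."""
        return self._objects[key][1]

store = ObjectStore()
# The key may *look* like a path, but it is just a name: there is no
# directory tree to create, walk, or keep consistent.
store.put("photos/2012/cloud-expo.jpg", b"<jpeg bytes>",
          content_type="image/jpeg")
print(store.head("photos/2012/cloud-expo.jpg"))
```

Because lookup is a single key access rather than a path traversal, the namespace can grow without the metadata bottlenecks that hierarchical file systems hit at scale.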

2)      Object Storage was designed to scale uniformly beyond petabytes, independent of the underlying hardware.

No sane IT person wants to do data migration, particularly with petabyte-scale architectures. Think of data migration like trying to paint the Golden Gate Bridge: once you are at the end, you need to start painting on the other side. Yet the bridge is constantly getting longer as your data grows.

Properly designed object storage systems with the right algorithms (erasure coding is one) allow customers to add new hardware, which is automatically integrated into the pool without reconfiguration, and to decommission old hardware without needing to migrate the data. They perform these tasks automatically in the background. Once the bridge is painted, when you add a new section, you simply paint what's new.
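One common way to get this "only paint what's new" property is consistent hashing: when a node joins the pool, only the objects that now map to it are moved, while everything else stays put. This is a generic sketch of that placement idea, not a description of Amplidata's actual algorithm, which the interview does not detail.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted((_h(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def owner(self, key: str) -> str:
        """First node clockwise from the key's position owns the object."""
        i = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]

old = Ring(["node-a", "node-b", "node-c"])
new = Ring(["node-a", "node-b", "node-c", "node-d"])  # add one node

keys = [f"object-{i}" for i in range(10_000)]
moved = sum(old.owner(k) != new.owner(k) for k in keys)
print(f"{moved / len(keys):.0%} of objects moved")  # roughly 1/4, not all
```

With naive modulo placement, adding a fourth node would remap nearly every object; here only about a quarter move, which is what makes background rebalancing feasible at petabyte scale.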

3) Object Storage, particularly optimized object storage platforms, is more efficient thanks to drastically reduced overhead and much lower power consumption

Using erasure coding technology as one example – this object storage implementation is more efficient because it does not copy data to protect it. Instead, it stores data as equations, spread over the whole system. To read an object, the system needs only a selection of the equations, not every single one. If a disk fails, the system generates new equations, again spread over the entire system.
This brings two benefits: 

Less overhead for better protection – not only from higher numbers of drive failures, but also against the failure of entire storage modules.

The ability for the storage nodes to leverage cheaper, lower-power processors, such as Intel's Atom chips used in mobile devices, to rebuild disks without performance loss.
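The "store data as equations" idea above can be shown in miniature with a single XOR parity chunk: any one lost chunk is re-derived from the survivors. This sketch tolerates only one failure; the systems described use stronger codes with multiple parity chunks (so they survive several simultaneous drive or module failures), but the recovery principle is the same.

```python
from functools import reduce

def _xor(chunks):
    """XOR a list of equal-length byte chunks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def encode(data_chunks):
    """Append one parity chunk: the XOR of all data chunks."""
    return list(data_chunks) + [_xor(data_chunks)]

def recover(stored, lost_index):
    """Rebuild any single lost chunk (data or parity) from the rest."""
    survivors = [c for i, c in enumerate(stored) if i != lost_index]
    return _xor(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 equal-size data chunks
stored = encode(data)                # 4 chunks, e.g. one per drive
print(recover(stored, 1))            # "drive 1" fails; its chunk is rebuilt
```

The overhead here is one parity chunk for three data chunks (about 33%), versus 200% for three full replicas – the same trade-off, in miniature, as the 60%-versus-250% figures quoted earlier in the interview.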