Data Lake Showdown: Object Store or HDFS?
The explosion of data is causing people to rethink their long-term storage strategies. Most agree that distributed systems, one way or another, will be involved. But when it comes down to picking the distributed system, be it a file-based system like HDFS or an object-based store such as Amazon S3, the agreement ends and the debate begins.
The Hadoop Distributed File System (HDFS) has emerged as a top contender for building a data lake. The scalability, reliability, and cost-effectiveness of Hadoop make it a good place to land data before you know exactly what value it holds. Combine that with the ecosystem growing around Hadoop and the rich tapestry of analytic tools that are available, and it’s not hard to see why many organizations are looking at Hadoop as a long-term answer for their big data storage and processing needs.
At the other end of the spectrum are today’s modern object storage systems, which can also scale out on commodity hardware and deliver storage costs measured in the cents-per-gigabyte range. Many large Web-scale companies, including Amazon, Google, and Facebook, use object stores to give them certain advantages when it comes to efficiently storing petabytes of unstructured data measuring in the trillions of objects.
But where do you use HDFS and where do you use object stores? In what situations will one approach be better than the other? We’ll try to break this down for you a little and show the benefits touted by both.
Why You Should Use Object-Based Storage
According to the folks at Storiant, a provider of object-based storage software, object stores are gaining ground among large companies in highly regulated industries that need greater assurances that no data will be lost.
“They’re looking at Hadoop to analyze the data, but they’re not looking at it as a way to store it long term,” says John Hogan, Storiant’s vice president of engineering and product management. “Hadoop is designed to pore through a large data set that you’ve spread out across a lot of compute. But it doesn’t have the reliability, compliance, and power attributes that make it appropriate to store it in the data lake for the long term.”
Object-based storage systems such as Storiant’s offer superior long-term data storage reliability compared to Hadoop for several reasons, Hogan says. For starters, they use a type of algorithm called erasure coding that spreads the data out across any number of commodity disks. Object stores like Storiant’s also build spare drives into their architectures to handle unexpected drive failures, and rely on the erasure coding to automatically rebuild the data volumes upon failure.
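The core idea behind erasure coding can be shown with a toy single-parity scheme. Real systems like Storiant’s use Reed-Solomon-style codes that stripe data and multiple parity chunks across many disks and can survive several simultaneous failures; this sketch only demonstrates the basic principle that a lost chunk can be rebuilt from the surviving ones.

```python
# Toy erasure code: k data chunks plus one XOR parity chunk.
# Any single missing chunk can be reconstructed from the others.
# (Illustrative only -- not Storiant's actual coding scheme.)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunks: list[bytes]) -> bytes:
    """Compute a parity chunk over equal-length data chunks."""
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return parity

def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing chunk from the survivors plus parity."""
    missing = parity
    for c in surviving:
        missing = xor_bytes(missing, c)
    return missing
```

Because XOR is its own inverse, parity plus the surviving chunks yields the missing chunk exactly, at a storage overhead far below full three-way replication.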
If you use Hadoop’s default setting, everything is stored three times, which delivers five 9s of reliability, which used to be the gold standard for enterprise computing. Hortonworks architect Arun Murthy, who helped develop Hadoop while at Yahoo, pointed out at the recent Hadoop Summit that storing everything only twice in HDFS takes one 9 off the reliability, giving you four 9s. That certainly sounds good.
But one of the problems with big data is the law of large numbers. As the amount of data you’re storing creeps up into the petabyte range, your chances of losing a single byte of data suddenly become significant.
“When you do the math, when you get to 1PB, the equation changes,” Hogan says. If you have a system with 1PB of data on it, and you’re running a system with five 9s of reliability, your chances of losing data in a year are 12 percent, he says. While the odds of any one file being lost are incredibly small, you don’t get to pick which file is going to be lost, and that worries big companies.
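Hogan’s 12 percent figure is consistent with a simple independence model. A back-of-the-envelope sketch, where reading “five 9s” as 99.999 percent annual durability per stored object and the roughly 80GB object size are our assumptions, not Storiant’s published model:

```python
# Back-of-the-envelope durability math (assumptions, not Storiant's exact model):
# treat "five 9s" as 99.999% annual durability per stored object, and imagine
# 1 PB split into ~12,800 objects (~80 GB each -- a hypothetical chunking).
per_object_durability = 0.99999      # five 9s
num_objects = 12_800                 # ~1 PB at ~80 GB per object

# Probability that at least one object is lost during the year.
p_any_loss = 1 - per_object_durability ** num_objects
print(f"{p_any_loss:.0%}")           # prints 12%
```

The point survives even if the exact numbers differ: small per-object loss probabilities compound across millions of objects into a loss probability nobody can ignore.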
Storiant recommends that customers store two copies of data, which translates to a mind-boggling 18 9s of reliability. “That’s the thing: as you start talking about large data, even if you have good reliability, you’re still going to lose data…You need a dozen or more 9s as it gets bigger, and some mechanism to know if you’re losing data,” Hogan says.
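The arithmetic behind the 18 9s claim follows if each copy independently delivers nine 9s of durability (our assumption for the illustration, not Storiant’s published figure): losing data requires both copies to fail in the same year, and independent failure probabilities multiply.

```python
import math

# Illustrative only: assume each erasure-coded copy independently delivers
# nine 9s of annual durability (an assumption, not Storiant's stated figure).
p_copy_loss = 1e-9                # assumed per-copy annual loss probability
p_both_lost = p_copy_loss ** 2    # independent failures multiply
nines = -math.log10(p_both_lost)
print(round(nines))               # prints 18
```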
Bit loss rears its ugly little head when the data sets get really big. “You need to proactively manage that stuff isn’t silently disappearing even if you have three copies,” Hogan says. “Are those three copies coordinating with one another and restoring data that’s disappeared from one of the copies with bit loss? The answer in Hadoop is no, it’s not doing that. The bigger the data, the more important the reliability story and the more difficult the reliability story becomes. Unless you’re going to extreme measures to proactively fix that, then it’s just going to be gone and you won’t even know.”
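The kind of proactive management Hogan describes is often called scrubbing: periodically re-hash stored data and compare against recorded checksums so silent corruption is detected while a healthy copy still exists. A minimal sketch of the idea (illustrative only; the manifest format and function names are ours, not Storiant’s or HDFS’s):

```python
# Background "scrub" sketch: re-hash files and flag any whose checksum
# no longer matches a previously recorded manifest of known-good hashes.
import hashlib
import json
import pathlib

def sha256(path: pathlib.Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(data_dir: str, manifest_file: str) -> list[str]:
    """Return the files whose current checksum no longer matches the manifest."""
    manifest = json.loads(pathlib.Path(manifest_file).read_text())
    corrupted = []
    for name, expected in manifest.items():
        if sha256(pathlib.Path(data_dir) / name) != expected:
            corrupted.append(name)   # candidate for repair from a healthy copy
    return corrupted
```

A scrubber like this is only half the story; the system also needs enough redundancy left to repair whatever the scrub flags.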
Why You Should Use Hadoop for Your Next Data Lake
While the prospect of bit loss is enough to wake even the most hardened of CIOs up in the middle of the night, there are also some pretty good reasons to build your data lake on Hadoop. Consider these points that Soam Acharya, head of application architecture at Hadoop-as-a-service provider Altiscale, made in a recent blog post.
“Many people choose Object Stores because they are marketed as convenient, scalable, and cheap,” Acharya writes. “However, if you want to do more than just park your data, HDFS is the better choice.”
For starters, HDFS was specifically designed to support the high-bandwidth access patterns that big data workloads demand. If you want to actually do data science on the data—as opposed to just having it sit there as an archived copy—then you need to be able to get at it easily and manipulate it.
“Data science involves constant inspection and transformation of large data sets spanning multiple files,” he writes. “To this end, being able to manipulate directories as well as files is important. While the names of files in an Object Store can have slashes in them, Object Stores do not truly support directories.”
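Acharya’s point about directories has a practical consequence worth spelling out. In a real filesystem like HDFS, renaming a directory is one atomic metadata operation; in an object store, a “directory” is just a shared key prefix, so the same rename means rewriting every object under it. A toy model of a flat key-value store (our illustration, not any particular vendor’s API) makes the difference visible:

```python
# Why slashes in key names aren't directories: "renaming" a prefix in a flat
# object store means one copy-and-delete per object, not one metadata update.
store = {
    "logs/2015/01/app.log": b"...",
    "logs/2015/02/app.log": b"...",
}

def rename_prefix(store: dict, old: str, new: str) -> int:
    """Simulate 'mv logs/ archive/' against a flat key namespace."""
    moved = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)   # one operation per object
        moved += 1
    return moved
```

With millions of objects under a prefix, an operation a filesystem finishes in constant time becomes millions of copies, which is exactly the friction that hurts iterative, transformation-heavy data science workflows.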
You just can’t get the kind of intelligent data storage you can get from HDFS from an object store, he says. “Object Stores are great for objects, e.g., photos, Word documents, and videos,” Acharya writes. “But for interactive, high-bandwidth, sophisticated analysis of very large data sets, HDFS can’t be beat.”
Where We Go From Here
There are pros and cons to both technologies. There appears to be real momentum among large firms to use object stores, in particular those running in private cloud environments, as a long-term repository for massive, unstructured data that needs to be kept for compliance reasons. Using HDFS for that task makes less sense.
On the other hand, object stores can’t deliver the richness of functionality that HDFS offers. Today’s modern object stores are typically accessed via a REST API, which assures that the system will be open and the data accessible to a broad range of applications. But if you’re doing big data analytics and trying to iterate rapidly, the idea of extracting data via a Web service call sounds farfetched.
One of Hadoop’s strengths is how it lets you bring the compute to the data, but object stores rely on fast networks to move lots of data to the compute. That architecture reflects traditional HPC used in supercomputing sites, not modern, Web-scale systems.
There are several projects in the works that seek to combine the power of both approaches. One of these is Ozone, an object store designed to extend HDFS to support the concept of “bucket spaces.” Hortonworks launched Ozone last year and the project is now incubating. Storiant is also working with the Hadoop distributor to make the object store look like HDFS, thereby enabling users to work with the data as it sits in the object store.