February 17, 2015

Plugging Leaks in Big Data Lakes

The big data lake phenomenon is in full swing at the moment, with Hadoop playing a central role in the storage and processing of massive amounts of data. But without certain processes in place, a data lake will not stand the test of time. Unfortunately, most of those processes must be implemented manually today.

People today are expecting too much out of Hadoop, and therefore setting themselves up for failure. While Hadoop provides the basic structure for storing and analyzing vast amounts of data, it’s still quite green and rough around the edges when it comes to the types of features that hundred-billion-dollar corporations expect out of their IT investments.

For example, there are big banks that have successfully proven Hadoop’s worth for all sorts of uses, such as identifying fraudulent transactions, reducing customer churn, and targeting new opportunities. However, because of Hadoop’s poor data security and the tendency of banks to be risk-averse, these clusters have not been moved to production. The Hadoop community and vendor ecosystem are moving quickly to harden the products to address those needs. But today it’s largely up to the Hadoop user herself to ensure that good data practices are implemented in a data lake.

One of the companies looking to fill the gap in data lake services is Teradata, which today unveiled a data lake assessment service that will be offered by its Think Big Analytics subsidiary. As part of the offering, Think Big will analyze a client’s Hadoop-based data lake and verify that it’s properly configured and structured to ensure good governance and data integrity. Technicians will show the client how to implement a new data stream that feeds the lake without compromising its structural integrity.

“I don’t care if it’s a lake or warehouse, any time you’re dealing with data, there are important things to consider: governance, lineage, metadata, scalability, security, and archival,” says Chris Twogood, vice president of product and services marketing for Teradata. “We think there’s going to be challenges in data lakes growing if they haven’t built in some of these elements to really shore it up so it’s available to scale.”

The promise of Hadoop is definitely resonating with businesses, and the timing of the rise of Hadoop couldn’t be better. But there are inherent difficulties in handling unstructured and semi-structured data, and today’s Hadoop data lake practitioners face a very real threat of misplacing or otherwise corrupting that data.

“What happens is, people say ‘It’s just Hadoop, dump the data in and make it available to everyone,’” Twogood tells Datanami. “That’s where the whole data lake promise is going to fall down. If you don’t ensure that you’ve established good governance practices for getting data in from an ingest stream, if you don’t create good metadata and understand lineage, if you haven’t set up strong security, as well as backup and archive, you’re not going to have a platform that’s going to scale.”

Teradata is looking to parlay decades’ worth of experience and knowledge from data warehousing into the brave new world of Hadoop-based big data analytics. Not all of the lessons are one-to-one matches with today’s reality, and that’s where the company is shoring up its offerings through acquisitions. The Think Big Analytics purchase appears to be working out well for Teradata, especially in light of the skills crunch that the big data rush has exposed. The July 2014 acquisition of Revelytix is also working in the vendor’s favor.

The big get with Revelytix was Loom, a data munging tool that helps companies get a handle on their big data sets, including adding metadata and performing light transformations. Teradata is gearing up to launch Loom version 2.4, which will automatically scan all incoming data being added to the data lake. Also added is support for the JSON data type, which will make the tool even more relevant for understanding Web data.

Watching all the struggles around Hadoop makes the folks at Teradata feel like they’ve been here before. “All of these best practices, they feel a little like déjà vu because we’ve been doing this forever in the data warehousing space,” Twogood says. “The first time [your Hadoop data lake] dies, people are going to say, ‘Are you kidding me, why weren’t you establishing some of the basic principles of data management?’ We’re just taking all this stuff we’ve done and learned from data warehousing and applying it to the data lake.”
