Plugging Leaks in Big Data Lakes
The big data lake phenomenon is in full swing at the moment, with Hadoop playing a central role in the storage and processing of massive amounts of data. But without certain processes in place, a data lake will not stand the test of time. Unfortunately, most of those processes must be implemented manually today.
People today are expecting too much out of Hadoop, and therefore setting themselves up for failure. While Hadoop provides the basic structure for storing and analyzing vast amounts of data, it’s still quite green and rough around the edges when it comes to the types of features that hundred-billion-dollar corporations expect out of their IT investments.
For example, there are big banks that have successfully proven Hadoop’s worth for all sorts of uses, such as identifying fraudulent transactions, reducing customer churn, and targeting new opportunities. However, because of Hadoop’s poor data security and the tendency of banks to be risk-averse, these clusters have not been moved to production. The Hadoop community and vendor ecosystem are moving quickly to harden the products that will address these needs. But today it’s largely up to the Hadoop user herself to ensure that good data practices are implemented in a data lake.
One of the companies looking to fill the gap in data lake services is Teradata, which today unveiled a data lake assessment service that will be offered by its Think Big Analytics subsidiary. As part of the offering, Think Big will analyze a client’s Hadoop-based data lake and verify that it’s properly configured and structured for good governance and data integrity. Technicians will also show the client how to implement a new data stream that feeds the lake without compromising its structural integrity.
“I don’t care if it’s a lake or warehouse, any time you’re dealing with data, there are important things to consider: governance, lineage, metadata, scalability, security, and archival,” says Chris Twogood, vice president of product and services marketing for Teradata. “We think there’s going to be challenges in data lakes growing if they haven’t built in some of these elements to really shore it up so it’s available to scale.”
The promise of Hadoop is definitely resonating with businesses, and the timing of the rise of Hadoop couldn’t be better. But there are inherent difficulties in handling unstructured and semi-structured data, and today’s Hadoop data lake practitioners face a very real threat of misplacing or otherwise corrupting that data.
“What happens is, people say ‘It’s just Hadoop, dump the data in and make it available to everyone,'” Twogood tells Datanami. “That’s where the whole data lake promise is going to fall down. If you don’t ensure that you’ve established good governance practices for getting data in from an ingest stream, if you don’t create good metadata and understand lineage, if you haven’t set up strong security, as well as backup and archive, you’re not going to have a platform that’s going to scale.”
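To make Twogood’s point concrete, here is a minimal sketch of what “good governance at ingest” can look like in practice: instead of dumping raw records into the lake, each incoming record is wrapped with a catalog entry recording its source (lineage), ingest time, a checksum for integrity checks, and its field names as lightweight metadata. The function name, the dict-based catalog, and the field choices are all hypothetical illustrations, not any vendor’s API.

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest_with_metadata(record: dict, source: str, catalog: list) -> dict:
    """Register an incoming record in a (hypothetical) metadata catalog
    before it lands in the lake, capturing lineage and integrity info."""
    payload = json.dumps(record, sort_keys=True)
    entry = {
        "source": source,                                # lineage: where the data came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "checksum": hashlib.sha256(payload.encode()).hexdigest(),  # integrity check
        "schema_fields": sorted(record.keys()),          # lightweight metadata
    }
    catalog.append(entry)
    return entry

# Usage: each feed that writes to the lake registers its records first.
catalog = []
meta = ingest_with_metadata({"txn_id": 42, "amount": 9.99}, "atm-feed", catalog)
```

In a real deployment the catalog would live in a metadata store rather than an in-memory list, but the principle is the same: if every ingest stream leaves this kind of trail, questions like “where did this data come from?” and “has it been corrupted?” remain answerable as the lake grows.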
Teradata is looking to parlay decades’ worth of experience and knowledge from data warehousing into the brave new world of Hadoop-based big data analytics. Not all of the lessons are one-to-one matches with today’s reality, and that’s where the company is shoring up its offerings through acquisitions. The Think Big Analytics purchase appears to be working out well for Teradata, especially in light of the skills crunch that the big data rush has exposed. The July 2014 acquisition of Revelytix is also working in the vendor’s favor.
The big get with Revelytix was Loom, a data munging tool that helps companies get a handle on their big data sets, including adding metadata and performing light transformations. Teradata is gearing up to launch Loom version 2.4, which will automatically scan all incoming data being added to the data lake. Also added is support for the JSON data type, which will make the tool even more relevant for understanding Web data.
Watching all the struggles around Hadoop makes the folks at Teradata feel like they’ve been here before. “All of these best practices, they feel a little like déjà vu because we’ve been doing this forever in the data warehousing space,” Twogood says. “The first time [your Hadoop data lake] dies, people are going to say, ‘Are you kidding me, why weren’t you establishing some of the basic principles of data management?’ We’re just taking all this stuff we’ve done and learned from data warehousing and applying it to the data lake.”