Building the Enterprise-Ready Data Lake: What It Takes To Do It Right
The last year has seen significant growth in the number of companies launching data lake initiatives as their first mainstream production Hadoop-based project. This isn’t surprising, given the compelling technical and economic arguments in favor of Hadoop as a data management platform and the continued maturation of Hadoop and its associated ecosystem of open source projects.
The value is undeniable: Providing a true “Data As A Service” solution within the enterprise keeps business users engaged, productive and driving immediate value. Continued work by Cloudera and Hortonworks (NASDAQ: HDP) on the Atlas, Sentry, Ranger, RecordService, Knox and Navigator projects signals ongoing efforts to improve data security, metadata management, and data governance for data in Hadoop.
The problem is that despite these incremental improvements, Hadoop alone still lacks many of the essential capabilities required to securely manage and deliver data to business users through a data lake at an enterprise scale.
For example, the significant challenge and critical task of automatically and accurately ingesting data from a diverse set of traditional, legacy, and big data sources into the data lake (on HDFS) can only be addressed by custom coding or tooling. Even with tooling, challenges such as data validation, character set conversion and history management, to name a few, are often not fully understood or, worse, neglected altogether.
The open source projects also don’t give business users an easy way to collaborate on creating and sharing insights about data in the lake through crowd-sourced business metadata, analytics and data views.
Nor does Hadoop automatically support integrating the data lake with other enterprise applications through job schedulers or batch processing jobs in an automatic, lights-out mode. Finally, despite efforts around Sentry and the other projects mentioned above, Hadoop still does not secure data in HDFS at a level consistent with the requirements of a typical Fortune 500 company.
For companies looking for a commercial approach to building their data lake, there are a variety of options available, including data wrangling tools, legacy ETL vendors, and traditional providers of database and data warehouse platforms. These solutions vary widely in terms of functional completeness, native Hadoop support, and most importantly: enterprise readiness.
This article will explore in detail what it takes to make a data lake truly enterprise ready. Looking at the entire data lake process, we will examine enterprise readiness in terms of data ingestion, profiling and validation, as well as data preparation and delivery. Since data lakes are built by teams and accessed by large groups of business users, this article will also address some of the complexity and control issues associated with managing expanding and changing sets of data files in the lake. Other topics will include enterprise-scale data governance, security, and metadata management, and the integration of the data lake with other systems in the overall enterprise application and data architecture.
In this first installment we’ll look at the data layer underpinning the data lake and ask how best to manage data to support an enterprise-scale data lake deployment.
Data, Data Everywhere and Not A Drop of Trusted Data To Drink
The ever-growing list of services associated with Hadoop gives users a variety of ways to access data. Unlike an RDBMS, when data is available in Hadoop (on HDFS), there is no guarantee that the data, at a record level, is “good,” or consumable. Scarier still, when accessing this data (whether with Hive, Spark, Impala, etc.), the user will not know that the data may be “bad.” This is one of many challenges in what we call onboarding data. Other important considerations include organization and history management. Specifically, you must consider:
- A consistent way of assigning meaningful file naming conventions;
- A rational directory structure that organizes files in a way that matches the flow of data through the data lake process at the application layer;
- Automatic partitioning of data in tables to optimize performance and manage history;
- An easy way to identify data usage patterns and remove unused data as needed.
Let’s look at each of these requirements in turn.
Bringing data into the lake is easy, right? You just copy the data using Sqoop or copyFromLocal, for example. The reality is that there is no magic bullet that will validate this data, tell you what you have, organize it, and manage it. The very ease of copying data into the lake leads to chaos in the data layer as the data lake grows and the number of files and users expands. Each data source imported into the lake creates new files, both during the original onboarding and with each subsequent update. More files are generated through the data validation, profiling, cleaning, transformation, integration and preparation stages.
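A minimal sketch of the record-level validation a raw copy never provides, routing each record to a “good” or “bad” bucket before business users ever see it. The field names and rules here are hypothetical; a real lake would drive them from source metadata rather than hard-coding them.

```python
import csv
import io
from datetime import datetime

def _is_date(value):
    """Accept only ISO-style YYYY-MM-DD dates."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical per-field rules for an imagined 'orders' feed.
RULES = {
    "order_id": lambda v: v.isdigit(),
    "amount":   lambda v: v.replace(".", "", 1).isdigit(),
    "created":  _is_date,
}

def split_good_bad(raw_csv):
    """Separate consumable records from malformed ones, instead of
    letting bad rows silently reach Hive/Spark/Impala users."""
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        ok = all(rule(row.get(field, "")) for field, rule in RULES.items())
        (good if ok else bad).append(row)
    return good, bad
```

In practice the “bad” bucket would land in its own quarantine directory in the lake for later inspection, rather than being discarded.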
This explosion in the number of files in the lake is accelerated because, in Hadoop, files are never updated, only recreated; any change to data in a file results in the creation of a new file. As a result, the number of files in the lake may start small, but it can get big fast, creating the data management equivalent of kudzu.
Let’s drill down into the specific ways you can keep the kudzu out.
1. Make The Data Layer Decipherable
Simple concepts such as file naming can also be a problem. While users working directly with Hadoop are free to name files whatever they want, in a data lake file names can communicate something about the data itself, such as the source, lineage, or create date.
As data moves through the data lake process and is enhanced, transformed, and copied, smart file names can document aspects of that process and create a transparent record of that data’s journey from one file to the next.
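One way to make file names self-describing is to compose them from the source system, the entity, the pipeline stage, and the load timestamp, so each can be recovered later just by reading the name. The convention below is purely illustrative, not a standard.

```python
from datetime import datetime, timezone

def lake_file_name(source, entity, stage, ext="avro", when=None):
    """Build a file name that documents the data's origin and its
    position in the data lake process, e.g. which stage produced it.
    Double underscores separate the fields so each part stays parseable."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{source}__{entity}__{stage}__{stamp}.{ext}"
```

With a convention like this, the chain of files produced as data moves from raw ingestion through validation and transformation reads as a record of the data’s journey.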
2. Keep The Lake Clean
HDFS’ hierarchical directory structure offers the same challenge and opportunity. When working directly with Hadoop, users can put files wherever they want – in whatever directories they choose on the cluster.
However, in a data lake there are real advantages to using a systematic set of well-named and organized directories and subdirectories to group files in a way that matches how data is managed in the data lake itself.
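A directory layout that mirrors the stages of the data lake process might look like the sketch below. The zone names and root path are illustrative assumptions; actual zone vocabularies vary by organization.

```python
import posixpath

# Hypothetical lake zones matching the flow of data through the lake:
# raw landing, validated, refined/transformed, and published for users.
ZONES = ("landing", "validated", "refined", "published")

def lake_path(zone, source, entity, load_date):
    """Build a predictable HDFS directory for a given zone, source
    system, entity, and load date, so files are grouped the same way
    the data lake application manages them."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return posixpath.join("/data/lake", zone, source, entity,
                          f"load_date={load_date}")
```

Because every path is derived the same way, administrators (and tools) can locate any file from its source, stage, and load date alone.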
3. Partition for Performance
Over time, files in the data layer of a data lake can also become very large and inefficient. Large sets of data, loaded into HDFS and into the lake directly through Hadoop, are not by default partitioned for optimal performance. For example, a daily load of update data into the data lake from an enterprise application acting as a data source will not by default be partitioned by the create date of each record in the file.
Business users accessing data in the lake based on some date-specific criteria (“Show me all new sales transactions from yesterday”) might get very slow response times as the data lake application, working through HDFS, has to search through many files across many nodes in the cluster to find all records with that date stamp. To be enterprise ready, the data layer in the data lake needs to automatically partition data as it is written into files in HDFS, anticipating users’ likely query patterns and optimizing performance.
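The core idea of date partitioning can be sketched in a few lines: group records by their create date so each day’s data lands in its own Hive-style partition directory (`created=YYYY-MM-DD`), and a date-filtered query only has to touch the matching partition. The field name is a hypothetical example.

```python
from collections import defaultdict

def partition_by_date(records, date_field="created"):
    """Bucket records by create date into Hive-style partition keys,
    so 'yesterday's transactions' queries can prune every other
    partition instead of scanning all files in the cluster."""
    parts = defaultdict(list)
    for rec in records:
        parts[f"{date_field}={rec[date_field]}"].append(rec)
    return dict(parts)
```

In a real lake this grouping happens as the data is written, with each bucket flushed to its own partition directory in HDFS.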
4. Identify and Eliminate Data Exhaust
Finally, there is the risk that over time the data lake can become a vast parking lot for data exhaust, i.e., data that was originally loaded into the lake in the belief that it would be useful and used, but which over time has proven not to be needed, accessed or worth keeping. Data exhaust isn’t just messy. It’s also distracting (what is this stuff?), wasteful (why are we storing all this junk?) and potentially risky (why are we keeping this sensitive information needlessly here where it can get stolen?).
To be enterprise ready, data lakes need to enable the identification and elimination of data exhaust by giving data lake administrators an easy way to survey the total contents of the lake, identify which files in the data layer are not being accessed, and remove them efficiently and regularly.
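Mechanically, this kind of survey reduces to comparing last-access times against a retention policy. The sketch below assumes the administrator has already pulled (path, last-access) pairs from somewhere like an HDFS audit log; the 180-day cutoff is an arbitrary example policy, not a recommendation.

```python
from datetime import datetime, timedelta

def find_exhaust(file_index, now=None, max_idle_days=180):
    """Given (path, last_access_datetime) pairs, flag files that have
    gone unaccessed longer than the policy allows, as candidates for
    review and removal from the lake."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=max_idle_days)
    return [path for path, last_access in file_index if last_access < cutoff]
```

Run regularly, a report like this turns data-exhaust cleanup from guesswork into a routine administrative task.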
It’s tempting to think that Hadoop, with its flexible file management system and robust, high-performance MapReduce processing model, is all you need to manage large volumes of data in the data layer of an enterprise-scale data lake. But as we have seen, the openness and flexibility of Hadoop can also lead to chaos, inefficiency and overgrowth as the lake grows with more data and more users.
To be truly enterprise ready, data lakes need to supplement and extend Hadoop with data management capabilities that organize and maintain the growing set of data files in the data layer underpinning the lake.
In the next installment in this three-part series, we’ll look at two other aspects of what it takes to make a data lake enterprise ready: How To Architect The Data Lake to Meet Corporate Security Standards and How To Bring Data Into the Lake Accurately and Efficiently.
About the author: Bob Vecchione is the co-founder and chief technologist at big data analytics software provider Podium Data. Bob is recognized as an industry leader in the design, architecture and implementation of large-scale data systems. His more than two decades of experience include working for Prime Computer, Thinking Machines, Strategic Technologies & Systems, Knowledge Stream Partners, as an independent data systems architect and now, Podium Data. He holds a degree in electrical engineering from the University of Massachusetts at Lowell.