May 16, 2016

Data Management and Governance in the Data Lake

A data lake is a central location in which to store all your data, regardless of source or format. It is typically built using Hadoop. The data can be structured or unstructured. You can use a variety of storage, analytic and processing tools to extract value quickly to inform key organizational decisions.

Because all data is welcome, data lakes are a powerful alternative or complement to a traditional Enterprise Data Warehouse. In addition, data lakes are a leading choice as organizations turn to cloud-based applications and IoT.

In early use cases, organizations frequently loaded data into the data lake without attempting to manage it.  As data lakes mature and become more strategic to an organization, it is no longer sufficient to dump the data into the data lake and hope for the best.

A data lake is flexible, scalable and cost-effective. But it can also possess much of the discipline of a traditional EDW if you add data management and governance capabilities such as data quality, metadata management, security, transformations and the ability to subset or combine data. The data lake can, when managed correctly, improve your existing data initiatives and enable new initiatives. Your organization can choose one of four paths when building a data lake:

Option 1: Address governance later

The first option is to ignore governance and load data freely into the lake. Later, when you need to discover insights from the data, you will have to find tools, such as machine-learning techniques, to clean it. There are real risks to this approach. Even the most intelligent inference engine needs to start somewhere in the massive amounts of data in the lake. Inevitably, parts of your data lake will be ignored and become stagnant and isolated, holding data with so little structure that even the smartest automated tools, or human analysts, won’t know where to begin.

Option 2: Adapt existing legacy tools

You can leverage applications and processes that were originally designed for the EDW. Software tools are available that perform the ETL processes you used when importing clean data into the EDW. You can use these tools to import data into the lake; however, this approach is costly and addresses only a portion of the management and governance functions you need. Another drawback is that ETL happens outside the Hadoop cluster, which slows operations and adds cost because data must be moved out of the cluster for each query.

Option 3: Write custom scripts

With the third option, you build a workflow using custom scripts that connect processes, applications, quality checks, and data transformations to meet governance needs. This is a popular choice, but it is the least reliable and the most resource-intensive. You need highly skilled analysts steeped in Hadoop and its ecosystem to leverage open-source tools, and they must write the scripts that connect the pieces. The process becomes more time-consuming and costly as the lake grows, because you have to constantly revise complicated code and workflows.
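To make the idea of a scripted workflow concrete, here is a minimal sketch in Python of a quality check chained to a transformation, with a simple row-count log at each step. Everything in it is hypothetical and chosen for illustration: the `Pipeline` class, the step functions, and the `id` and `amount` field names do not come from any particular tool, and a real Hadoop workflow would involve far more than in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    """Row counts recorded for one step, as a crude audit trail."""
    name: str
    rows_in: int
    rows_out: int


@dataclass
class Pipeline:
    """Hypothetical helper that runs steps in order and logs each one."""
    steps: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def add(self, name, fn):
        self.steps.append((name, fn))
        return self  # allow chained .add() calls

    def run(self, records):
        for name, fn in self.steps:
            rows_in = len(records)
            records = fn(records)
            self.log.append(StepResult(name, rows_in, len(records)))
        return records


def drop_incomplete(records):
    """Quality check: keep only records with every required field filled in."""
    required = ("id", "amount")  # assumed field names
    return [r for r in records
            if all(r.get(k) not in (None, "") for k in required)]


def normalize_amount(records):
    """Transformation: convert the amount from text to a float."""
    return [{**r, "amount": float(r["amount"])} for r in records]


raw = [
    {"id": "a1", "amount": "10.50"},
    {"id": "a2", "amount": ""},      # missing amount
    {"id": None, "amount": "3"},     # missing id
]

pipeline = (Pipeline()
            .add("drop_incomplete", drop_incomplete)
            .add("normalize_amount", normalize_amount))
clean = pipeline.run(raw)

for step in pipeline.log:
    print(step.name, step.rows_in, "->", step.rows_out)
```

Even in this toy form, every new source, rule, or format means another hand-written function and another place for the workflow to break, which is the maintenance burden described above.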

Option 4: Deploy an integrated data lake management platform

The fourth option is to incorporate a data lake management platform that has been purpose-built to ingest and manage large volumes of diverse data sets in the data lake. Zaloni’s Bedrock provides this capability. It allows you to catalog the data using metadata, and it supports the ongoing work of ensuring data quality, tracking data lineage, and automating workflows. This approach is gaining ground as the optimal solution for data lake management and governance.

As you transition to a data lake, choosing a fully integrated data lake management platform will allow you to have confidence in your data and to scale the lake to serve more users and use cases that benefit the business. After all, that is what the data is for: to inform and improve decision-making processes across your organization and to help your business grow in new and exciting ways.

Learn more about Zaloni’s Bedrock data lake management platform, and how it can help you manage, govern and scale your data lake at www.zaloni.com.
