Delivering on the Data Lake Promise
The fundamental promise of a data lake is that it will give business users better access to the data they need – securely, accurately and without the long lead times traditionally associated with transforming that data into a physical data model. That’s a big challenge, however, and many data lake projects fail.
To deliver the level of self-service access that enterprises require, a successful data lake needs to ensure that data delivered to users through the lake complies with the following key tenets:
- Incoming data must be thoroughly validated;
- A managed history must document the origin, evolution, and meaning of each data entity;
- The data needs to be richly profiled so users can easily see the content, completeness, quality, and potential applicability of each set of source data;
- The data must be fully secured.
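As an illustration of the first tenet, here is a minimal sketch of declarative ingest validation. The field names, rule names, and record layout are all hypothetical, not any particular product's API:

```python
# Hypothetical sketch: validate incoming records against simple,
# declarative rules before they are admitted to the lake.

def not_null(value):
    # Reject missing or empty values.
    return value is not None and value != ""

def is_int(value):
    # Accept anything parseable as an integer.
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False

# One rule set per source feed; field names are illustrative.
RULES = {
    "customer_id": [not_null, is_int],
    "email":       [not_null],
}

def validate(record):
    """Return a list of (field, rule_name) pairs that failed."""
    failures = []
    for field, checks in RULES.items():
        for check in checks:
            if not check(record.get(field)):
                failures.append((field, check.__name__))
    return failures

good = {"customer_id": "42", "email": "a@example.com"}
bad  = {"customer_id": None, "email": "b@example.com"}
print(validate(good))  # no failures
print(validate(bad))
```

The point of the declarative style is that a subject matter expert can author the rule set without writing pipeline code, which is exactly the shift from "tool developers" described above.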
If the power of producing data (including onboarding and ingesting it) is shifted to subject matter experts rather than “tool developers,” and these four tenets are applied, then consumers will have an unmatched level of confidence when accessing the data.
Certainly, this is easier said than done, as evidenced by the many reported challenges of implementing a data lake. In truth, data quality issues have been around for decades and are neither new to nor specific to data lakes. Data lakes, however, make fixing data quality issues both more urgent and more challenging for a few reasons:
- Hadoop alone offers only nascent capabilities for data onboarding and validation. These capabilities do not at present meet the requirements of an enterprise-scale data management environment.
- It is essential that all data is validated as it is ingested into the lake. Many organizations, however, struggle to define and implement a robust validation process.
- Many data lake projects rely on programmers, rather than subject matter experts, to validate data.
- If you don’t secure the data in the lake, few people will be allowed to use it. So making sure the data lake is well secured and governed is essential to the success of the project.
We will look at each of these assertions in more depth.
It’s Not Rocket Science (But It’s Close)
Over the last two decades, data validations have typically been executed as a part of the extraction/transformation/load (ETL) process when populating a data warehouse or data mart. These validations were usually based on an understanding of what the data should mean and how it should be organized, and were implemented by ETL programmers. Data validation, through ETL processes, was a standard part of the data warehouse project.
With data lakes, however, data validation on ingest is not automatic, not standard practice, and is often overlooked. Also, critical statistical profiling of the data entering the lake seldom takes place. Most project teams either don’t have the time for statistical profiling or believe they know enough about the data to make profiling unnecessary.
It is this decision–to skip automatic data validation and profiling during the ingest process–that drives the failure of many data lake projects. Data in the lake is neither validated nor profiled, which results in what is basically a “black box” of questionable quality and usefulness. This nearly always leads to delays in making data available and, ultimately, the delivery of bad data, eroding consumer confidence to the point of non-use.
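To make the profiling step concrete, here is a minimal sketch of the kind of statistics a profiler might compute per column. The metrics shown (completeness, distinct count, top values) are illustrative of basic profiling, not a specific product's output:

```python
from collections import Counter

def profile_column(values):
    """Compute basic profile statistics for one column of ingested data."""
    total = len(values)
    # Treat None and empty string as missing.
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": total,
        "completeness": len(non_null) / total if total else 0.0,
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

# Example: a column with two kinds of missing values.
ages = ["34", "29", None, "34", "", "51"]
print(profile_column(ages))
```

Even a profile this simple lets a consumer see at a glance whether a source column is complete enough to be useful, which is the confidence-building step most projects skip.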
Data Validation and Profiling Are Essential
In practice, however, validating all data entering the lake is hard. Each data source features unique technical challenges and a particular set of data issues, necessitating the development of specific data validation functions for each source. Given tight project timelines and the thousands of data sources a typical organization needs to load into the lake, SLAs require that the validation process be executed automatically and efficiently.
And there are a lot of data problems to solve during the ingest, validation, and profiling stage. Here are a few examples just to illustrate the scope of the effort:
- Complex data such as COBOL or XML files that contain nested or hierarchical structures need to be flattened or normalized. Similarly, multiple record types within a single file must be identified and ingested accurately for their specific format.
- Garbage data such as “control characters,” embedded newlines and embedded delimiters needs to be found and fixed.
- Data in EBCDIC or packed-decimal format needs to be converted into a UTF-8-compliant format.
- The COBOL copybooks or their equivalent need to be converted into an accurate and complete HCatalog schema.
- Headers and trailers for each file, which often contain validation data (record count, table schema, business data), need to be verified during the load, and then stripped from the data set prior to querying and analysis.
- XML data needs to be normalized and “keyed.”
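Two of the items above can be sketched in a few lines: scrubbing control characters and verifying a trailer record count. The `TRAILER|<count>` layout is invented for illustration; real feeds each have their own header/trailer conventions:

```python
import re

# Strip ASCII control characters (except tab) that would corrupt
# downstream parsing. Embedded newlines inside fields need
# source-aware handling and are not addressed here.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")

def scrub(line):
    return CONTROL_CHARS.sub("", line)

def verify_trailer(lines):
    """Assume the first line is a header and the last line is a
    trailer shaped like 'TRAILER|<record_count>' (illustrative only).
    Verify the count, then return scrubbed records with the
    header and trailer stripped."""
    header, *records, trailer = lines
    expected = int(trailer.split("|")[1])
    if len(records) != expected:
        raise ValueError(f"expected {expected} records, got {len(records)}")
    return [scrub(r) for r in records]

feed = ["HDR|customers", "1|Ann\x07e", "2|Bob", "TRAILER|2"]
print(verify_trailer(feed))  # ['1|Anne', '2|Bob']
```

Multiply this by thousands of sources, each with its own layout quirks, and the scale of the validation effort becomes clear.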
It is common to think: “Let’s just copy data into HDFS and we will have a data lake!” But by the time teams come to terms with the extent of data problems in the lake, it is often way too late to fix those issues without massive slips in the project timeline and delivery dates.
Finally, the process of ingesting data into the lake (or as we like to refer to it, producing data) has to be done within the context of a robust security model that allows the producers of the data to apply a corporate security model to the data. We will provide more details on this later.
Producing data for the data lake can be a complex process fraught with risks that can sink a project with delays, cost overruns and failure. Sure, the promise of open source projects such as NiFi, Atlas and Kafka, to name a few, makes it seem easy to onboard data, but remember: without addressing data validation and profiling upon ingest in an efficient way, consumers will never be sure about what they have.
It May Not Matter If You Don’t Secure It
By definition, a data lake involves putting enterprise data within arm’s reach of lots of consumers and giving those users self-service, on-demand access to that data. Given this context, the potential risk that people might have access to data they shouldn’t, or that an enterprise data breach or inappropriate disclosure of personally identifiable information might occur, is significant. To be enterprise-ready from a security perspective, a self-service data lake needs to achieve three goals:
1. Leverage and comply with the organization’s existing security policies
Compliance with existing enterprise security starts with honoring the organization’s authorization and authentication processes, via Active Directory and Kerberos or whatever process the enterprise has standardized on. It also means recognizing and enforcing all data access constraints dictating which data each particular user is authorized to read, update or share. This is particularly important as the data lake ingests data from sources or publishes data out to other applications or users.
Enterprise readiness from a security perspective also requires that the data lake support data encryption and obfuscation at the field level. The lake should not only seamlessly recognize and maintain encryption and obfuscation of data entering or leaving the lake, but also enable users to add additional encryption or obfuscation to data in the lake as needed.
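A minimal sketch of two common field-level obfuscation techniques, masking for display and one-way tokenization. The salt value and field format are illustrative assumptions; a real deployment would manage the secret in a key store:

```python
import hashlib

def mask(value, keep_last=4):
    """Mask all but the last few characters (e.g. for display)."""
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def tokenize(value, salt="per-deployment-secret"):
    """One-way tokenization: the same input always maps to the same
    token, so joins across data sets still work, but the raw value
    is not recoverable from the token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

ssn = "123-45-6789"
print(mask(ssn))      # '*******6789'
print(tokenize(ssn))
```

The deterministic property of tokenization is what lets analysts join obfuscated data without ever seeing the underlying identifier.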
2. Continuously integrate emerging security measures available through the Hadoop community as they mature
Users working with data in the lake should be constrained in what data they are allowed to access by file and directory level access constraints established at the HDFS level. When an organization is using other Hadoop-specific security measures like Ranger, Sentry or RecordService, the data lake also needs to integrate with those tools. Finally, the data lake needs to support impersonation to create failsafe transparency and auditability at the HDFS file level regarding exactly which users have had access to exactly what data in the lake over time.
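The shape of a directory-level access check with an audit trail can be sketched abstractly. This is a toy model of the user-to-group-to-path mapping that HDFS-level constraints enforce, with invented group and path names, not HDFS itself:

```python
# Toy model of group-based, path-level access checks with auditing.
GROUPS = {"alice": {"analysts"}, "bob": {"ops"}}          # user -> groups
PATH_ACLS = {"/lake/refined/sales": {"analysts"},          # path -> allowed groups
             "/lake/raw/hr": {"hr"}}

AUDIT_LOG = []  # every decision is recorded for later review

def can_read(user, path):
    """Allow access when the user shares a group with the path's ACL,
    and record the decision regardless of the outcome."""
    allowed = bool(GROUPS.get(user, set()) & PATH_ACLS.get(path, set()))
    AUDIT_LOG.append((user, path, allowed))
    return allowed

print(can_read("alice", "/lake/refined/sales"))  # True
print(can_read("bob", "/lake/raw/hr"))           # False
```

The audit log is the piece that matters for the transparency requirement: every decision, allowed or denied, leaves a record of who touched what.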
3. Make it easy for administrators to implement security
The power of self-service access to data in the data lake era is obvious and will grow significantly as both producers and consumers leverage the capabilities outlined in this article. As this virtuous cycle expands and more users come to the data, the task of securing data in the lake grows as well.
To meet enterprise scale needs, the data lake needs to give administrators easy ways to recognize new users, assign them to user groups, and give them access to the right data. Reporting around users, data access patterns, performance, and data volumes should allow administrators to quickly and easily understand what’s happening in the lake and what steps should be taken to improve efficiency and maintain security. Users for their part should be prevented from causing security problems by automatically and effortlessly being subject to all of the security provisions in the environment whenever they are working in the data lake.
In part one of this series, we said that an enterprise data lake is more than just a set of data in Hadoop. It is an enterprise-scale data management platform – a marketplace that brings together data producers and data consumers in a dramatically new data-as-a-service model.
But to serve this new role, the data lake needs to deliver truly enterprise-scale capabilities. As we have laid out here, this is certainly the case with respect to onboarding data into the lake and the underlying security model.
In the third and last part of this series, we’ll talk about two more aspects of what it takes for a data lake to be truly enterprise ready: integration of the data lake with other applications and systems in the enterprise IT landscape, and data governance.
About the author: Bob Vecchione is the co-founder and chief technologist at big data analytics software provider Podium Data. Bob is recognized as an industry leader in the design, architecture and implementation of large-scale data systems. His more than two decades of experience includes working for Prime Computer, Thinking Machines, Strategic Technologies & Systems, Knowledge Stream Partners, as an independent data systems architect and now, Podium Data. He holds a degree in electrical engineering from the University of Massachusetts at Lowell.