November 24, 2014

The Aspirational Data Lake Value Proposition

Tripp Smith

The industry hype around Hadoop and the concept of the Enterprise Data Lake has generated enormous expectations, reaching all the way to the executive suite. Yet when trying to establish a Modern Data Architecture, organizations are vastly unprepared to house, analyze, and manipulate the massive quantities of data available. All too often, they believe the only requirements are to download the Hadoop software, install it on a bunch of servers, commence loading the Data Lake, unplug the enterprise Data Warehouse (EDW), and, voila, the profits will come in faster than the tide in a hurricane.

The reality is that the emerging “Data Lake” architecture paradigm – in a nutshell, a large object-based storage repository that holds data in its native format until needed, imposing structure only on read (the sketch after this list shows the pattern in miniature) – oversimplifies what it takes to make enterprise Hadoop actionable and sustainable, largely because of a few early mishaps:

  • Hadoop emerged from immature industries: Hadoop grew up with little focus on governance, compliance, and business continuity; only now is “move fast and break things” being replaced with “move fast with stability.”
  • Industry perspectives had the wrong focus: Instead of governance and extensibility, pundits focused on access and scale. That emphasis forced analysts to spend more time on data forensics than on data analysis.
  • Strategy became an afterthought: Experience shows that diving into a Data Lake without a strategy for maturing at scale is more likely to create a “Data Swamp” that fails to deliver on the value proposition of Hadoop and generates countless processes that become “Swamp Monsters” requiring expensive ongoing maintenance.
  • An overall lack of governance: Introducing a Hadoop ecosystem without governance generates a morass of challenges for both IT and the business, including the absence of platform security and repeatable processes for securing data, of a common vocabulary and business data definitions, and of transparent data quality and data lineage. It also leaves the platform unable to manage complex mixed workloads and the variety of access patterns needed to support disparate user groups and use cases.
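
To make the “native format until needed” idea concrete, here is a minimal sketch in plain Python. A local directory stands in for HDFS, and the function names (land_raw, read_with_schema) are illustrative, not part of any Hadoop API: raw records are landed untouched, and a schema is applied only when a consumer reads them.

```python
import json
import os
from datetime import date

LAKE_DIR = "lake/raw/events"  # local stand-in for an HDFS path such as /lake/raw/events

def land_raw(records, batch_id):
    """Land records exactly as received: no parsing, no schema enforcement."""
    os.makedirs(LAKE_DIR, exist_ok=True)
    path = os.path.join(LAKE_DIR, f"{date.today()}-batch{batch_id}.jsonl")
    with open(path, "w") as f:
        for rec in records:
            f.write(rec + "\n")  # stored in its native format, as-is
    return path

def read_with_schema(fields):
    """Schema-on-read: structure is imposed only when the data is consumed."""
    for name in sorted(os.listdir(LAKE_DIR)):
        with open(os.path.join(LAKE_DIR, name)) as f:
            for line in f:
                rec = json.loads(line)
                # Project only the fields this consumer cares about;
                # everything else stays untouched in the raw store.
                yield {field: rec.get(field) for field in fields}

# Ingest first, decide on structure later.
land_raw(['{"user": "u1", "action": "click", "ts": 1416787200}'], batch_id=1)
for row in read_with_schema(["user", "action"]):
    print(row)  # {'user': 'u1', 'action': 'click'}
```

The flexibility is real, but so is the risk the bullets above describe: nothing in this pattern, by itself, records who landed what, what it means, or whether it can be trusted.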

The MESH Framework

A Mature Enterprise Strength Hadoop (MESH) Framework addresses how to scale a Hadoop ecosystem to meet enterprise needs. The MESH Framework establishes a matrix of architecture, governance, and enablement capabilities that scales with the platform, organically promoting administration and management as well as access and extensibility.

Increasingly, Hadoop is not a single environment, but an integrated ecosystem addressing a variety of use cases. MESH provides an architectural methodology for data architecture, infrastructure sizing, component integration and tool selection to maximize extensibility while controlling the diversity of the IT portfolio.

A MESH framework also incorporates proven enterprise-strength security, with field-hardened reference models for leading platforms addressing access, authentication, authorization, data security, and resource management.

Skilled Hadoop resources are also in very short supply. A MESH framework provides automation and acceleration across governance and enablement vectors to reduce development time and focus analyst efforts on surfacing insights, not data forensics.

However, governance is not a one-time, set-it-and-forget-it exercise. A MESH framework provides a roadmap and tools to activate Hadoop value within the context of mature enterprise analytic capabilities.

As the Hadoop ecosystem has matured and gained increasing traction in mature industries, it is clear that the potential benefits for data management and analytics are staggering. Some estimates put Hadoop at one twenty-fifth the per-terabyte cost of leading proprietary RDBMS platforms. The industry hype around Hadoop and the concept of the Enterprise Data Lake has generated enormous expectations, and organizations are quickly taking the plunge. Experience shows that by starting with a rigorous framework that encourages information agility and enablement rather than suppressing opportunity, organizations can avoid turning their Data Lake into a Swamp. More importantly, they can finally jump into the Modern Data Architecture pool and deliver on Hadoop’s true value proposition.
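
Taken at face value, that cost ratio compounds quickly with capacity. The sketch below uses purely hypothetical dollar figures (assumed here for illustration only, not drawn from any vendor’s pricing) to show what a 25x per-terabyte gap means at scale.

```python
# Hypothetical per-terabyte costs, for illustration only; not vendor pricing.
RDBMS_COST_PER_TB = 25_000.0                   # assumed proprietary RDBMS cost ($/TB)
HADOOP_COST_PER_TB = RDBMS_COST_PER_TB / 25    # the cited 25x advantage

for capacity_tb in (10, 100, 1_000):
    rdbms = capacity_tb * RDBMS_COST_PER_TB
    hadoop = capacity_tb * HADOOP_COST_PER_TB
    print(f"{capacity_tb:>5} TB: RDBMS ${rdbms:>12,.0f}  "
          f"Hadoop ${hadoop:>10,.0f}  savings ${rdbms - hadoop:>12,.0f}")
```

Whatever the true base figures, the point stands: the per-terabyte economics are why organizations are diving in, and why a Swamp is such an expensive outcome.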

About the author:  Tripp Smith is CTO at Clarity Solution Group, a recognized data and analytics consulting firm. Contact him at www.clarity-us.com.

Related Items:

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real-Time

If You’re Missing Fast Data, Big Data Isn’t Working for You

Are Data Lakes All Wet?
