February 16, 2015

Why ‘Data Lakes’ May Create Drowning Risks

Lionel Gibbons

Many organizations tackling Big Data projects find themselves swimming in uncharted waters, but the concept of a “data lake” may be one way to keep them from wading in too deep.

A data lake can be defined as an environment where a data warehouse resides within Hadoop. The idea is to bring greater efficiency to managing unstructured information. The trade-off is that those using the data lake approach are putting all of their eggs in one basket, which brings a number of potential risks and increased administration requirements. As a result, data lake adoption could emerge as one of the most critical decisions Hadoop users make in 2015.

Before moving to a data lake, firms should consider the staffing implications. Because the data in a lake is largely unstructured, managing it often takes many people.

Single vs. Multiple Clusters

Hadoop adopters gravitate toward multiple clusters, the number of which will vary according to specific company requirements. Implementations could be as simple as separating development and test from production.

The types of clusters needed will also vary widely according to data processing requirements and other variables; organizations may require as many as five types. These include exploratory clusters for looking for trends in data a company already has or can gather (typically one per type of data, per exploration group).

Other organizations might work towards what could be called “total view of the customer” clusters: Big Data sets that hold all the information about a customer across multiple systems. Some clusters might be dedicated to securing Big Data, or to organizing information from enterprise resource planning (ERP) systems. Then there are “super-data” clusters, the internal data lakes, which connect to these other clusters and collect as much data as possible.

Other possibilities include a Hadoop cluster for data mining and exploration, and others to support specific production workflow scenarios.

However, a data lake scenario in which a single cluster is accessed by several applications is not without its issues. As a data lake grows, it becomes an increasingly critical resource feeding multiple downstream applications, so it requires more advanced security, data governance, data management, and overall administration.
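
To make the governance point concrete, consider access control on a shared lake. The sketch below drives HDFS’s standard permission and ACL commands from Python so that each downstream group gets its own level of access; the directory layout and group names are hypothetical, not taken from any particular deployment.

```python
import subprocess

# Hypothetical zones of a shared lake: the raw zone is writable only by
# the ingest group, while analytics and reporting get read-only views.
LAKE_ZONES = {
    "/datalake/raw":     [("ingest",    "rwx"), ("analytics", "r-x")],
    "/datalake/curated": [("analytics", "rwx"), ("reporting", "r-x")],
}

def hdfs(*args):
    """Run an 'hdfs dfs' shell command, raising if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

for path, grants in LAKE_ZONES.items():
    hdfs("-mkdir", "-p", path)
    hdfs("-chmod", "770", path)          # lock the directory down first
    for group, perms in grants:          # then grant per-group access via ACLs
        hdfs("-setfacl", "-m", f"group:{group}:{perms}", path)
```

As more applications tap the lake, scripts like this tend to grow into full governance tooling, which is exactly the administrative overhead described above.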

Data lakes in Hadoop present unique opportunities…and unique challenges. This is one of the reasons we included management support for Hadoop within Bright Cluster Manager. Bright uniquely addresses the challenges and opportunities associated with Hadoop’s data lakes. For example, Bright allows for the management of multiple Hadoop clusters from a single, easy-to-use GUI. Not only can Bright handle multiple clusters, but each cluster can be based on different Hadoop distributions — from vanilla Apache Hadoop, to the latest offerings from Cloudera and Hortonworks. Because Bright takes care of deploying, monitoring, and managing Hadoop from bare-metal servers, multiple data lakes can be supported with minimal additional administrative overhead for IT personnel.

Data Warehouses within Hadoop

There are obvious advantages to using Hadoop to stage or prepare data for downstream use in traditional warehouses. For example, you can control the size of the data warehouse itself, reducing the cost of licensed data warehouse solutions.

In addition, raw data can be retained in Hadoop for further processing to extract more detail. You can return to search for patterns in the low-level data at any point in the future, and, with the right tools, perform these tasks in near-real time.
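
As a rough sketch of that staging pattern, the PySpark job below keeps the full raw detail in the lake while pushing only a compact aggregate toward the warehouse. PySpark, the paths, and the field names are all illustrative assumptions rather than anything prescribed by the article.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-staging").getOrCreate()

# Raw events stay in Hadoop indefinitely; path and schema are hypothetical.
raw = spark.read.json("hdfs:///datalake/raw/clickstream/")

# Stage a small daily aggregate for the licensed downstream warehouse,
# keeping the warehouse lean while full detail remains in the lake.
daily = (raw
         .withColumn("day", F.to_date("timestamp"))
         .groupBy("day", "customer_id")
         .agg(F.count("*").alias("events"),
              F.countDistinct("page").alias("pages_viewed")))
daily.write.mode("overwrite").parquet("hdfs:///datalake/curated/daily_activity/")

# Months later, a new question can be answered from the same raw data
# without re-ingesting anything into the warehouse.
server_errors = raw.filter(F.col("status") >= 500).groupBy("page").count()
```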

Rapid Adoption of Data Lakes

Reliance on a single platform may well be counterproductive, as there are other issues to consider that will make IT think twice before adopting a data lake environment. Despite the replication capabilities of the Hadoop Distributed File System (HDFS), real enterprise data protection features are not yet fully available in off-the-shelf Hadoop distributions. Enterprise IT folks also need snapshots, backup, and disaster recovery capabilities.
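
HDFS does offer directory-level snapshots out of the box, which cover part of that gap; a minimal sketch of driving them from Python follows. The paths are assumptions, and snapshots by themselves only guard against accidental deletion; they are not a substitute for off-cluster backup or disaster recovery.

```python
import subprocess
from datetime import datetime, timezone

LAKE_ROOT = "/datalake"  # hypothetical snapshottable root

def run(cmd):
    subprocess.run(cmd, check=True)

# One-time administrative step: mark the directory as snapshottable.
run(["hdfs", "dfsadmin", "-allowSnapshot", LAKE_ROOT])

# Periodic step (e.g. from cron): create a named, read-only snapshot.
stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
run(["hdfs", "dfs", "-createSnapshot", LAKE_ROOT, f"lake-{stamp}"])

# Real disaster recovery still means copying data off-cluster,
# for example with 'hadoop distcp' to a second site.
```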

Some are looking for alternatives to HDFS. For example, Tulane University decided to make use of the Intel Enterprise Edition for Lustre in its converged infrastructure for High Performance Computing (HPC) and Big Data Analytics. The University uses Bright Cluster Manager to deploy, monitor, and manage solution stacks that include Cloudera CDH running on Lustre instead of HDFS. Enterprise data lakes demand this kind of managed flexibility, from the file system up through the analytics applications, which Bright can deliver.

Hadoop may well become the default analytical platform for multiple industries and will only grow in popularity, which means that a lot of important data will soon be swimming in data lakes on Hadoop clusters.


Call to Action

http://info.brightcomputing.com/your-disposal-data-lakes

 
