Can Data Lakes Survive the Data Silo Onslaught?
Data lakes are hot. There’s no doubt about that. But will the data lake concept hold water in the long term? The jury is still out on that question, as the number and variety of databases proliferate, as big Hadoop environments get difficult to manage, and as the difficulty of bringing it all together continues to thwart all-encompassing analytical efforts.
First, some numbers. According to a report released today by Research and Markets, the data lake market is projected to drive nearly $9 billion in revenue by 2021, up significantly from the $2.5 billion in spending it is expected to drive this year. That 30% compound growth rate shows the data lake concept has momentum in the short and medium term.
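The growth figure can be sanity-checked with the standard compound annual growth rate formula. The sketch below is only a rough check: it assumes a five-year horizon ending in 2021, which is an assumption because the report's starting year is given only as "this year", and it treats "nearly $9 billion" as exactly 9.0.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted from the report, in billions of dollars.
START_BILLIONS = 2.5   # spending expected "this year"
END_BILLIONS = 9.0     # "nearly $9 billion" by 2021, rounded up here
YEARS = 5              # assumed horizon; not stated in the article

rate = cagr(START_BILLIONS, END_BILLIONS, YEARS)
print(f"Implied compound annual growth rate: {rate:.1%}")
```

Under those assumptions the implied rate comes out a little under 30%, which is consistent with the figure cited.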
There’s good reason for this growth. Instead of spending lots of time and money to build data warehouses and ETL pipelines to answer a set of specific questions from structured data, as we did in the past, data lakes flip the cost structure around. With a data lake running on Hadoop (or less commonly, on a NoSQL database or an object-based storage system), we can throw all types of data into a single repository, and worry about making sense of the data later.
The data lake idea makes a lot of sense. As data volumes increase, it becomes prohibitively expensive to move the data from where it originated to different clusters for analysis. Nobody likes ETL: it’s difficult and brittle and slow and expensive. If one were to develop a big data system from scratch, it would make much more sense to bring the compute to the data, rather than the other way around. Isn’t that the promise of Hadoop?
Data lakes sound great in theory, but they have some problems as they exist in the real world. While nobody seems to be championing the proliferation of data silos, the fact remains that data storage is becoming more fragmented, not less, as time goes on and the rate of data growth accelerates.
A Hole in the Bucket
Some in the industry are starting to point out flaws in the data lake theory, including Marc Linster, the senior VP of products and services at EnterpriseDB, the company behind the open source Postgres RDBMS.
The proliferation of NoSQL databases in recent years has exacerbated data management issues, not helped them, says Linster, who cites a 2013 Gartner study that found the spread of NoSQL databases will hurt corporate data governance efforts.
“These data islands have quickly developed,” Linster tells Datanami. “Very few companies have the time to actually bring everything into one system. They don’t have the necessary resources and timeframes that are required to do the classical merging of the data, the landing and staging and the integrating of the data. That is not there anymore.”
Instead of building massive, singular data lakes, EnterpriseDB encourages its customers to adopt a federated style of storage and processing, which lets data reside where it fits best, while using software to connect the sources.
“Just accept that data as it’s put into certain places, and that’s where it’s being managed,” Linster continues. “But now you need to access it and you need to leverage it, and with a data adapter, we no longer are in a situation where you acutely need to transport that data or replicate it. Now you can just use it wherever it is.”
With a set of recently released “foreign data wrappers” that enable Postgres to access data in Hadoop, MongoDB, and MySQL, EnterpriseDB is enabling organizations to build virtual data lakes, instead of actual ones. EnterpriseDB’s FDW for Hadoop (which is currently read-only) was checked by Hortonworks to ensure it was compatible with Spark. The FDWs for MongoDB and MySQL can read and write data.
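The general idea of querying data where it lives, rather than copying it into one store first, can be illustrated with a loose analogy. The sketch below does not use EnterpriseDB's foreign data wrappers or Postgres at all. It uses SQLite's `ATTACH DATABASE` from Python's standard library to join tables held in two separate in-memory databases. The table names, schema names, and sample rows are all hypothetical.

```python
import sqlite3

# One connection acts as the "hub" engine; its default schema is "main".
conn = sqlite3.connect(":memory:")

# Attach a second, independent in-memory database as a stand-in for a
# separate data silo. "sales_silo" is a hypothetical name.
conn.execute("ATTACH DATABASE ':memory:' AS sales_silo")

# Each store keeps its own table; nothing is copied between them.
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales_silo.orders (customer_id INTEGER, amount REAL)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO sales_silo.orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)

# A single query spans both stores, addressing each by its schema name.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.customers AS c
    JOIN sales_silo.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

for name, total in rows:
    print(name, total)
```

A real federated setup also has to deal with network latency, differing data models, and write semantics across heterogeneous systems, none of which this toy example covers.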
EnterpriseDB isn’t the only company advocating the federated approach. Teradata (NYSE: TDC) has espoused the same approach with its QueryGrid capability, which enables customers to push queries from the Teradata environment down to Hadoop and other data repositories, and to bring the results back to Teradata. HPE also enabled some push-down processing with the latest version of Vertica, which it’s in the process of selling to Micro Focus.
Hadoop on the Spectrum
One of the recurring themes emerging from Hadoop’s ascent into the enterprise is the sheer difficulty in managing the infrastructure, let alone getting the requisite data science talent needed to turn massive sets of data into actionable business insight.
According to Ashish Palekar, the senior director of product management for the scale-out Isilon storage platform at EMC, the bigger companies are better equipped to get value out of Hadoop-based data lakes than their smaller brethren.
The bigger customers “have a lot of IT resources who are focused on operationalizing and making it simple,” says Palekar, who adds that 15% to 20% of EMC’s 10,000 Isilon customers are running HDFS on their clusters. “But not every business has that capability. In fact I would say the majority of businesses don’t have the full capability of making this scale-up process operationally simple enough.”
EMC sees value in both the physical data lake and the federated data warehouse models. But to get the full value out of data, it’s better to have it in one place, Palekar says.
“To really take advantage and get the benefit from the data you’re collecting, you need to manage and scale your environment,” he says. “Second you need your infrastructure to support not having multiple copies of data moving from one place to another. Third you need flexibility in terms of how your compute and storage are scaled up. We think the data lake concept gives you that advantage.”
Data Lake ROI
While the regimented data lake concept has its obvious advantages over the more natural siloed approach, many organizations are having difficulty managing their lakes and getting value out of them.
Ajay Anand, the VP of products at OLAP on Hadoop vendor Kyvos Insights, says the build-out of Hadoop-based data lakes is not going quite as fast as was initially expected. “People are finding it hard to get the value out of Hadoop,” he says.
The Hadoop skills gap is definitely playing a role here, as well as the lack of easy-to-use tools from the Hadoop ecosystem, says Anand, who founded Datameer and was one of the engineers who worked on the first Hadoop cluster at Yahoo.
“We’re hearing from customers that they’re a little disappointed from the ROI that they got from initial data lake they have constructed,” Anand continues. “Some of those projects are slowed down. A lot of times they started creating these data lakes and they would get business sponsors who would drive the proliferation through the enterprise. But the adoption by those users has been slow, for the reasons Gartner pointed out.”
Anand says BI software has to get better at quickly delivering deep insights across vast amounts of data, which he says his company’s OLAP tool is able to do. “It’s not a problem with the data lake itself,” he says. “The ecosystem needs to provide that capability to make the data lake acceptable… The onus is on the Hadoop ecosystems to deliver things in a form where you don’t have to learn new skills. They should just be accessible to you.”
One of Kyvos’ customers in the financial services industry wanted to use Hadoop to analyze the combined risk across a variety of different asset classes, the data from which resided in separate systems. Only a centralized data lake could deliver that data in a timely manner, he says.
“It makes a lot of sense to bring that data together in one storage infrastructure and then be able to get a holistic view of all the data and the total risk involved,” he says. “There’s a lot of value to get that symmetry of data from different sources together.”
In Anand’s view, data governance poses the biggest threat to data lake success, a view that is shared by others in the industry. This problem cropped up in the early days of Hadoop, when Yahoo’s machine learning experts asked the operations folks running Yahoo’s cluster what the data meant.
“When I was at Yahoo early on in 2007 we would just load all kinds of data into the Hadoop cluster, and when the business team, PhDs who are trying to get the right algorithm to analyze the data, needed information on what this data was, we would actually write down the metadata on pieces of paper and give it to them,” Anand says. “You’ve to give somebody guidance on the data…versus just saying, ‘Here’s a vast lake of dark water where you don’t even know what’s in there.’”
Nobody knows what the future holds, but chances look good that both approaches will continue to exist. Organizations with the time, money, and discipline to construct their data lakes in an orderly fashion will gain the advantages of having vast amounts of varied data locally accessible. But at the same time, it seems likely that databases will continue to get more specialized over time, which will perpetuate the creation of independent data silos. As is often the case, there’s no one-size-fits-all answer.