The Land of a Thousand Big Data Lakes
The prospect of storing and processing all of one’s data in an enterprise data lake running on Hadoop is gaining momentum, particularly when it comes to today’s massive unstructured data flows. However, given what we know of technological evolution and human nature itself, the chance of eliminating data silos and centralizing storage and compute is slim this big-data age.
Data lakes make a lot of sense conceptually. Instead of allowing silos to perpetuate, an organization pools all of its resources together into a giant shared repository for structured, semi-structured, and unstructured data. When data gets to a certain size, just moving it becomes a burden. It’s better to keep it all in one place where it can be managed, secured, and made available to users in a controlled and predictable manner.
But as attractive as the data lake proposition is, the whole thing melts down upon closer inspection. Yes, Hadoop’s rapid evolution is lowering the barriers of entry to the types of scale-out systems that companies like Google and Facebook use to run their businesses. The Hadoop stack is the embodiment of decade’s worth of data science and represents the future of big data analytics.
However, just as technological barriers have lowered and big storage needs have skyrocketed for entire companies and organizations, these same dynamics are occurring for individual departments and groups. And therein lies the dilemma. What makes sense for the company or organization also makes sense for smaller groups within.
Big corporations have been working to solve the problem of the proliferation of data silos and the lack of unified master data management (MDM) for the past 30 years. So why would one think they’ve now been sufficiently solved to the point where it’s feasible to build a single all-encompassing data lake that serves the entire company?
Gartner dismissed the data lake concept with a report earlier this year that suggested users beware of the “data lake fallacy.” “The need for increased agility and accessibility for data analysis is the primary driver for data lakes,” said Andrew White, vice president and distinguished analyst at Gartner. “Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.”
In lieu of the master data management (MDM) silver bullet, companies and organizations will continue to collect and analyze data in the same ad-hoc manner they’ve been doing all their lives, says Theo Vassilakis, the co-founder and CEO of Metanautix and one of the developers of Google’s Dremel, the distributed query engine that powers Google’s BigQuery.
“The message isn’t so much there won’t be lakes, but much more that each area [of the business] will want their own, just as they wanted their own warehouse and they wanted their own marts and things of that nature,” Vassilakis tells Datanami in an interview.
The big data boom is much more than Hadoop, and is a force that’s powering the burgeoning data economy. The more creative ways that groups and individuals can come up with to generate and consume data, the more successful that group or company is going to be in this emerging data economy.
“So many more parts of a business are creating data now,” Vassilakis says. “That can be as simple as making spreadsheets online or fielding polls on Survyemonkey or using a SaaS app or developing their own mobile apps. Even getting the sense of all the data that’s been generated in the enterprise is hard and it’s accelerating.”
But it would be a mistake to assume that all this data-centric work is going to be scripted or orchestrated from the top down in an organized and controllable manner. Instead, each group is going to lurch forward in haphazard fashion, following the mantra of continuous iterative development and “fail fast” that’s burned into the new data-centric economy.
The way Vassilakis sees it, all of this new data analysis work is not going to happen in a single data lake, and it’s not going to happen in just Hadoop. Sure Hadoop will be involved, but it’s also going to involve DB2 and mainframes and Oracle and Teradata and Google Analytics and Salesforce and wherever else the data resides.
“The analyst is going to want to join that piece of data they made with whatever their established thing is,” he says. “Are you now prepared to do the work of pulling it into a Hadoop cluster to combine it with the other data?…Our view of the dynamic is, chances are you’re probably not. If you have an easy way to access that bit of data, your path will be that.”
That decentralized data architecture was the norm at Google, where Vassilakis worked, and at Facebook, where Metanautix other co-founder, Toli Lerios, worked. The notion of developing a single centralized data lake is at odds with the practical boundaries and momentum of the workplace.
This tension is evident with SaaS vendors, such as Google Analytics. Google is more than happy to help you analyze your website traffic, provided you analyze the data in Google’s application. “It’s your data. You own it,” Vassilakis says. ” But you can’t download it. There’s no immediate provision to download it. Google will be happy to help you push it into Google Compute Storage and use Google BigQuery to analyze it. It’s a way to build on the competitive value you have by giving people a richer way to interact with the data you helped them create.”
Eventually that Google Analytics data needs to meet other data to get the highest value from of it. So either the company uploads other data into Google’s cloud, or Google lets the customer download some data to their premises. Or–more than likely—it’s all of the above.
“The dynamics between that SaaS provider and the customers are comparable to the dynamics of different divisions of a corporation and its EDW,” Vassilakis says. “That story will play out again and again because it’s the same actors, but in different roles. And how it ended last time was there was no central EDW. There were different data marts. There was some level of centralization for some kinds of things, and there was some level of fragmentation as well, and each business tolerated some different factor of those two endpoints, depending on how they needed to operate.”
Analysts will take the path of least resistance when merging data from different lakes or silos, and that’s where Metanautix’s new SQL-based tool, called Quest, comes in. The software functions a bit like a distributed ETL tool, and enables analysts to grab and merge the data they need when they need it. Quest is built atop a column-oriented database, but it doesn’t store any data beyond what it needs to cache to execute its SQL queries, and it will run wherever there are Java-compatible resources, be it Hadoop, an EDW, or even a mainframe.
Quest is able to gather and join data from thousands of machines. “One of the demos we do for people is we show them instances of Quest running in different places,” Vassilakis says. “We’ll show 1,000 clusters on AWS and 10 machines in our office and 100 machines on something else. Then we’ll run a single query that goes and hits each of those three clusters.”
In the real world, each of those three clusters might be run by different organizations. “And those organizations might not be prepared to give you their data outright maybe because they don’t want to or their polices or security requirements don’t let them,” Vassilakis continues. “But maybe they’re willing to let you run queries from time to time, provided they can see what the query is and can control how often you’re running it and they can block them and log them and be able to tell which queries you ran. We think these scenarios are going to be a lot more pervasive.”
Real life is messy and rarely unfolds the way we script it. That’s not to say we shouldn’t try to improve upon the past. But when it comes to the way we create and store and analyze data, don’t hold your breath waiting for a big data lake to solve your problems.
July 8, 2020
- Circonus Announces Free 45-Day Trial of its Kubernetes Monitoring Solution
- Talend Donates Nearly $3M in Data Skills Courses, Technologies to Higher Education
- HNI Corporation Taps Ascend.io to Fuel Operational Analytics
- GridGain Announces Nebula Managed Service For Apache Ignite and GridGain In-Memory Computing Platforms
- Data Mining System Unearths US Counties Most at Risk for COVID Deaths
- Researchers Create Online Resources for Data Exploration, Visualization, and Discovery for the Pan-Cancer Project
- Looker Now Available in the Microsoft Azure Marketplace
- Exasol Launches Exasol V7
- AIMART Enhances Investigative Intelligence Portfolio with Siren Partnership
- StataCorp Contributes Data Science Software to COVID-19 Research Database for Academic and Medical Research
- Petro.ai, AWS Partner to Operationalize Modern Analytics and Reduce Legacy Infrastructure Costs
- University of Kentucky Joins National Data Collaborative for COVID-19 Research
- Logical Clocks Joins European Initiative to Bring AI to 5G Networks with Hopsworks Feature Store
- Matillion Exchange Enables Faster Data Innovation with Community Contributed Pre-Built ETL Workflows
- Yellowbrick and Sonra Partner to Modernize Data Warehouses
July 7, 2020
- Dynatrace Expands AI-Powered Observability for Kubernetes Environments
- AtScale Expands COVID-19 Insights Model to Help Global Organizations Understand the New Normal
- Qubole Expands Integration with Informatica’s AI-driven Data Engineering Integration
- Virtual Esri User Conference to Explore How GIS Interconnects the World
- Dremio to Host Cloud Data Lake Conference
Most Read Features
- Big Data File Formats Demystified
- Nvidia Destroys TPCx-BB Benchmark with GPUs
- How to Build a Better Machine Learning Pipeline
- BI Tools — Are They Enough to Build a Data-Driven Culture?
- How COVID-19 Is Impacting the Market for Data Jobs
- Databricks Brings Data Science, Engineering Together with New Workspace
- Understanding Your Options for Stream Processing Frameworks
- SAS Provides Big Data Solutions for… Bees?
- Is Python Strangling R to Death?
- Databricks Cranks Delta Lake Performance, Nabs Redash for SQL Viz
- More Features…
Most Read News In Brief
- New Report Ranks Countries by COVID-19 Safety
- Spark 3.0 Brings Big SQL Speed-Up, Better Python Hooks
- IBM Brings Back a Netezza, Attacks Yellowbrick
- Blurred Lines: SAS and Microsoft To Go Deep in Analytics Partnership
- New Map Shows Hundreds of Counties in the COVID-19 Endgame — and Thousands on the Uptick
- NIH Launches Massive Initiative for COVID-19 Patient Data Analytics
- U.S. Special Ops Launches $600M Analytics Effort
- Researchers Explore Link Between American Individualism and Poor COVID-19 Response
- Data Prep Still Dominates Data Scientists’ Time, Survey Finds
- War Unfolding for Control of Elasticsearch
- More News In Brief…
Most Read This Just In
- HSBC Joins Data Privacy Firm Privitar’s Series C Financing Round with $7M Investment
- D2iQ Unveils KUDO for Kubeflow to Accelerate Enterprise-Grade Machine Learning on Kubernetes
- SAS Debuts Tools to Gauge Risks and Impacts of Reopening
- Databricks Introduces Delta Engine, Acquires Redash
- Cloudera Debuts its Cloudera Data Platform Private Cloud
- The Linux Foundation Cloud Engineer Bootcamp Announced
- Technology Aims to Provide Cloud Efficiency for Databases During Data-Intensive COVID-19 Pandemic
- BP Invests $5M in Geospatial Analytics Software Company Satelytics
- Alation Launches Data Governance Initiatives
- New Actian Vector for Hadoop Enables Real-time and Operational Analytics
- More This Just In…