The Land of a Thousand Big Data Lakes
The prospect of storing and processing all of one’s data in an enterprise data lake running on Hadoop is gaining momentum, particularly when it comes to today’s massive unstructured data flows. However, given what we know of technological evolution and human nature itself, the chance of eliminating data silos and centralizing storage and compute is slim in this big-data age.
Data lakes make a lot of sense conceptually. Instead of allowing silos to perpetuate, an organization pools all of its resources together into a giant shared repository for structured, semi-structured, and unstructured data. When data gets to a certain size, just moving it becomes a burden. It’s better to keep it all in one place where it can be managed, secured, and made available to users in a controlled and predictable manner.
But as attractive as the data lake proposition is, the whole thing melts down upon closer inspection. Yes, Hadoop’s rapid evolution is lowering the barriers to entry for the types of scale-out systems that companies like Google and Facebook use to run their businesses. The Hadoop stack is the embodiment of a decade’s worth of data science and represents the future of big data analytics.
However, just as technological barriers have lowered and big storage needs have skyrocketed for entire companies and organizations, these same dynamics are occurring for individual departments and groups. And therein lies the dilemma. What makes sense for the company or organization also makes sense for smaller groups within.
Big corporations have been working to solve the proliferation of data silos and the lack of unified master data management (MDM) for the past 30 years. So why would one think those problems have now been solved well enough that it’s feasible to build a single all-encompassing data lake that serves the entire company?
Gartner dismissed the data lake concept with a report earlier this year that suggested users beware of the “data lake fallacy.” “The need for increased agility and accessibility for data analysis is the primary driver for data lakes,” said Andrew White, vice president and distinguished analyst at Gartner. “Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.”
In lieu of the master data management (MDM) silver bullet, companies and organizations will continue to collect and analyze data in the same ad-hoc manner they’ve been doing all their lives, says Theo Vassilakis, the co-founder and CEO of Metanautix and one of the developers of Google’s Dremel, the distributed query engine that powers Google’s BigQuery.
“The message isn’t so much there won’t be lakes, but much more that each area [of the business] will want their own, just as they wanted their own warehouse and they wanted their own marts and things of that nature,” Vassilakis tells Datanami in an interview.
The big data boom is much more than Hadoop, and is a force that’s powering the burgeoning data economy. The more creative ways a group or company finds to generate and consume data, the more successful it is going to be in this emerging data economy.
“So many more parts of a business are creating data now,” Vassilakis says. “That can be as simple as making spreadsheets online or fielding polls on SurveyMonkey or using a SaaS app or developing their own mobile apps. Even getting the sense of all the data that’s been generated in the enterprise is hard and it’s accelerating.”
But it would be a mistake to assume that all this data-centric work is going to be scripted or orchestrated from the top down in an organized and controllable manner. Instead, each group is going to lurch forward in haphazard fashion, following the mantra of continuous iterative development and “fail fast” that’s burned into the new data-centric economy.
The way Vassilakis sees it, all of this new data analysis work is not going to happen in a single data lake, and it’s not going to happen in just Hadoop. Sure, Hadoop will be involved, but so will DB2 and mainframes and Oracle and Teradata and Google Analytics and Salesforce and wherever else the data resides.
“The analyst is going to want to join that piece of data they made with whatever their established thing is,” he says. “Are you now prepared to do the work of pulling it into a Hadoop cluster to combine it with the other data?…Our view of the dynamic is, chances are you’re probably not. If you have an easy way to access that bit of data, your path will be that.”
That decentralized data architecture was the norm at Google, where Vassilakis worked, and at Facebook, where Metanautix’s other co-founder, Toli Lerios, worked. The notion of developing a single centralized data lake is at odds with the practical boundaries and momentum of the workplace.
This tension is evident with SaaS vendors, such as Google Analytics. Google is more than happy to help you analyze your website traffic, provided you analyze the data in Google’s application. “It’s your data. You own it,” Vassilakis says. “But you can’t download it. There’s no immediate provision to download it. Google will be happy to help you push it into Google Compute Storage and use Google BigQuery to analyze it. It’s a way to build on the competitive value you have by giving people a richer way to interact with the data you helped them create.”
Eventually that Google Analytics data needs to meet other data to get the highest value from it. So either the company uploads other data into Google’s cloud, or Google lets the customer download some data to their premises. Or, more than likely, it’s all of the above.
“The dynamics between that SaaS provider and the customers are comparable to the dynamics of different divisions of a corporation and its EDW,” Vassilakis says. “That story will play out again and again because it’s the same actors, but in different roles. And how it ended last time was there was no central EDW. There were different data marts. There was some level of centralization for some kinds of things, and there was some level of fragmentation as well, and each business tolerated some different factor of those two endpoints, depending on how they needed to operate.”
Analysts will take the path of least resistance when merging data from different lakes or silos, and that’s where Metanautix’s new SQL-based tool, called Quest, comes in. The software functions a bit like a distributed ETL tool, and enables analysts to grab and merge the data they need when they need it. Quest is built atop a column-oriented database, but it doesn’t store any data beyond what it needs to cache to execute its SQL queries, and it will run wherever there are Java-compatible resources, be it Hadoop, an EDW, or even a mainframe.
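The idea of joining data where it lives, rather than first copying it into one repository, can be illustrated with a small sketch. This is not Quest’s interface, which the article does not describe; it uses Python’s built-in sqlite3 module, with two separately attached in-memory databases standing in for independent silos. The database aliases (web, crm), the tables, and the sample rows are all made up for illustration.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create one connection with two independent databases attached.

    Each attached database stands in for a separate silo (hypothetical names).
    """
    conn = sqlite3.connect(":memory:")
    # Every ATTACH of ':memory:' creates a distinct, separate in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS web")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")

    conn.execute("CREATE TABLE web.visits (customer_id INTEGER, page TEXT)")
    conn.execute("CREATE TABLE crm.accounts (customer_id INTEGER, region TEXT)")

    conn.executemany(
        "INSERT INTO web.visits VALUES (?, ?)",
        [(1, "/pricing"), (1, "/docs"), (2, "/pricing")],
    )
    conn.executemany(
        "INSERT INTO crm.accounts VALUES (?, ?)",
        [(1, "EMEA"), (2, "APAC")],
    )
    return conn


def visits_by_region(conn: sqlite3.Connection) -> list:
    """Run a single SQL statement that joins tables from both silos."""
    query = """
        SELECT a.region, COUNT(*) AS visit_count
        FROM web.visits AS v
        JOIN crm.accounts AS a ON a.customer_id = v.customer_id
        GROUP BY a.region
        ORDER BY a.region
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(visits_by_region(build_demo_connection()))
```

The analyst writes one query and never moves either table into the other store, which is the "path of least resistance" the article describes, here at toy scale on a single machine.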
Quest is able to gather and join data from thousands of machines. “One of the demos we do for people is we show them instances of Quest running in different places,” Vassilakis says. “We’ll show 1,000 clusters on AWS and 10 machines in our office and 100 machines on something else. Then we’ll run a single query that goes and hits each of those three clusters.”
In the real world, each of those three clusters might be run by different organizations. “And those organizations might not be prepared to give you their data outright maybe because they don’t want to or their policies or security requirements don’t let them,” Vassilakis continues. “But maybe they’re willing to let you run queries from time to time, provided they can see what the query is and can control how often you’re running it and they can block them and log them and be able to tell which queries you ran. We think these scenarios are going to be a lot more pervasive.”
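The controls Vassilakis lists (seeing each query, limiting how often it runs, blocking it, and logging it) can be sketched as a small gatekeeper that a data owner might place in front of its own store. This is an illustrative assumption, not a description of how Quest or any particular product does it; the QueryGate class, its parameters, and the keyword-based blocking rule are hypothetical and deliberately simplistic.

```python
import time
from dataclasses import dataclass, field


@dataclass
class QueryGate:
    """Hypothetical gatekeeper a data owner runs in front of its own data."""

    max_queries_per_window: int
    window_seconds: float
    blocked_keywords: tuple = ("DROP", "DELETE")
    audit_log: list = field(default_factory=list)
    _timestamps: dict = field(default_factory=dict)

    def authorize(self, requester: str, sql: str, now: float = None) -> bool:
        """Decide whether a query may run, and record the decision."""
        now = time.monotonic() if now is None else now
        # Keep only this requester's accepted queries still inside the window.
        recent = [
            t for t in self._timestamps.get(requester, [])
            if now - t < self.window_seconds
        ]
        words = sql.upper().split()
        if any(keyword in words for keyword in self.blocked_keywords):
            verdict = "blocked"
        elif len(recent) >= self.max_queries_per_window:
            verdict = "rate_limited"
        else:
            verdict = "allowed"
            recent.append(now)
        self._timestamps[requester] = recent
        # Every request is logged, whatever the verdict, so the owner can
        # later tell which queries were attempted and by whom.
        self.audit_log.append((requester, sql, verdict))
        return verdict == "allowed"


if __name__ == "__main__":
    gate = QueryGate(max_queries_per_window=2, window_seconds=60.0)
    for statement in ("SELECT 1", "SELECT 2", "SELECT 3", "DROP TABLE t"):
        print(statement, gate.authorize("partner-a", statement))
```

A real deployment would parse the SQL rather than match keywords, but the shape is the same: the owner keeps the data, and outsiders get inspected, throttled, and audited access to it.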
Real life is messy and rarely unfolds the way we script it. That’s not to say we shouldn’t try to improve upon the past. But when it comes to the way we create and store and analyze data, don’t hold your breath waiting for a big data lake to solve your problems.