The Land of a Thousand Big Data Lakes
The prospect of storing and processing all of one’s data in an enterprise data lake running on Hadoop is gaining momentum, particularly when it comes to today’s massive unstructured data flows. However, given what we know of technological evolution and human nature itself, the chance of eliminating data silos and centralizing storage and compute is slim in this big-data age.
Data lakes make a lot of sense conceptually. Instead of allowing silos to perpetuate, an organization pools all of its resources together into a giant shared repository for structured, semi-structured, and unstructured data. When data gets to a certain size, just moving it becomes a burden. It’s better to keep it all in one place where it can be managed, secured, and made available to users in a controlled and predictable manner.
But as attractive as the data lake proposition is, the whole thing melts down upon closer inspection. Yes, Hadoop’s rapid evolution is lowering the barriers to entry for the types of scale-out systems that companies like Google and Facebook use to run their businesses. The Hadoop stack is the embodiment of a decade’s worth of data science and represents the future of big data analytics.
However, just as technological barriers have lowered and big storage needs have skyrocketed for entire companies and organizations, these same dynamics are occurring for individual departments and groups. And therein lies the dilemma. What makes sense for the company or organization also makes sense for smaller groups within.
Big corporations have been working to solve the proliferation of data silos and the lack of unified master data management (MDM) for the past 30 years. So why would one think these problems have now been solved to the point where it’s feasible to build a single all-encompassing data lake that serves the entire company?
Gartner dismissed the data lake concept with a report earlier this year that suggested users beware of the “data lake fallacy.” “The need for increased agility and accessibility for data analysis is the primary driver for data lakes,” said Andrew White, vice president and distinguished analyst at Gartner. “Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.”
In lieu of the master data management (MDM) silver bullet, companies and organizations will continue to collect and analyze data in the same ad-hoc manner they’ve been doing all their lives, says Theo Vassilakis, the co-founder and CEO of Metanautix and one of the developers of Google’s Dremel, the distributed query engine that powers Google’s BigQuery.
“The message isn’t so much there won’t be lakes, but much more that each area [of the business] will want their own, just as they wanted their own warehouse and they wanted their own marts and things of that nature,” Vassilakis tells Datanami in an interview.
The big data boom is much more than Hadoop, and is a force that’s powering the burgeoning data economy. The more creative the ways a group or company finds to generate and consume data, the more successful it is going to be in this emerging data economy.
“So many more parts of a business are creating data now,” Vassilakis says. “That can be as simple as making spreadsheets online or fielding polls on SurveyMonkey or using a SaaS app or developing their own mobile apps. Even getting the sense of all the data that’s been generated in the enterprise is hard and it’s accelerating.”
But it would be a mistake to assume that all this data-centric work is going to be scripted or orchestrated from the top down in an organized and controllable manner. Instead, each group is going to lurch forward in haphazard fashion, following the mantra of continuous iterative development and “fail fast” that’s burned into the new data-centric economy.
The way Vassilakis sees it, all of this new data analysis work is not going to happen in a single data lake, and it’s not going to happen in just Hadoop. Sure, Hadoop will be involved, but it’s also going to involve DB2 and mainframes and Oracle and Teradata and Google Analytics and Salesforce and wherever else the data resides.
“The analyst is going to want to join that piece of data they made with whatever their established thing is,” he says. “Are you now prepared to do the work of pulling it into a Hadoop cluster to combine it with the other data?…Our view of the dynamic is, chances are you’re probably not. If you have an easy way to access that bit of data, your path will be that.”
That decentralized data architecture was the norm at Google, where Vassilakis worked, and at Facebook, where Metanautix’s other co-founder, Toli Lerios, worked. The notion of developing a single centralized data lake is at odds with the practical boundaries and momentum of the workplace.
This tension is evident with SaaS vendors, such as Google Analytics. Google is more than happy to help you analyze your website traffic, provided you analyze the data in Google’s application. “It’s your data. You own it,” Vassilakis says. “But you can’t download it. There’s no immediate provision to download it. Google will be happy to help you push it into Google Compute Storage and use Google BigQuery to analyze it. It’s a way to build on the competitive value you have by giving people a richer way to interact with the data you helped them create.”
Eventually that Google Analytics data needs to meet other data to get the highest value out of it. So either the company uploads other data into Google’s cloud, or Google lets the customer download some data to their premises. Or, more than likely, it’s all of the above.
“The dynamics between that SaaS provider and the customers are comparable to the dynamics of different divisions of a corporation and its EDW,” Vassilakis says. “That story will play out again and again because it’s the same actors, but in different roles. And how it ended last time was there was no central EDW. There were different data marts. There was some level of centralization for some kinds of things, and there was some level of fragmentation as well, and each business tolerated some different factor of those two endpoints, depending on how they needed to operate.”
Analysts will take the path of least resistance when merging data from different lakes or silos, and that’s where Metanautix’s new SQL-based tool, called Quest, comes in. The software functions a bit like a distributed ETL tool, and enables analysts to grab and merge the data they need when they need it. Quest is built atop a column-oriented database, but it doesn’t store any data beyond what it needs to cache to execute its SQL queries, and it will run wherever there are Java-compatible resources, be it Hadoop, an EDW, or even a mainframe.
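The general idea of joining data where it lives, rather than first copying it all into one store, can be illustrated with a small sketch. This is not Quest’s interface, which the article does not describe. It uses Python’s built-in `sqlite3` module, and the two attached databases (`warehouse` and `web`), their tables, and the sample rows are all hypothetical stand-ins for separate systems.

```python
import sqlite3


def federated_join():
    """Run one SQL query across two separate stores (illustrative only)."""
    conn = sqlite3.connect(":memory:")

    # Each ATTACH creates an independent in-memory database. They stand in
    # for two systems that hold different pieces of the data.
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
    conn.execute("ATTACH DATABASE ':memory:' AS web")

    conn.execute("CREATE TABLE warehouse.customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE web.visits (customer_id INTEGER, pages INTEGER)")

    # Hypothetical sample rows.
    conn.executemany(
        "INSERT INTO warehouse.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO web.visits VALUES (?, ?)",
        [(1, 5), (1, 7), (2, 3)],
    )

    # A single query joins across both stores; neither table is copied
    # into the other database beforehand.
    rows = conn.execute(
        """
        SELECT c.name, SUM(v.pages) AS total_pages
        FROM warehouse.customers AS c
        JOIN web.visits AS v ON v.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, total_pages in federated_join():
        print(name, total_pages)
```

A real federated engine would have to push work out to remote systems and move only the rows it needs, which this single-process toy does not attempt.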
Quest is able to gather and join data from thousands of machines. “One of the demos we do for people is we show them instances of Quest running in different places,” Vassilakis says. “We’ll show 1,000 clusters on AWS and 10 machines in our office and 100 machines on something else. Then we’ll run a single query that goes and hits each of those three clusters.”
In the real world, each of those three clusters might be run by different organizations. “And those organizations might not be prepared to give you their data outright, maybe because they don’t want to or their policies or security requirements don’t let them,” Vassilakis continues. “But maybe they’re willing to let you run queries from time to time, provided they can see what the query is and can control how often you’re running it and they can block them and log them and be able to tell which queries you ran. We think these scenarios are going to be a lot more pervasive.”
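The kind of arrangement Vassilakis describes, in which a data owner sees each query, limits how often it runs, blocks some, and logs all of them, might look something like the following sketch. It is an assumption-laden illustration, not a description of any product: the `QueryGate` class, its parameters, and the simple keyword-based blocking rule are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class QueryGate:
    """Hypothetical gatekeeper a data owner could put in front of queries."""

    max_per_window: int            # queries allowed per requester per window
    window_seconds: float          # length of the rate-limit window
    blocked_terms: tuple = ()      # substrings the owner refuses to serve
    log: list = field(default_factory=list)
    _recent: dict = field(default_factory=dict)

    def submit(self, requester, sql, now=None):
        """Decide whether a query may run and record the decision."""
        now = time.monotonic() if now is None else now

        # Keep only this requester's allowed queries still inside the window.
        recent = [
            t for t in self._recent.get(requester, [])
            if now - t < self.window_seconds
        ]

        if any(term.lower() in sql.lower() for term in self.blocked_terms):
            decision = "blocked"
        elif len(recent) >= self.max_per_window:
            decision = "rate_limited"
        else:
            decision = "allowed"
            recent.append(now)

        self._recent[requester] = recent
        # Every request is logged, so the owner can tell which queries ran.
        self.log.append((requester, sql, decision))
        return decision


if __name__ == "__main__":
    gate = QueryGate(max_per_window=2, window_seconds=60, blocked_terms=("salary",))
    print(gate.submit("analyst", "SELECT region, COUNT(*) FROM orders GROUP BY region"))
    print(gate.submit("analyst", "SELECT salary FROM hr"))
    print(gate.log)
```

The sketch only makes the decision and keeps the audit trail. Actually executing an allowed query, and authenticating the requester, are left out.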
Real life is messy and rarely unfolds the way we script it. That’s not to say we shouldn’t try to improve upon the past. But when it comes to the way we create and store and analyze data, don’t hold your breath waiting for a big data lake to solve your problems.