The Land of a Thousand Big Data Lakes
The prospect of storing and processing all of one’s data in an enterprise data lake running on Hadoop is gaining momentum, particularly when it comes to today’s massive unstructured data flows. However, given what we know of technological evolution and human nature itself, the chance of eliminating data silos and centralizing storage and compute is slim in this big-data age.
Data lakes make a lot of sense conceptually. Instead of allowing silos to perpetuate, an organization pools all of its resources together into a giant shared repository for structured, semi-structured, and unstructured data. When data gets to a certain size, just moving it becomes a burden. It’s better to keep it all in one place where it can be managed, secured, and made available to users in a controlled and predictable manner.
But as attractive as the data lake proposition is, the whole thing melts down upon closer inspection. Yes, Hadoop’s rapid evolution is lowering the barriers to entry for the types of scale-out systems that companies like Google and Facebook use to run their businesses. The Hadoop stack is the embodiment of a decade’s worth of data science and represents the future of big data analytics.
However, just as technological barriers have lowered and big storage needs have skyrocketed for entire companies and organizations, these same dynamics are occurring for individual departments and groups. And therein lies the dilemma. What makes sense for the company or organization also makes sense for smaller groups within.
Big corporations have been working to solve the problem of the proliferation of data silos and the lack of unified master data management (MDM) for the past 30 years. So why would one think the problem has now been solved well enough that it’s feasible to build a single all-encompassing data lake that serves the entire company?
Gartner dismissed the data lake concept with a report earlier this year that suggested users beware of the “data lake fallacy.” “The need for increased agility and accessibility for data analysis is the primary driver for data lakes,” said Andrew White, vice president and distinguished analyst at Gartner. “Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.”
In lieu of the master data management (MDM) silver bullet, companies and organizations will continue to collect and analyze data in the same ad-hoc manner they’ve been doing all their lives, says Theo Vassilakis, the co-founder and CEO of Metanautix and one of the developers of Google’s Dremel, the distributed query engine that powers Google’s BigQuery.
“The message isn’t so much there won’t be lakes, but much more that each area [of the business] will want their own, just as they wanted their own warehouse and they wanted their own marts and things of that nature,” Vassilakis tells Datanami in an interview.
The big data boom is much more than Hadoop; it is a force that’s powering the burgeoning data economy. The more creative ways a group or company finds to generate and consume data, the more successful it is going to be in this emerging data economy.
“So many more parts of a business are creating data now,” Vassilakis says. “That can be as simple as making spreadsheets online or fielding polls on SurveyMonkey or using a SaaS app or developing their own mobile apps. Even getting the sense of all the data that’s been generated in the enterprise is hard and it’s accelerating.”
But it would be a mistake to assume that all this data-centric work is going to be scripted or orchestrated from the top down in an organized and controllable manner. Instead, each group is going to lurch forward in haphazard fashion, following the mantra of continuous iterative development and “fail fast” that’s burned into the new data-centric economy.
The way Vassilakis sees it, all of this new data analysis work is not going to happen in a single data lake, and it’s not going to happen in just Hadoop. Sure, Hadoop will be involved, but it’s also going to involve DB2 and mainframes and Oracle and Teradata and Google Analytics and Salesforce and wherever else the data resides.
“The analyst is going to want to join that piece of data they made with whatever their established thing is,” he says. “Are you now prepared to do the work of pulling it into a Hadoop cluster to combine it with the other data? … Our view of the dynamic is, chances are you’re probably not. If you have an easy way to access that bit of data, your path will be that.”
That decentralized data architecture was the norm at Google, where Vassilakis worked, and at Facebook, where Metanautix’s other co-founder, Toli Lerios, worked. The notion of developing a single centralized data lake is at odds with the practical boundaries and momentum of the workplace.
This tension is evident with SaaS vendors, such as Google Analytics. Google is more than happy to help you analyze your website traffic, provided you analyze the data in Google’s application. “It’s your data. You own it,” Vassilakis says. “But you can’t download it. There’s no immediate provision to download it. Google will be happy to help you push it into Google Compute Storage and use Google BigQuery to analyze it. It’s a way to build on the competitive value you have by giving people a richer way to interact with the data you helped them create.”
Eventually that Google Analytics data needs to meet other data to get the highest value from it. So either the company uploads other data into Google’s cloud, or Google lets the customer download some data to their premises. Or, more than likely, it’s all of the above.
“The dynamics between that SaaS provider and the customers are comparable to the dynamics of different divisions of a corporation and its EDW,” Vassilakis says. “That story will play out again and again because it’s the same actors, but in different roles. And how it ended last time was there was no central EDW. There were different data marts. There was some level of centralization for some kinds of things, and there was some level of fragmentation as well, and each business tolerated some different factor of those two endpoints, depending on how they needed to operate.”
Analysts will take the path of least resistance when merging data from different lakes or silos, and that’s where Metanautix’s new SQL-based tool, called Quest, comes in. The software functions a bit like a distributed ETL tool, and enables analysts to grab and merge the data they need when they need it. Quest is built atop a column-oriented database, but it doesn’t store any data beyond what it needs to cache to execute its SQL queries, and it will run wherever there are Java-compatible resources, be it Hadoop, an EDW, or even a mainframe.
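To make the idea of joining data in place concrete, here is a minimal sketch in Python. It is not Quest’s API, which the article does not describe. It uses SQLite’s `ATTACH DATABASE` as a stand-in for two separately owned stores, and the table names (`page_views`, `accounts`) and the sample rows are hypothetical.

```python
import sqlite3


def build_federated_connection():
    """Open one connection that can see two independent databases.

    Assumption: two in-memory SQLite databases stand in for separately
    managed stores (say, a web analytics store and a CRM).
    """
    conn = sqlite3.connect(":memory:")                 # first store ("main")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")  # second, separate store

    # Hypothetical tables and rows, for illustration only.
    conn.execute("CREATE TABLE main.page_views (customer_id INTEGER, views INTEGER)")
    conn.execute("CREATE TABLE crm.accounts (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.page_views VALUES (?, ?)", [(1, 40), (2, 15), (3, 7)]
    )
    conn.executemany(
        "INSERT INTO crm.accounts VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
    )
    return conn


def views_by_account(conn):
    """Run a single SQL query that joins rows living in both stores."""
    return conn.execute(
        """
        SELECT a.name, v.views
        FROM crm.accounts AS a
        JOIN main.page_views AS v ON v.customer_id = a.customer_id
        ORDER BY a.name
        """
    ).fetchall()


if __name__ == "__main__":
    print(views_by_account(build_federated_connection()))
```

The analyst writes one query and never copies either table into a central repository first. A distributed engine would push this work out to remote machines, which this single-process sketch does not attempt.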
Quest is able to gather and join data from thousands of machines. “One of the demos we do for people is we show them instances of Quest running in different places,” Vassilakis says. “We’ll show 1,000 clusters on AWS and 10 machines in our office and 100 machines on something else. Then we’ll run a single query that goes and hits each of those three clusters.”
In the real world, each of those three clusters might be run by different organizations. “And those organizations might not be prepared to give you their data outright maybe because they don’t want to or their policies or security requirements don’t let them,” Vassilakis continues. “But maybe they’re willing to let you run queries from time to time, provided they can see what the query is and can control how often you’re running it and they can block them and log them and be able to tell which queries you ran. We think these scenarios are going to be a lot more pervasive.”
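The controls Vassilakis lists (seeing each query, limiting how often it runs, blocking it, and logging it) can be sketched as a thin wrapper around a data source. This is an illustrative Python sketch under stated assumptions, not a description of how Quest or any product implements governance. The class name `GovernedSource`, its parameters, and the naive substring check for blocked tables are all hypothetical.

```python
import sqlite3
import time


class GovernedSource:
    """Hypothetical gatekeeper a data owner puts in front of its database."""

    def __init__(self, conn, max_queries_per_minute=2, blocked_tables=()):
        self.conn = conn
        self.max_queries_per_minute = max_queries_per_minute
        self.blocked_tables = tuple(t.lower() for t in blocked_tables)
        self.audit_log = []   # (timestamp, user, sql, outcome) for every attempt
        self._recent = {}     # user -> timestamps of allowed queries

    def run(self, user, sql, now=None):
        now = time.time() if now is None else now
        # Keep only this user's allowed queries from the last 60 seconds.
        window = [t for t in self._recent.get(user, []) if now - t < 60]

        # Assumption: a crude substring match stands in for real SQL parsing.
        if any(table in sql.lower() for table in self.blocked_tables):
            outcome = "blocked"
        elif len(window) >= self.max_queries_per_minute:
            outcome = "rate_limited"
        else:
            outcome = "allowed"
            window.append(now)

        self._recent[user] = window
        self.audit_log.append((now, user, sql, outcome))  # owner sees everything
        if outcome != "allowed":
            raise PermissionError(f"query {outcome}")
        return self.conn.execute(sql).fetchall()
```

The data owner keeps the data and the audit trail, while the outside analyst only ever receives query results. A real deployment would also need authentication and proper SQL parsing, which this sketch leaves out.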
Real life is messy and rarely unfolds the way we script it. That’s not to say we shouldn’t try to improve upon the past. But when it comes to the way we create and store and analyze data, don’t hold your breath waiting for a big data lake to solve your problems.