There Are Many Paths to the Data Lakehouse. Choose Wisely
You don’t need a crystal ball to see that the data lakehouse is the future. At some point soon, it will be the default way of interacting with data, combining scale with cost-effectiveness.
Also easy to predict is that some pathways to the data lakehouse will be more challenging than others.
Companies operating data silos will have the most difficulty in moving to a lakehouse architecture. Transitioning while keeping data partitioned into isolated silos results in more of a swamp than a lakehouse, with no easy way to get insights. The alternative is to invest early in rearchitecting the data structure so that all the lakehouse data is easily accessible for whatever purpose a company wants.
I believe the best approach to a data lakehouse architecture, both now and in the future and regardless of scale, is an open source route. Let me explain why.
Why Choose Data Lakehouses in the First Place?
The transition to data lakehouses is being driven by a number of factors, including their ability to handle massive volumes of data, both structured and — more importantly — unstructured.
When they’re up and running, data lakehouses enable fast query performance for both batch and streaming data, as well as support for real-time analytics, machine learning, and robust access control.
A hallmark of the data lakehouse is its ability to aggregate all of an organization’s data into a single, unified repository. By eliminating data silos, the data lakehouse can become a single source of truth.
Getting From Here to There
All these data lakehouse advantages are real, but that doesn’t mean they’re easy to come by.
Data lakehouses are hybrids combining the best elements of traditional data lakes with the best elements of data warehouses, and their complexity tends to be greater than the sum of the complexities of those two architectures. Their ability to store all kinds of data types is a huge plus, but making all that disparate data discoverable and usable is difficult. And combining batch and real-time data streams is often easier said than done.
Similarly, the promise of fast query performance can fall short when dealing with massive and highly diverse datasets. And the idea of eliminating data silos? Too often, different departments within an organization fail to integrate their data properly into the data lakehouse, or they decide to keep their data separate.
One of the biggest risks, however, is long-term flexibility. Because of the complexity involved, building a data lakehouse on a foundation of any particular vendor or technology means being locked into their technology evolution, pace of upgrades, and overall structure — forever.
The Open Source Alternative
For any organization contemplating the move to a data lakehouse architecture, it’s well worth considering an open source approach. Open source tools for the data lakehouse can be grouped into categories and include:

Query and Analytics Engines
- Presto distributed SQL query engine
- Apache Spark unified analytics engine

Table Format and Transaction Management
- Apache Iceberg high-performance format for huge analytic tables
- Delta Lake optimized storage layer
- Apache Hudi next-generation streaming data lake platform

Data Catalog and Governance
- Amundsen, an open source data catalog
- Apache Atlas metadata and big data governance framework

Machine Learning
- PyTorch machine learning framework
- TensorFlow software library for machine learning and AI
The open source tools available for building, managing, and using data lakehouses are not only reliable and mature; they have been proven at scale at some of the world’s largest internet-scale companies, including Meta, Uber, and IBM. At the same time, open source data lakehouse technologies are appropriate for organizations of any size that want to optimize their use of disparate kinds of datasets.
The advantages of open source data lakehouses include:
- Flexibility. Open source tools can be mixed and matched with one another and with vendor-specific tools. Organizations can choose the right tools for their particular needs, and they remain free to change, add, or stop using tools as those needs change over time.
- Cost effectiveness. Open source tools allow storage of huge amounts of data on relatively inexpensive Amazon S3 cloud storage.
- Up-to-date innovation. Put simply, open source is where the vast majority of data lakehouse innovation is happening, and it’s where the industry in general is moving.
- Proven resilience. The underlying data lake technology has already been proven to be resilient. The rapidly maturing data lakehouse technology builds on this resilient foundation.
- Future-proofing. Technology changes. That’s a predictable constant. Building a data lakehouse on an open source foundation means avoiding vendor lock-in and all the limitations, risks, and uncertainty that lock-in entails.
Data Lakehouses Aren’t Just for Internet-Scale Companies
To illustrate the broad applicability of open source data lakehouse technology, let me walk through an example of a hypothetical business that relies heavily on different data formats. The example is slightly contrived, but it should give a sense of how a good data architecture lets an organization gain insights quickly and act on them effectively, using cost-effective cloud storage and modern data lakehouse tools.
Imagine a chain of modern laundromats scattered across multiple states. This particular laundromat business is heavily data-driven, with an interactive mobile app that patrons use for their laundry services; internet-connected vending machines dispensing laundry supplies and snacks; and sophisticated data analytics and machine learning tools to guide management’s decisions about every aspect of the business.
They decide to do A/B testing on a new mobile app feature. They take the data from all the mobile app users across all their laundromats and ingest it into a data lake on S3, where they can store the data quite inexpensively.
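That ingest step is easy to sketch. Below is a minimal Python example, with hypothetical event fields and key layout, that serializes app events as newline-delimited JSON under date-partitioned keys, the kind of layout that lets a query engine prune partitions later. The actual S3 upload (for example, via boto3) is left as a comment.

```python
import json

def partitioned_key(event: dict, prefix: str = "ab_test_events") -> str:
    """Build a date-partitioned object key (hypothetical layout),
    e.g. ab_test_events/dt=2024-06-01/store=042/events.json."""
    dt = event["timestamp"][:10]  # ISO date portion of the timestamp
    store = event["store_id"]
    return f"{prefix}/dt={dt}/store={store}/events.json"

def to_ndjson(events: list) -> str:
    """Serialize events as newline-delimited JSON, a common lake ingest format."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)

# A hypothetical event from the mobile app
event = {"store_id": "042", "user_id": "u123", "variant": "B",
         "action": "wash_started", "timestamp": "2024-06-01T10:15:00Z"}
key = partitioned_key(event)
payload = to_ndjson([event])
# An actual upload would then call, e.g., boto3:
# s3.put_object(Bucket="laundromat-lake", Key=key, Body=payload)
```

Partitioning by date and store here is just one reasonable choice; the right partition columns depend on how the data will be queried.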
They want to answer quickly: What’s happening? Is the A/B test showing promising results? Running Presto on top of Iceberg, they query the data to get fast insights. They run some reports on the raw data, then keep an eye on the A/B test for a week, creating a dashboard that queries the data through Presto. Managers can click on the dashboard at any time to see the latest results in real time. This dashboard is powered by data directly from the data lake and took just moments to set up.
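A dashboard like that typically boils down to a per-variant aggregation. The sketch below shows what such a query might look like (table and column names are hypothetical), with a small pure-Python stand-in for the GROUP BY so the example is self-contained.

```python
from collections import defaultdict

# What the dashboard's Presto query might look like (names are hypothetical):
AB_QUERY = """
SELECT variant, COUNT(*) AS sessions, AVG(revenue) AS avg_revenue
FROM lake.app_events
GROUP BY variant
"""

def summarize(rows):
    """Local stand-in for the GROUP BY above:
    per-variant session count and average revenue."""
    counts, totals = defaultdict(int), defaultdict(float)
    for r in rows:
        counts[r["variant"]] += 1
        totals[r["variant"]] += r["revenue"]
    return {v: (counts[v], totals[v] / counts[v]) for v in counts}

rows = [{"variant": "A", "revenue": 4.0},
        {"variant": "B", "revenue": 6.0},
        {"variant": "B", "revenue": 8.0}]
print(summarize(rows))  # {'A': (1, 4.0), 'B': (2, 7.0)}
```

In practice the dashboard would run the SQL against Presto on each refresh; the point is that the same lake data serves ad hoc reports and the live dashboard without a separate copy.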
After a week, it’s clear that B is performing far above A, so they roll out the B version to everyone. They celebrate their increased income.
Now they turn to their vending machines, where they’d like to predict in real time what stock levels they should maintain in the machines. Do they need to alter the stock levels or offerings for different stores, different regions, or different days of the week?
Using PyTorch, they train a machine learning model on past data, using precision-recall testing to decide whether they need to tweak the models. Then they use Presto to understand whether there are any data quality issues in the models and to validate the precision-recall results. This process is only possible because the machine learning data is not siloed from the data analytics.
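The precision-recall check itself is simple to state. Here is a minimal sketch of the metric computation on hypothetical hold-out labels (1 = restock needed); in the scenario above the predictions would come from the trained PyTorch model rather than a hard-coded list.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical hold-out labels vs. model predictions
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
p, r = precision_recall(y_true, y_pred)
# If recall on restock events drops too low, that's the signal to retrain or tweak the model.
```

Precision answers "when the model says restock, how often is it right?" and recall answers "of the machines that truly needed restocking, how many did it catch?"; which one matters more depends on the cost of empty machines versus wasted restocking trips.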
The business has so many laundromats that querying all of its data is difficult while that data is scattered. So they reingest the data into Spark pipelines, quickly condensing it into offline reports that can be queried with Presto. They can see, clearly and at once, performance metrics across the entire chain of laundromats.
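The condensing step is essentially a rollup. The sketch below shows the shape of that aggregation in plain Python, with hypothetical record fields; a Spark pipeline would run the same logic distributed across the whole chain’s data.

```python
from collections import defaultdict

def rollup(records):
    """Condense per-store daily records into per-region totals --
    the kind of aggregation a Spark pipeline would run at scale."""
    out = defaultdict(lambda: {"washes": 0, "revenue": 0.0})
    for rec in records:
        agg = out[rec["region"]]
        agg["washes"] += rec["washes"]
        agg["revenue"] += rec["revenue"]
    return dict(out)

# Hypothetical daily records from three store-days
records = [{"region": "northeast", "washes": 120, "revenue": 540.0},
           {"region": "northeast", "washes": 95, "revenue": 410.5},
           {"region": "midwest", "washes": 80, "revenue": 350.0}]
report = rollup(records)
```

The condensed output is small enough to query interactively with Presto, while the raw per-store data stays cheap at rest in the lake.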
Looking Into the Future
Yes, that’s a dangerous thing to do, but let’s do it anyway.
I see the future of the data lakehouse as becoming an even more integrated experience, and easier to use, over time. When based on open source technologies, data lakehouses will deliver cohesive, singular experiences no matter what technology tools an organization chooses to use.
In fact, I believe that before long, the data lakehouse will be the default way of interacting with data, at any scale. Cloud and open source companies will continue making data lakehouses so easy to use that any organization, of any size and with any business model, can adopt one from day one of its operations.
Data lakehouses won’t solve every business challenge an organization faces, and open source tools won’t solve every data architecture challenge. But data lakehouses built on open source technologies will make the move to a modern data architecture smoother, more economical, and more hassle-free than any other approach.
About the author: Tim Meehan is a Software Engineer at IBM working on the core Presto engine. He is also the Chairperson of the Technical Steering Committee of the Presto Foundation, which hosts Presto under the Linux Foundation. As the chair and a Presto committer, he works with other foundation members to drive the technical direction and roadmap of Presto. His interests are in Presto reliability and scalability. Previously, he was a software engineer for Meta.