A Truce in the Cloud Data Lake Vs. Data Warehouse War?
At the 2nd Annual Semantic Layer Summit, which took place April 26, AtScale founder and CTO Dave Mariani sat down with Bill Inmon, recognized by many as the father of the data warehouse, to discuss the evolution of modern cloud data platforms. The two did their best to dissect the origins of the conflict between a cloud data lake approach and a cloud data warehouse approach. Here is a preview of their discussion.
Dave Mariani: Bill, controversy around data architecture is not new to you. Before we launch into the current philosophical debate around Data Warehouse or Data Lakehouse, let’s revisit the original debate with the Inmon vs. Kimball method. Can you help us understand what that debate was about?
Bill Inmon: Let’s talk about the great Inmon-Kimball debate. Ralph Kimball was answering a different question than I was answering; Ralph was answering the question, “how do we quickly produce analytical systems?” I was answering the question, “how do we produce data that applies across the enterprise?” Now those two questions may sound the same – but they’re not the same at all.
Many may be surprised to know that I often recommend the Kimball architecture. For example, when a company says to me: “Bill, we’ve got some applications here, and we need to create analytical systems from them,” it’s easy to see that they need a Kimball architecture. On the other hand, when a company is looking to answer a question about how many customers they have, or how many products they have, or what the sales figures are, then they need the Inmon architecture. With the Kimball approach, analytical systems can be created quickly, but with the Inmon approach, a company is building an architecture that enables it to answer these and future questions.
Mariani: So, would you then say that the Kimball method is more suitable for departmental or business-level analysis, versus the Inmon method, which is better suited for enterprise-wide analysis?
Inmon: Absolutely. If you want enterprise data, you need the Inmon approach. If you want quick results for a department, then you need the Kimball approach.
Mariani: Ok. So it sounds like this was a debate about data architecture and approach. We have another similar philosophical debate going on today about what the proper approach is to deliver self-service analytics to a business. I’m talking about Data Warehouses vs. Data Lakehouses. Can you explain what the fuss is all about?
Inmon: In today’s world, we have several technologies – such as AI, ML, data mesh, etc. – competing for attention, and vendors telling companies how much each will help them. The problem is that all of those technologies depend on data – and if you don’t have data that’s believable, you get the old axiom of garbage in, garbage out.
To make these technologies work, data is needed, but it isn’t easily available. The first obstacle to getting these tools and technologies what they need is that there are three kinds of data found in a corporation: structured data, textual data, and analog data. Each is very different from the others, and each has its own rules of engagement. The good practices you learned in the world of structured data don’t apply to the world of text. The practices that you learned in textual data don’t apply to the world of analog data, and so on.
That said, it’s not just the different kinds of data, but the integrity of the data as well.
A number of years ago, the idea was put forth to solve the problem by building a data lake. In layman’s terms, a data lake was a place where companies threw all of their data, hoping that one day they’d be able to analyze it. This data lake, however, never lived up to its promise and has been one of the colossal failures of our industry.
In a data lake, it’s nearly impossible to find the information you need. The data is simply not usable. I like to think of this as the “data swamp.”
So what can be done to get usable data out of the swamp? This is where data lakehouses come into play.
From an architectural standpoint, there’s a world of difference between a data lake and a data lakehouse. A data lakehouse needs to have an analytical infrastructure that tells users what’s actually in the data lake, how to find it, and what its meaning is. Building this infrastructure is involved, because each kind of data needs its own support: when you have structured data, you need metadata; when you have textual data, you need ontologies and taxonomies; and when you have machine-generated data, you need distillation algorithms. The point is that each of these types of data in the swamp is different from the others and needs different tools to become useful.
If only structured data is put into a data lake, then a classical data warehouse is created. But when textual data and machine-generated data are added as well, the whole demeanor of the data lakehouse changes.
So, yes, a data warehouse and a data lakehouse are very similar on the surface in terms of form and function, but there are some visceral differences between the two. Anyone who says that there’s been a truce between the data lake and the data lakehouse is flat-out wrong. It was a surrender, with the data lake people waving the white flag, saying “help me, I can’t get out of this swamp that I created.”
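Inmon’s description of the lakehouse’s analytical infrastructure (what is in the lake, how to find it, what it means, and which supporting tooling each kind of data needs) can be pictured as a small catalog. The following is only a minimal sketch in Python: the class names, fields, and example entries are hypothetical illustrations, not anything Inmon or any vendor prescribes. Only the pairing of data kinds with their supporting infrastructure comes from the interview.

```python
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    STRUCTURED = "structured"
    TEXTUAL = "textual"
    MACHINE_GENERATED = "machine-generated"


# Pairing taken from the interview: each kind of data needs different support.
REQUIRED_INFRASTRUCTURE = {
    DataKind.STRUCTURED: "metadata",
    DataKind.TEXTUAL: "ontologies and taxonomies",
    DataKind.MACHINE_GENERATED: "distillation algorithms",
}


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset in the lake (all field names are hypothetical)."""
    name: str       # what is in the lake
    kind: DataKind  # which of the three kinds of data it is
    location: str   # how to find it
    meaning: str    # what it means to the business

    def required_infrastructure(self) -> str:
        # Look up the supporting tooling this kind of data needs.
        return REQUIRED_INFRASTRUCTURE[self.kind]


def find_by_kind(catalog: list[CatalogEntry], kind: DataKind) -> list[CatalogEntry]:
    """Return every catalog entry of the given kind."""
    return [entry for entry in catalog if entry.kind is kind]


# Hypothetical example entries.
catalog = [
    CatalogEntry("orders", DataKind.STRUCTURED, "lake/sales/orders", "Customer orders"),
    CatalogEntry("support_emails", DataKind.TEXTUAL, "lake/support/emails", "Customer complaints"),
    CatalogEntry("sensor_logs", DataKind.MACHINE_GENERATED, "lake/iot/logs", "Machine telemetry"),
]

for entry in catalog:
    print(f"{entry.name}: {entry.kind.value} data, needs {entry.required_infrastructure()}")
```

Without entries like these, the files in the lake have no recorded location, meaning, or tooling, which is the “swamp” Inmon describes.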
Mariani: Basically, what you’re saying here is that a data warehouse is really appropriate for structured data, but we live in an age where we have more unstructured data than we do structured data. So a lakehouse is the right approach to go after all of it.
Why is now the time for this new approach?
Inmon: In the past, we didn’t really pay attention to textual data and machine-generated data. Even today, it’s still pretty early to be dealing with those types of data. The fact that we’re talking about it really reflects the progression our industry has made.
Mariani: I remember that earlier in my career we had tons of unstructured data and we just parked it in a data lake for use at some point in the future. We couldn’t do much with it. So you’re absolutely correct about that progression.
A lot of vendors that started out as more of a traditional data warehouse are now starting to say, “hey, we’re a lakehouse too, because we can have external tables pointing to files in the data lake.” Do you think that’s the same? Is it a fair categorization to make, or are they cheating?
Inmon: It’s just marketing speak. In my opinion, the only vendor that I’ve seen that has a legitimate claim to this is Databricks. In terms of having the foundation for it, or even having an understanding of what they should be doing, I haven’t seen that in the market as of yet.
Mariani: One of the main arguments for the data warehouse is that if you load data into a data warehouse, then it can optimize the file structure to deliver better performance and scalability. Is that true? Can a Lakehouse deliver the same performance and scalability if it has to rely on the underlying data lake’s file system? Is that a fair argument for going the data warehouse route?
Inmon: When companies begin the process of becoming a data-driven organization, they are often overwhelmed by the large amount of data they already have. It’s too much data to deal with – and it’s disorganized.
However, all is not lost.
What companies need to spend time on is taking that data and looking at it through the lens of business value, and the probability that the company will want to access it. When it comes down to it, a lot of the data that a company collects doesn’t really have any usefulness to it. Nobody’s ever going to want it for any kind of analytics. Some data has great business value, and some does not.
This initial analysis enables companies to take that huge volume of data and whittle it down to what has business value, making it easier to determine what you want to reside in your data warehouse. If you try to stuff everything in the world into your data warehouse, you will fail.
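The triage Inmon describes, judging data by its business value and by the probability that someone will want to access it, could be sketched roughly as follows. This is an illustrative Python sketch only: the 0-to-1 scores, the thresholds, and the dataset names are assumptions made for the example, not a method given in the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    business_value: float      # assumed score between 0 and 1
    access_probability: float  # assumed likelihood (0 to 1) that anyone queries it


def select_for_warehouse(
    datasets: list[Dataset],
    min_value: float = 0.5,   # hypothetical threshold
    min_access: float = 0.5,  # hypothetical threshold
) -> list[Dataset]:
    """Keep only data that is both valuable and likely to be accessed."""
    return [
        d for d in datasets
        if d.business_value >= min_value and d.access_probability >= min_access
    ]


# Hypothetical inventory of what a company has collected.
inventory = [
    Dataset("customer_master", business_value=0.9, access_probability=0.8),
    Dataset("sales_figures", business_value=0.8, access_probability=0.9),
    Dataset("old_debug_logs", business_value=0.1, access_probability=0.05),
]

selected = select_for_warehouse(inventory)
print([d.name for d in selected])
```

In practice the scores would come from conversations with the business rather than from fixed numbers. The sketch only shows the shape of the filter: everything is assessed, and only part of it is loaded.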