A Peek at the Future of the Open Data Architecture
Hadoop may have fizzled out as a data platform, but it laid the groundwork for an open data architecture that continues to grow and evolve today, largely in the cloud. We got a peek at the future of this open data architecture during the recent Subsurface conference, which featured the creators of several promising technologies for data lakes and data lakehouses.
Much of the exciting work in data architecture today is happening in the cloud. Thanks to the availability of effectively infinite object-based storage (such as S3) and limitless on-demand compute (courtesy of Docker and Kubernetes), the physical limitations of collecting, storing, and processing massive amounts of data have largely gone away. (This shift has also introduced new cost concerns, but that is a topic for another day.)
When one problem is solved, new problems typically come into view. In this case, as storage and compute have been “solved,” the focus now shifts to how best to enable the largest group of users to access and use this data in the most impactful way. For a variety of reasons, this is not a solved problem, especially in today’s burgeoning big data environments. Attempts to shoehorn legacy data management technologies and techniques into this new cloud-data paradigm have had mixed success.
In short, with the new cloud era of data upon us, the thinking goes, we need new tools and technologies to take advantage of it. That is precisely the goal of a new generation of technologists building open tools for the open data architecture. It’s also what cloud analytics vendor Dremio focused on with its Subsurface Live conference, held virtually in late July.
In a Subsurface panel on the future of the open data architecture, Gartner analyst Sanjeev Mohan talked about the future with four folks who are creating these technologies, including Wes McKinney, creator of Pandas and a co-creator of Apache Arrow; Ryan Blue, creator of the Iceberg table format; Julien Le Dem, co-creator of Parquet; and Ryan Murray, co-creator of Nessie.
“It’s very exciting to see that a journey that we started in open source many decades ago seems to be coming together,” Mohan said. “We finally seem to be in a place where, in open data architecture, we now have a set of open source projects that complete each other, and they help us build out an end-to-end solution.”
Take Apache Iceberg, for example. The technology was originally developed by engineers at Netflix and Apple to address the performance and usability challenges of using Apache Hive tables. While Hive is just one of a number of SQL analytic engines, the Hive metastore has lived on as the de facto glue connecting data stored in HDFS and S3 with modern SQL engines, like Dremio, Presto, and Spark.
Unfortunately, the Hive metastore doesn’t do well in dynamic big data environments. Changes to data have to be coordinated, which can be a complex, error-prone process. When it’s not done correctly, the data can be corrupted. As a replacement for Hive tables, Iceberg brings support for atomic transactions, which gives users that correctness guarantee.
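The correctness guarantee Iceberg provides comes down to making a table’s state an immutable snapshot and swapping the pointer to it atomically, rejecting writers who raced against a concurrent change. The following is a purely conceptual stdlib sketch of that compare-and-swap idea; the class and method names are hypothetical, and this is not Iceberg’s actual catalog API:

```python
import threading

class TableCatalog:
    """Toy catalog: a table is just a pointer to an immutable snapshot.
    Conceptual sketch of Iceberg-style atomic commits, not the real API."""

    def __init__(self, snapshot):
        self._snapshot = snapshot          # immutable tuple of data files
        self._lock = threading.Lock()

    def current(self):
        return self._snapshot

    def commit(self, expected, new_snapshot):
        """Atomically swap the snapshot pointer, but only if the table
        has not changed since the writer read it (compare-and-swap)."""
        with self._lock:
            if self._snapshot is not expected:
                return False               # conflict: caller must retry
            self._snapshot = new_snapshot
            return True

# A writer builds a new snapshot off the current one, then commits.
catalog = TableCatalog(("part-0.parquet",))
base = catalog.current()
ok = catalog.commit(base, base + ("part-1.parquet",))

# A stale writer, still holding the old snapshot, is rejected.
stale = catalog.commit(base, base + ("part-2.parquet",))
```

The key property is that readers always see either the old snapshot or the new one, never a half-written mix, which is the corruption scenario described above for uncoordinated Hive metastore updates.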
But that wasn’t enough. As we have learned, when one problem is solved, another one tends to pop up. In the case of Project Nessie, there was a need to provide version control for data stored in table formats such as Iceberg.
“When we started thinking about Project Nessie, we started really thinking about the progression of the data lake platform of the past 10 or 15 years,” said Murray, a Dremio engineer. “We’ve seen people [slowly]…building up abstractions, whether that’s abstractions to help us compute, or abstractions for things like tables and data files and that kind of stuff. We started thinking, what’s the next abstraction? What’s the thing that makes the most sense?”
For Murray, the next abstraction that was needed was a catalog that sat on top of the table formats to foster better interaction with downstream components.
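The catalog Murray describes works like Git for tables: branches are named pointers into a commit log of catalog states, so a pipeline can experiment on a branch without disturbing what everyone else reads. Here is a conceptual stdlib sketch of that model; the class, its methods, and the table/snapshot names are all hypothetical illustrations, not Nessie’s actual API:

```python
class ToyCatalog:
    """Conceptual sketch of a Nessie-style catalog: branches are named
    pointers into an append-only log of catalog states. Not the real
    Nessie API."""

    def __init__(self):
        self.commits = [{}]                       # commit 0: empty catalog
        self.branches = {"main": 0}

    def commit(self, branch, table, snapshot):
        """Record a new catalog state on a branch, like a git commit."""
        parent = self.commits[self.branches[branch]]
        state = dict(parent)
        state[table] = snapshot
        self.commits.append(state)
        self.branches[branch] = len(self.commits) - 1

    def create_branch(self, name, from_branch="main"):
        """Branching is cheap: just copy a pointer, like `git branch`."""
        self.branches[name] = self.branches[from_branch]

    def table(self, branch, table):
        return self.commits[self.branches[branch]].get(table)

catalog = ToyCatalog()
catalog.commit("main", "orders", "snap-1")
catalog.create_branch("etl")                      # isolate the pipeline
catalog.commit("etl", "orders", "snap-2")         # main is untouched
```

Because the “etl” branch only moves its own pointer, downstream consumers reading “main” never see the in-progress work until it is merged, which is the version-control behavior the panel is after.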
“Just as Ryan Blue felt that Apache Hive wasn’t well-suited for the table format–with the single point of failure, the huge number of API calls to that metastore, even the Thrift endpoint–made it really hard to scale, made it really hard to use effectively, especially in a cloud-native way,” Murray said. “So we were looking at something that was going to be cloud-native and would work with modern table formats and we could start thinking about extending to all the other wonderful things that my panel is building.”
As one of the most popular big data formats, Parquet is another technology that was originally developed for Hadoop but has continued to see wide adoption even as Hadoop adoption has tailed off, thanks to its ability to be used in cloud object stores. The columnar format gives users the ability to power through demanding analytic queries, a la Teradata, while its compression and native support for distributed file systems let it work in modern big data clusters.
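The analytic advantage of a columnar layout is easy to see in miniature: an aggregation over one field only has to scan that field’s array, and same-typed, often-repeating values compress well. The following stdlib sketch illustrates the idea (including a toy dictionary encoding); it is a conceptual illustration with made-up data, not Parquet’s actual on-disk format:

```python
# Row layout: every record read pulls all fields along with it.
rows = [
    {"user": "a", "country": "US", "spend": 12.0},
    {"user": "b", "country": "DE", "spend": 7.5},
    {"user": "c", "country": "US", "spend": 3.0},
]

# Columnar layout (Parquet-style, conceptually): one array per field,
# so a query touching one column never reads the others.
columns = {
    "user":    ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "spend":   [12.0, 7.5, 3.0],
}

# SELECT SUM(spend): scan a single contiguous array, skip everything else.
total = sum(columns["spend"])

# Repetitive columns compress well, e.g. a toy dictionary encoding of
# the country column into small integer codes.
dictionary = sorted(set(columns["country"]))                  # ["DE", "US"]
encoded = [dictionary.index(v) for v in columns["country"]]   # [1, 0, 1]
```

Real Parquet adds row groups, page-level statistics, and several encodings on top of this idea, but the column-at-a-time layout is the core of why it suits analytic scans.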
Le Dem co-developed Parquet while working at Twitter, which did much of its data analysis on either Hadoop or Vertica. Hadoop could scale for big data sets, but it lacked performance for demanding queries. Vertica was the opposite–it could handle the ad-hoc queries with good performance, but it just couldn’t handle big data.
“We were always in between the two options,” Le Dem said. “And I think some of it was making Hadoop more like warehouse. Starting from the bottom up, starting with the columnar presentation, and make it more performant, following the tracks of those columnar databases.”
While Parquet has seen tremendous adoption, there are still fundamental limitations in what it can do. “Parquet is just a file format,” Le Dem said. “It makes things more performant for the query engine, but it doesn’t deal with anything like, how do you create a table, how do you do all those things. So we needed a layer on top. It was great to see this happening in the community.”
This brings us to Apache Arrow, which was co-developed by McKinney and which Le Dem also helps develop. Arrow’s contribution to the open data architecture is a fast, standardized in-memory columnar format for sharing data among a large collection of systems and query engines. That heterogeneity is a feature of the open data architecture, Le Dem said.
“One of the drivers for this open storage architecture is people don’t just use one tool,” Le Dem said. “They [use] things like Spark, they use things like Pandas. They use warehouses, or the SQL-on-Hadoop type things, like Dremio and Presto, but also other proprietary warehouses. So there’s lots of fragmentation, but they still want to be able to use all those tools and machine learning on the same data. So having this common storage layer [Arrow] makes a lot of sense to standardize this so that you can create and transform data from various sources.”
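The point of a common in-memory format is that many tools can read the very same buffers without serializing, converting, or copying anything. The stdlib sketch below uses a typed `array` buffer and two `memoryview`s to illustrate that zero-copy sharing; it is a conceptual stand-in, not Arrow’s actual memory layout or API:

```python
from array import array

# One contiguous, typed buffer standing in for an Arrow column of doubles.
prices = array("d", [9.99, 4.50, 2.25])

# Two "engines" get zero-copy views of the same memory, the way a
# standard layout lets Pandas, Spark, or a SQL engine share one buffer.
engine_a = memoryview(prices)
engine_b = memoryview(prices)

# Both consumers read the same bytes; nothing was serialized or copied.
assert engine_a[1] == engine_b[1] == 4.50

# A change to the underlying buffer is visible through every view,
# showing they alias one allocation rather than holding private copies.
prices[1] = 5.00
```

Contrast this with the fragmented status quo Le Dem describes, where each handoff between tools costs a full encode/decode of the data.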
The need for Arrow arose in the midst of the Hadoop hype cycle. “Around six years ago, we recognized that…the community had developed Parquet as an open standard for data storage and data warehousing for data lakes and for the Hadoop ecosystem,” McKinney said.
“But we were increasingly seeing this rise of application and programming language heterogeneity, where applications are increasingly bottlenecked on moving large amounts of data between programming languages and between application processes. Going through a more expensive intermediary, like Parquet, to move data between two different steps in the application pipeline is very expensive,” he continued.
McKinney, who recently folded Ursa Computing into his new startup Voltron Data, today is working on Arrow Flight, a framework for fast data transport that sits on top of gRPC, a remote procedure call (RPC) technology that uses protocol buffers to connect distributed applications. One extension of Arrow Flight could eventually be a replacement for JDBC and ODBC, enabling fast data transfer across the board, McKinney said.
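What makes a Flight-style transport faster than JDBC/ODBC is that a whole columnar batch crosses the wire as one contiguous blob, with no per-row marshalling on either end. The stdlib sketch below frames a column of doubles with a tiny length-prefix header to illustrate that idea; the framing, function names, and header layout are invented for illustration and are not the actual Arrow Flight protocol:

```python
import struct
from array import array

def write_batch(column):
    """Frame one columnar batch: a small header (element count) followed
    by the column's raw bytes. Conceptual stand-in for a record-batch
    stream, not the real Arrow Flight wire format."""
    return struct.pack("<q", len(column)) + column.tobytes()

def read_batch(payload):
    """Reverse the framing: read the count, then reinterpret the raw
    bytes as a typed column, with no per-row decode step."""
    (count,) = struct.unpack_from("<q", payload)
    column = array("d")
    column.frombytes(payload[8:])
    assert len(column) == count
    return column

# The whole column crosses the "wire" as one contiguous buffer, which is
# the contrast with JDBC/ODBC-style row-at-a-time marshalling.
wire = write_batch(array("d", [1.0, 2.0, 3.0]))
received = read_batch(wire)
```

In the real system, gRPC streams batches like this between processes; the receiver can hand the buffer straight to any Arrow-aware engine without converting it.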
In the future, as technologies like Arrow, Iceberg, Nessie, and Parquet are built into the data ecosystem, they will enable a new generation of productivity among the developers and engineers who are tasked with building data-driven applications, Murray said.
“A lot of data engineers I interact with are thinking about how big is my Parquet file and which directory does it belong in so partitions get taken advantage of, and how do I make sure it has the right schema and all this kind of stuff,” he said. “And I think we’re so ready to just stop talking about that. So that engineers can just start writing SQL and applications on top of these things.”
Freedom of choice is a hallmark of the open data lake architecture, Dremio CTO Tomer Shiran said during his Subsurface keynote address.
“You can choose the best-of-breed engine for a given workload,” Shiran said. “Not only that, but in the future, as new engines get created, you can choose those engines as well. It becomes very easy to spin up a new engine, point it at your data, your open source Parquet files or your open source Iceberg tables, and start querying and modifying that data.”
Open data lakes and lakehouses are gaining traction in the market, and thanks to technologies like these, will become the predominant architecture in the future, predicts Dremio CEO Billy Bosworth.
“When you have these architecture shifts like we’re seeing today, from classic relational database structures into these open data lake architectures, these kinds of shifts tend to last for decades,” Bosworth said during his Subsurface session. “Our engineers and architects are building that future for all of us, a future where things are more easily accessible, where data comes in faster and the time to value on that data is rapidly improved. And it does so in a way that allows people to have best-of-breed options in the types of services that they want to use against that data.”