June 29, 2022

Why the Open Sourcing of Databricks Delta Lake Table Format Is a Big Deal

Databricks introduced Delta back in 2019 as a way to gain transactional integrity with the Parquet data table format for Spark cloud workloads. Over time, Delta evolved to become its own table format and also to become more open, but critical elements remained proprietary. But now the entirety of the Delta table format is going open source, as the creator of Delta explained this week at the Data + AI Summit.

Prior to Delta, data lakes and data warehouses occupied separate worlds, explained Michael Armbrust, a Databricks distinguished software engineer and the creator of Delta Lake, Spark SQL, and Spark Structured Streaming.

“Delta was created to unify these two worlds,” Armbrust said in his keynote address at the Moscone Center yesterday. “It brings ACID transactions to the data lake. And it brings speed and indexing and it doesn’t sacrifice in scalability or elasticity. It’s what enables the lakehouse.”

Databricks Delta, as the product was originally called, augmented the company’s Apache Spark-based cloud offering by maintaining the lineage of changes made to Parquet files, the compressed columnar data format that rose to prominence in Hadoop clusters and remains a core workhorse today in modern big data systems in the cloud.

While Parquet is “pretty cool,” it still had a bunch of problems, Armbrust said. “It’s part of a database,” he said. “[But] it turns out that a big collection of files is not a database.”

Databricks Distinguished Software Engineer Michael Armbrust at Data + AI Summit 2022

Armbrust was constantly fielding bug reports from users who complained about Spark being broken. But the problem wasn’t Spark. Instead, the root cause of the problem turned out to be deeper than that. Spark users basically were corrupting their own Parquet tables because there were no transactions, Armbrust said.

“When their jobs failed because their machine was lost and didn’t clean up after itself, multiple people would write to a table and corrupt it,” the former postdoctoral researcher at Google said. “There was no schema enforcement, so if you dropped data with any schema into the folder, it would make it impossible to read. There was a bunch of kind of complexity of working with the cloud. The Hadoop file system just wasn’t really built for it.  I’m sure people in this room remember setting the direct output committer, and if you got it wrong, things would be broken. And even just working with large tables was slow. Just listing all the files could take up to an hour.”

Armbrust discussed the problem with users at the 2017 Spark Summit (as the Data + AI Summit was previously called), and figured there had to be a better way. This was the genesis of Databricks Delta, which Armbrust introduced at the conference in 2018.

“It was one of the first fully transactional storage systems that would preserve all the best parts of the cloud,” he said. “And even better it was battle tested by hundreds of our users at massive scale.”

But Delta was “too good to keep just for Databricks,” Armbrust said. “So in 2019 we came back and announced we open sourced Delta. We didn’t just open source the protocol, the description of how different people can connect to make transactions in the systems. We actually also open s sourced our battle tested Spark reference implementation, put all of that code up on GitHub.”

Work continued on Delta. The company developed commands, such as optimize, which automatically takes tiny files and compacts them into a larger one in a transactionally consistent manner, which delivers better performance.

“We built this really cool command alongside of it called optimize zorder, which actually takes your data and maps it to a multi-dimensional space-filling curve so that you can filter efficiently in multiple dimensions,” Armburst said. “That works really well with this cool trick called data skipping, based on statistics. It’s basically like a coarse-grained index for the cloud. We added the ability to write to these tables from multiple clusters, and a whole bunch of other things that I don’t have time to talk about.”

However, all of this extra stuff wasn’t part of the technology that Databricks was contributing to the open source technology. That has now changed with Delta Lake 2.0, which Databricks announced at the show. With this product launch, all of these capabilities are now being given to anybody who wants to use them.

This approach of developing software in a proprietary manner and then open sourcing it later ensures that the software is higher quality, Databricks CEO Ali Ghodsi said.

Is Delta on a collision course with other table data formats? (vladwel/Shutterstock)

“We found that we can move faster, build the proprietary version, and then open source it when it’s battle tested, like we did with Delta,” Ghodsi said during a press conference Monday. “We can move faster that way. We can iterate quickly with our customer and get it to a mature state before we open source it.”

The move to open source Delta Lake 2.0 also puts Databricks squarely back in the conversation around open data ecosystems. The proprietary nature of Databricks’ software development practice had faced criticism from analytics query tool vendors like Dremio, which had moved to embrace Apache Iceberg, another table data format that has emerged to solve many of the same issues that Databricks addressed with its Delta format.

Dremio CTO Tomer Shiran had wondered whether Databricks was trying to lock customers in by keeping the format proprietary. That speculation is moot now that Delta Lake is open sourced.

Going forward, it will be interesting to watch how Delta and Apache Iceberg compete to win backers in the open data ecosystem. Iceberg has a bit of a head start in establishing itself as a new standard for managing big data in the cloud in a consistent way, and reigning in much of the data chaos that ensued during the Hadoop days.

Many of Databricks competitors, including Snowflake and AWS (which is also a partner with Databricks) have adopted Iceberg, which emerged in 2018 to address longstanding concerns in Apache Hive tables surrounding the correctness and consistency of the data. Whether Iceberg maintains that lead in light of the open sourcing of Delta is anybody’s guess.

Related Items:

Databricks Opens Up Its Delta Lakehouse at Data + AI Summit

Snowflake, AWS Warm Up to Apache Iceberg

How Databricks Keeps Data Quality High with Delta

Editor’s note: This story has been updated to reflect the increasing openness of Delta over time.