July 30, 2024

Polaris Catalog, To Be Merged With Nessie, Now Available on GitHub

Seven weeks after taking the wraps off Polaris Catalog at its annual user conference, Snowflake today announced that its metadata catalog for the Apache Iceberg table format is now available on GitHub and as a public preview on its cloud. The data warehousing giant also announced plans to merge Polaris with Project Nessie, a metadata catalog developed by Dremio for Iceberg, thereby helping to nip “catalog sprawl” in the bud.

Snowflake’s unveiling of Polaris at its Data Cloud Summit in early June was a watershed moment for the company, as it marked Snowflake’s full embrace of open data formats and frameworks and a departure from the company’s preference for proprietary big data formats that lock customers in.

While Snowflake’s Iceberg journey had been evolving for two years, the introduction of Polaris solidified the move to open formats, and for the first time gave Snowflake customers the option to run open-source query engines, such as Apache Spark, Apache Flink, Presto, Trino, and Dremio, on their Iceberg data, in addition to continuing to run Snowflake’s proprietary SQL query engine atop data customers store in Snowflake’s proprietary table format.

At the Data Cloud Summit, Snowflake promised to contribute the source code for Polaris Catalog to the big data community within 90 days, and it did so today, on the 50th day. The speculation is that Snowflake will contribute the code to the Apache Software Foundation.

By putting Polaris Catalog on GitHub under the permissive Apache 2.0 license, Snowflake has freed the big data community to begin using it and contributing updates and fixes back to the project. The hope is that the community will embrace Polaris as a standard for metadata catalogs, Tyler Akidau and Russell Spitzer, Snowflake principal software engineers, and Scott Teal, a product marketing manager for data lake, wrote in a Snowflake blog post today.

“Just as large communities have grown in support of open source projects for open file and table formats, there is a community emerging to collaborate on standards for metadata catalogs,” they wrote. “Diversity of ideas and community contributions creates the most interoperable catalog across the widest variety of tools.”

The authors point out that Polaris implements Iceberg’s REST catalog specification, “which means it already enables interoperability with Apache Doris, Apache Flink, Apache Spark, Daft, DuckDB, Presto, Snowflake, Starburst, Trino, Upsolver and more.” Other industry players that have committed to adding integrations to Polaris or making contributions to the project include Alation, ALTR, Atlan, Collibra, dbt Labs, data.world, Dremio, Confluent, Fivetran, Google Cloud, Immuta, Microsoft, and Salesforce, they wrote.
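
To illustrate why a shared REST specification makes engines interchangeable, here is a minimal Python sketch that builds the URLs for a few routes defined in the Iceberg REST catalog specification. The function name, the base URL, and the "demo" prefix are hypothetical and chosen only for illustration. The sketch makes no network calls and assumes a single-level namespace.

```python
from urllib.parse import quote


def rest_catalog_urls(base_url: str, prefix: str, namespace: str, table: str) -> dict:
    """Build URLs for a few routes from the Iceberg REST catalog spec.

    Illustrative helper only (not part of Polaris or Iceberg). Assumes a
    single-level namespace and a non-empty catalog prefix.
    """
    base = base_url.rstrip("/")
    # Percent-encode each path segment so unusual names stay valid in a URL.
    pfx = quote(prefix, safe="")
    ns = quote(namespace, safe="")
    tbl = quote(table, safe="")
    return {
        # Catalog configuration, typically fetched first by a client.
        "config": f"{base}/v1/config",
        # List namespaces in the catalog.
        "list_namespaces": f"{base}/v1/{pfx}/namespaces",
        # List tables inside one namespace.
        "list_tables": f"{base}/v1/{pfx}/namespaces/{ns}/tables",
        # Load one table's metadata.
        "load_table": f"{base}/v1/{pfx}/namespaces/{ns}/tables/{tbl}",
    }


# Hypothetical endpoint and names, for illustration only.
urls = rest_catalog_urls(
    "https://catalog.example.com/api/catalog/", "demo", "sales", "orders"
)
for name, url in urls.items():
    print(name, url)
```

Because any engine that speaks the specification requests the same routes, a catalog that serves them can sit behind many different query engines.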

One company that’s already made a big contribution to Polaris is Dremio, through Project Nessie, another metadata catalog developed in 2020 to work with Iceberg tables. Nessie was developed to provide a Git-like experience for data within a metadata catalog, thereby enabling users and tools to “track changes, isolate modifications with branching, merge changes for publication, and create tags for easily replicable points in time across all your tables simultaneously,” Dremio authors write in a May blog post.
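
The Git-like model described above can be sketched with a toy in-memory catalog in Python. This is not Nessie's API or implementation. The class, method names, and snapshot IDs are hypothetical, and the merge shown here simply overwrites the target's table pointers, whereas a real catalog would need to detect conflicts.

```python
class ToyVersionedCatalog:
    """Toy model of catalog-level versioning (illustrative, not Nessie)."""

    def __init__(self):
        # Each branch maps table name -> current snapshot id.
        self.branches = {"main": {}}
        # Each tag is a frozen copy of a branch's table pointers.
        self.tags = {}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        # Record a change to one table on one branch only.
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Naive merge: the source's pointers overwrite the target's.
        self.branches[target].update(self.branches[source])

    def tag(self, name, branch="main"):
        # Capture the state of all tables on a branch at this moment.
        self.tags[name] = dict(self.branches[branch])


catalog = ToyVersionedCatalog()
catalog.commit("main", "orders", "snap-1")
catalog.commit("main", "customers", "snap-1")
catalog.tag("v1")                            # replicable point in time
catalog.create_branch("etl")                 # isolate modifications
catalog.commit("etl", "orders", "snap-2")
catalog.commit("etl", "customers", "snap-2")
catalog.merge("etl")                         # publish both tables together
print(catalog.branches["main"], catalog.tags["v1"])
```

The point of the sketch is that the branch, merge, and tag operations act on the pointers for every table at once, which is what allows changes to be published across all tables simultaneously.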

Merging Nessie into Polaris helps to foster “an inclusive community dedicated to developing the most robust open source catalog for open lakehouse architectures,” the Snowflake engineers wrote. “Innovating in one project reduces catalog sprawl and enables a broader group of contributors to drive rapid advancements. This partnership not only accelerates technical progress but also brings more contributors into the Nessie community, further strengthening the growing ecosystem around Polaris.”

Tomer Shiran, a co-founder and chief product officer at Dremio, applauded the merging of Nessie into Polaris.

“As co-founders of Apache Arrow, creators of Project Nessie and significant contributors to Apache Iceberg, openness is ingrained in Dremio’s culture,” Shiran writes in the Snowflake blog. “We are delighted to support the launch of Polaris Catalog as open source under the Apache license and look forward to actively contributing to its success.

“With over four years of experience building Project Nessie as an open source Apache Iceberg Catalog, we’re excited to share its differentiated capabilities, such as catalog-level versioning, multi-engine support, multi-table transactions and Git for data, with Polaris Catalog and the broader community,” he continues.

Project Nessie will remain independent until the technical details of how to merge the two projects can be worked out, according to Read Maloney, Dremio’s chief marketing officer.

“Polaris Catalog is intended to be a community-driven open source project, as such, commitments will need to be approved by a committee that represents the community,” Maloney tells Datanami. “Snowflake and Dremio have every intent to contribute and merge Project Nessie with Polaris Catalog.”

Snowflake also announced that it has started a product preview for its Polaris-based metadata catalog service. Snowflake says that it handles the responsibilities of running the service, such as providing an endpoint and deploying bug fixes, while users get a completely portable catalog for their data that can be used with Iceberg REST catalog-compatible tools.

Snowflake users who are interested in the hosted Polaris service can check out the company’s documentation to get started.

Editor’s note: This story was updated to reflect the current status of Snowflake’s plans to contribute the Polaris Catalog to an open source foundation.
