

Data catalogs and metadata catalogs share some similarities, particularly in their nearly identical names. And while they have some common functions, there are also important differences between the two entities that big data practitioners should know about.
Metadata catalogs, which are sometimes called metastores or technical data catalogs, have been in the news lately. If you’re a regular Datanami reader (and we certainly hope you are!), you would have read a lot about metadata catalogs at the Snowflake and Databricks conferences last month, when the two competitors committed to open sourcing their respective metadata catalogs, Polaris and Unity Catalog.
So what is a metadata catalog, and why does it matter? (We’re glad you asked!) Read on to learn more.
Metadata Catalogs
A metadata catalog is the place where you store the technical metadata describing the data held in tabular structures in a data lake or lakehouse.
The most commonly used metadata catalog is the Hive Metastore, which was the central repository for metadata describing the contents of Apache Hive tables. Hive, of course, was the relational framework that allowed Hadoop users to query HDFS-based data using good old SQL, as opposed to MapReduce.
Hive and the Hive Metastore are still around, but they’re in the process of being replaced by a newer generation of technology. Table formats, such as Apache Iceberg, Apache Hudi, and Databricks Delta Lake, bring many advantages over Hive tables, including support for ACID transactions, which improves the consistency and reliability of data.
These table formats also require a technical layer–the metadata catalog–to help users know what data exists in the tables and to grant or deny access to that data. Databricks supports this function in its Unity Catalog. For Iceberg, products such as Project Nessie, which was developed by engineers at Dremio, sought to be the “transactional catalog” brokering data access to various open and commercial data engines, including Hive, Dremio, Spark, and AWS Athena (based on Presto), among others.
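The two jobs described above–telling engines what tables exist and where their data lives, and granting or denying access to them–can be illustrated with a toy in-memory sketch. All names here are hypothetical; this mirrors no real catalog’s API:

```python
# Toy metadata catalog: tracks where each table's metadata lives
# and which principals are allowed to read it. Purely illustrative.
class ToyMetadataCatalog:
    def __init__(self):
        self._tables = {}   # (namespace, name) -> metadata location
        self._grants = {}   # (namespace, name) -> set of principals

    def register_table(self, namespace, name, metadata_location):
        """Record a new table and where its metadata file lives."""
        self._tables[(namespace, name)] = metadata_location

    def list_tables(self, namespace):
        """Tell an engine what tables exist in a namespace."""
        return sorted(n for (ns, n) in self._tables if ns == namespace)

    def grant(self, namespace, name, principal):
        """Allow a principal to read a table."""
        self._grants.setdefault((namespace, name), set()).add(principal)

    def load_table(self, namespace, name, principal):
        """Broker access: return the metadata location, or refuse."""
        key = (namespace, name)
        if principal not in self._grants.get(key, set()):
            raise PermissionError(f"{principal} may not read {namespace}.{name}")
        return self._tables[key]
```

A real catalog such as Nessie or Unity Catalog layers versioning, auditing, and engine-specific integrations on top, but the brokering role–an engine must go through the catalog to find and open a table–is the same.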
Snowflake developed and released (or pledged to release, anyway) Polaris to be the standard metadata catalog for the Apache Iceberg ecosystem. Like Nessie, Polaris uses Iceberg’s open REST-based API to get access to the descriptive metadata of the Parquet data that Iceberg stores. This REST API then serves as the interface between the data stored in Iceberg tables and data processing engines, such as Snowflake’s native SQL engine as well as a variety of open-source engines.
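The Iceberg REST catalog API mentioned above defines a small set of HTTP routes that any engine can call. This sketch just builds the paths a client would request–no network involved–with a hypothetical Polaris endpoint as the base URL (the route names follow the Iceberg REST spec; the host is made up):

```python
# Hypothetical Polaris deployment; the base URL is illustrative.
BASE = "https://polaris.example.com/api/catalog"

def route(*segments):
    """Join path segments onto the catalog's v1 base URL."""
    return "/".join([BASE, "v1", *segments])

# The three lookups an engine typically makes before it can plan a
# query: discover namespaces, list tables, then load one table's
# metadata (which points at the underlying Parquet files).
list_namespaces = route("namespaces")
list_tables     = route("namespaces", "sales", "tables")
load_table      = route("namespaces", "sales", "tables", "orders")
```

Because every engine speaks to the same routes, Snowflake’s SQL engine, Spark, and Dremio can all plan queries against the same Iceberg tables without coordinating with each other.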
Data Catalogs
Data catalogs are typically third-party tools that companies use to organize all of the data they have stored across their organizations. They typically include some facility that allows users to search for data their organization may own, which means data catalogs often have some data discovery component.
Many data catalogs, such as Alation’s catalog, have also evolved to include access control functionality, as well as data lineage tracking and governance capabilities. In some cases, data management tool vendors that started out providing data governance and access control, such as Collibra, have evolved the other way, to also include data catalogs and data discovery capabilities.
And like metadata catalogs, regular data catalogs–or what some in the industry term “enterprise” data catalogs–are also fully involved in gobbling up metadata to help them track various data assets. One enterprise data catalog vendor, Atlan, focuses its efforts on unifying the metadata generated by different datasets and synchronizing them through a metadata “control plane,” thereby ensuring that the business metrics don’t get too out of whack.
By now, you’re probably wondering: what the heck is the difference?! They both track metadata, and they both have “catalog” in their name. So what actually separates a metadata catalog from a data catalog?
So What’s The Difference?!
To help us decode the differences between these two catalog types, Datanami recently talked to Felix Van de Maele, the CEO and co-founder of Collibra, one of the leading data catalog vendors in the big data space.
“They’re very different things,” Van de Maele said. “If you think about Polaris catalog and Unity Catalog from Databricks–and AWS and Google and Microsoft all have their catalogs–it’s really this idea that you’re able to store your data anywhere, on any clouds…And I can use any kind of data engine like a Databricks, like a Snowflake, like a Google, AWS, and so forth, to consume that data.”
But what Collibra and other enterprise data catalogs do is quite different, Van de Maele said.
“What we do is we provide much more of the business context,” he said. “We provide what we call that knowledge graph, that business context where you’re actually defining and managing your policies. Policies such as what’s the quality of my data? What business rules does my data need to comply to? What privacy policies does my data need to comply to? Who needs to approve it? How do we capture attestations? How do we do certification? How do I build a business glossary with business terms and clear definitions?
“That’s very different than a Polaris catalog on top of Iceberg that’s the physical metadata. And that’s a real differentiation,” he said.
Van de Maele supports the open data lakehouse architecture that has emerged, which gives customers the freedom to store their data in open table formats, such as Iceberg, Delta, and Hudi, and query it with any engine. His customers, many of which are Fortune 500 enterprises, store data across many data platforms and use the Collibra Data Intelligence platform to help control and govern access to that data.
Different Roles
Customers should understand that, while the names are similar, metadata catalogs and data catalogs play very different roles.
“The way I differentiate between the two is we do policy definition and management, they do policy enforcement,” Van de Maele said. “And actually I think that’s the right architecture.”
The metadata catalogs typically do not have functionality to allow users to set up business policies around data access. For instance, they won’t let you set up access controls to enable a marketing team to access all customer data except for anything that’s been marked “classified,” in which case it must be masked, Van de Maele said.
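The marketing-and-classified-data policy described above can be made concrete with a short, purely hypothetical enforcement sketch (column names and the masking token are illustrative):

```python
# Columns tagged "classified" in this hypothetical scenario.
CLASSIFIED = {"ssn", "card_number"}

def apply_policy(row, role):
    """Enforce: marketing may read customer rows, but any
    classified column comes back masked; others are denied."""
    if role != "marketing":
        raise PermissionError(f"role {role!r} may not read customer data")
    return {col: ("***MASKED***" if col in CLASSIFIED else val)
            for col, val in row.items()}
```

In Van de Maele’s framing, the policy itself (“mask classified columns for marketing”) is defined and managed in the enterprise data catalog; logic like the function above is what the metadata catalog or data platform executes at query time.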
“We can have marketing data in Databricks, we have marketing data in Salesforce, we have marketing data in Google, and anywhere people are using marketing data, I need to make sure that the right data is classified and masked,” he said. “So we push that down in Databricks, in Snowflake, in Google, in Amazon and in Microsoft.”
Customers could define their own data access policies without a tool like Collibra’s, Van de Maele said. After all, it’s just SQL at the end of the day. But then they would need some other method to keep track of the millions of columns spread across various data platforms. Providing insight into what data exists and where, and then ensuring customers are accessing it according to the company’s governance rules, is the role Collibra serves.
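The “it’s just SQL at the end of the day” point, and the pushdown Van de Maele describes, can be sketched as a governance layer compiling one logical masking policy into a native SQL statement for each downstream platform. The statement shape here is illustrative, not any vendor’s actual syntax:

```python
def masking_view_sql(table, columns, classified):
    """Generate a masking view: classified columns are replaced
    with a literal, everything else passes through unchanged."""
    select = ", ".join(
        f"'***' AS {c}" if c in classified else c for c in columns)
    return (f"CREATE OR REPLACE VIEW {table}_masked AS "
            f"SELECT {select} FROM {table};")
```

The same policy definition could then be compiled and pushed down to Databricks, Snowflake, BigQuery, and so on, so that enforcement happens natively in each platform rather than in a proxy.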
At the same time, Collibra depends on metadata catalogs for the enforcement mechanisms. Other enforcement mechanisms have been tried, such as proxies and drivers, Van de Maele said, but none of them has worked well.
“We think the metadata catalog approach with open table format is actually the right approach,” he said. “We want to have those data platforms be able to do that natively, otherwise scalability and performance always become a problem.”
Databricks Unity Catalog appears to be the exception here. Unity Catalog, which Databricks just open sourced last month, provides the low-level control over technical metadata as well as higher-level functions, such as data governance, access control, auditing, and lineage. In that respect, Unity Catalog appears to compete with the enterprise data catalog vendors.
Related Items:
What the Big Fuss Over Table Formats and Metadata Catalogs Is All About
Databricks to Open Source Unity Catalog
What to Look for in a Data Catalog