Data Mesh Vs. Data Fabric: Understanding the Differences
In your quest to build the best data architecture for your organization’s current and future needs, you have many options. Thanks to the malleability of software, those options are nearly infinite. But luckily for you, certain patterns have emerged from that sprawl that can help you on your data path, including data fabrics and data meshes.
At first glance, the data fabric and the data mesh concepts sound quite similar. Meshes are often made from a type of fabric, after all, and both are malleable items that can be laid atop things–in this case, your IT systems that are subject to the ever-growing data crush.
But there are fundamental differences to these two approaches, so it’s worth taking some time to learn their differences.
Forrester analyst Noel Yuhanna was among the first individuals to define the data fabric back in the mid-2000s. Conceptually, a big data fabric is essentially a metadata-driven way of connecting a disparate collection of data tools that address key pain points in big data projects in a cohesive and self-service manner. Specifically, data fabric solutions deliver capabilities in the areas of data access, discovery, transformation, integration, security, governance, lineage, and orchestration. Graph is often used to link data assets and users, too.
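The graph-linking idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the data model of any particular fabric product; the class and node names are assumptions made for the example.

```python
from collections import defaultdict

class MetadataGraph:
    """Minimal sketch of a metadata graph linking data assets to their users."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> set of connected nodes

    def link(self, user, asset):
        # Record a bidirectional relationship between a user and a data asset.
        self.edges[user].add(asset)
        self.edges[asset].add(user)

    def assets_for(self, user):
        # All assets a given user is linked to, in stable order.
        return sorted(self.edges[user])

    def consumers_of(self, asset):
        # All users linked to a given asset.
        return sorted(self.edges[asset])

graph = MetadataGraph()
graph.link("analyst_ada", "sales_warehouse.orders")
graph.link("scientist_sam", "clickstream_lake.events")
graph.link("analyst_ada", "clickstream_lake.events")

print(graph.consumers_of("clickstream_lake.events"))  # -> ['analyst_ada', 'scientist_sam']
```

Queries like `consumers_of` are what let a fabric answer lineage and access questions across otherwise disconnected repositories.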
Momentum is building behind the data fabric concept as a way to simplify access to, and management of, data in an increasingly heterogeneous environment that includes transactional and operational data stores, data warehouses, data lakes, and lakehouses. Organizations are building more data silos, not fewer, and with the growth of cloud computing, the problems surrounding data diversification are bigger than ever.
With a singular data fabric overlaid virtually atop the various data repositories, an organization can bring some semblance of unified management to the disparate data sources and downstream consumers, including data stewards, data engineers, data analysts, and data scientists. But it’s important to note that the management is unified, not the actual storage, which remains distributed.
Some tools vendors, including Informatica and Talend, offer a soup-to-nuts data fabric that encompasses many of the capabilities discussed above, while others such as Ataccama and Denodo, deliver specific pieces of the data fabric. Google Cloud is also a supporter of the data fabric approach with its new Dataplex offering. Integration among the various components in a data fabric typically is handled via APIs and through the common JSON data format.
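To make the API-and-JSON integration concrete, here is a minimal sketch of how one fabric component might hand a metadata record to another. The payload fields and the routing rule are assumptions for illustration only, not the schema of any vendor’s product.

```python
import json

# Hypothetical metadata payload a catalog component might emit over an API.
# Every field name here is illustrative, not a standard.
payload = json.dumps({
    "asset": "customers",
    "source": "postgres://crm/public.customers",
    "lineage": ["raw.crm_export"],
    "classification": "pii",
})

def route_asset(raw: str) -> str:
    """Decide which downstream fabric component should handle an asset."""
    record = json.loads(raw)
    # Assets flagged as personally identifiable go to governance first.
    if record.get("classification") == "pii":
        return "governance"
    return "integration"

print(route_asset(payload))  # -> governance
```

The point is less the routing logic than the pattern: components stay decoupled because they only agree on a shared JSON contract, not on each other’s internals.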
While a data mesh aims to solve many of the same problems as a data fabric–namely, the difficulty of managing data in a heterogeneous data environment–it tackles the problem in a fundamentally different manner. In short, while the data fabric seeks to build a single, virtual management layer atop distributed data, the data mesh encourages distributed groups of teams to manage data as they see fit, albeit with some common governance provisions.
The data mesh concept was first written down by Zhamak Dehghani, who is now the director of next tech incubation at Thoughtworks North America. Dehghani laid out many of the principles and concepts of the data mesh in her May 2019 report “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh,” which she followed up with the December 2020 report titled “Data Mesh Principles and Logical Architecture.”
The core principle driving the data mesh is rectifying the incongruence between the data lake and the data warehouse, as we wrote earlier this year. Whereas the first-generation data warehouse is designed to store largely structured data that’s used by data analysts for backward-looking SQL analytics, the second-generation data lake is used primarily to store largely unstructured data that’s used by data scientists for building predictive machine learning models. Dehghani writes about a third-generation system (Kappa) marked by real-time data flows and an embrace of cloud services, but it doesn’t solve the underlying usability gap between first- and second-generation systems.
Many organizations build and maintain elaborate ETL data pipelines in an attempt to keep the data in sync. This also drives the need for “hyper-specialized data engineers” who are tasked with keeping the byzantine system working.
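A pipeline of the sort described above reduces, at its core, to extract-transform-load steps copying operational records into an analytics-shaped store. The following is a deliberately minimal sketch; the schemas and store are invented for the example, and real pipelines multiply this pattern across hundreds of sources, which is where the brittleness comes from.

```python
warehouse = []  # stand-in for the analytics store

def extract(source):
    # Pull rows from the operational system (here, just a list of dicts).
    return list(source)

def transform(rows):
    # Rename fields and convert units to match the warehouse schema.
    return [{"order_id": r["id"], "total_cents": round(r["amount"] * 100)}
            for r in rows]

def load(rows):
    # Append the reshaped rows into the analytics store.
    warehouse.extend(rows)

operational_db = [{"id": 1, "amount": 19.99}, {"id": 2, "amount": 5.00}]
load(transform(extract(operational_db)))
print(warehouse[0])  # -> {'order_id': 1, 'total_cents': 1999}
```

Every schema change upstream forces a matching change in `transform`, which is exactly the coupling the data mesh tries to avoid.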
The key insight that Dehghani brought to bear on the problem is that data transformation cannot be hardwired into the data by engineers, but instead should be a sort of filter that is applied on a common set of data that’s available to all users. So instead of building a complex set of ETL pipelines to move and transform data to specialized repositories where the various communities can analyze it, the data is retained in roughly its original form, and a series of domain-specific teams take ownership of that data as they shape the data into a product. Dehghani’s distributed data mesh addresses this concern with a new architecture that is marked by four primary characteristics:
- Domain-oriented decentralized data ownership and architecture;
- Data as a product;
- Self-serve data infrastructure as a platform;
- Federated computational governance.
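Two of these principles–data as a product, and federated computational governance–can be sketched together: a domain team publishes a dataset with an owner and schema attached, and a shared policy check is applied uniformly across domains. All field names and the policy rule are assumptions for illustration, not part of Dehghani’s specification.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A dataset published by a domain team, with ownership metadata attached."""
    name: str
    domain: str    # the owning team (domain-oriented decentralized ownership)
    owner: str
    schema: dict
    tags: set = field(default_factory=set)

# A federated governance rule shared by all domains (hypothetical).
GLOBAL_POLICY = {"required_tags": {"sla", "pii-reviewed"}}

def passes_governance(product: DataProduct) -> bool:
    # Each domain self-serves, but every product must satisfy the shared policy.
    return GLOBAL_POLICY["required_tags"].issubset(product.tags)

orders = DataProduct(
    name="orders",
    domain="sales",
    owner="sales-data-team@example.com",
    schema={"order_id": "int", "total": "decimal"},
    tags={"sla", "pii-reviewed"},
)
print(passes_governance(orders))  # -> True
```

The governance check lives outside any one domain, which is what makes it federated rather than top-down: teams own their products, but the mesh agrees on a small set of global rules.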
In effect, the data mesh approach recognizes that only data lakes have the scalability to handle today’s analytics needs, but the top-down style of management that organizations have tried to impose on data lakes has been a failure. The data mesh tries to re-imagine that ownership structure in a bottom-up manner, empowering individual teams to build the systems that meet their own needs, albeit with some cross-team governance.
Mesh Vs. Fabric
As we can see, there are similarities between the data mesh and the data fabric approach. However, there are differences that should be taken into account too.
According to Forrester’s Yuhanna, the key difference between the data mesh and the data fabric approach is in how APIs are accessed.
“A data mesh is basically an API-driven [solution] for developers, unlike [data] fabric,” Yuhanna said. “[Data fabric] is the opposite of data mesh, where you’re writing code for the APIs to interface. On the other hand, data fabric is low-code, no-code, which means that the API integration is happening inside of the fabric without actually leveraging it directly, as opposed to data mesh.”
For James Serra, who is a data platform architecture lead at EY (Ernst & Young) and previously was a big data and data warehousing solution architect at Microsoft, the difference between the two approaches lies in which users are accessing them.
“A data fabric and a data mesh both provide an architecture to access data across multiple technologies and platforms, but a data fabric is technology-centric, while a data mesh focuses on organizational change,” Serra writes in a June blog post. “[A] data mesh is more about people and process than architecture, while a data fabric is an architectural approach that tackles the complexity of data and metadata in a smart way that works well together.”
You can simultaneously use a data mesh and a data fabric, and even a data hub, according to Eckerson Group analyst David Wells.
“First, they are concepts, not things,” Wells writes in a recent blog post, “Data Architecture: Complex vs Complicated.” “Data hub as an architectural concept is different from data hub as a database. Second, they are components, not alternatives. It is practical for architecture to include both data fabric and data mesh. They are not mutually exclusive. Finally, they are architectural frameworks, not architectures. You don’t have architecture until the frameworks are adapted and customized to your needs, your data, your processes, and your terminology.”
Both data meshes and data fabrics have a seat at the big data table. In the search for architectural concepts and architectures to support your big data projects, it all comes down to finding what works best for your own particular needs.