June 15, 2017

Breaking Down the Seven Tenets of Data Unification

Alex Woodie


One of the longstanding challenges in analytics is data unification. While federated approaches are gaining some favor, the vast majority of analytic practitioners want the data to be present in one place before analyzing it. This means that data from different entities must be unified, and that’s a problem.

In the new white paper “The Seven Tenets of Scalable Data Unification,” renowned computer scientist and Tamr cofounder and CTO Michael Stonebraker lays out the challenge in his unique and plainspoken style.

Stonebraker, who is also a Turing Award winner, an MIT professor, and the creator of Vertica, starts the white paper by describing data unification, which he says consists of seven steps: ingesting, cleansing, transformation, schema integration, deduplication, classification, and exporting (but don’t confuse these seven steps with the tenets).

Companies have traditionally used two main approaches to tackle data unification: extract, transform, and load (ETL) and master data management (MDM). Each has advantages and disadvantages, Stonebraker says.

ETL is flexible and adaptable to different data sources: programmers hand-code transformation routines to ensure each source’s schema matches the global schema chosen for the centralized data warehouse project. Because of this lack of automation, Stonebraker says few companies have the ETL bandwidth to go beyond 20 data sources.

MDM is similar to ETL in that it presupposes there’s a “master record” that every file of a particular category (such as customers, parts, and suppliers) should adhere to. But instead of using custom-written scripts, as ETL does, it relies on a series of “fuzzy merge” rules to hammer all the disparate files into the master format.
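To make the “fuzzy merge” idea concrete, here is a minimal sketch in Python. The master records, the similarity measure (difflib’s string ratio), and the 0.7 cutoff are all illustrative assumptions, not anything from the white paper:

```python
# Hypothetical MDM-style "fuzzy merge" rule: map an incoming record onto
# the master record whose name it most resembles, if the match is strong
# enough. Names and the 0.7 threshold are invented for illustration.
from difflib import SequenceMatcher

MASTER_SUPPLIERS = ["Acme Corporation", "Globex Industries"]

def fuzzy_match(name, masters=MASTER_SUPPLIERS, threshold=0.7):
    """Return the most similar master record, or None if nothing is close."""
    best, best_score = None, 0.0
    for master in masters:
        score = SequenceMatcher(None, name.lower(), master.lower()).ratio()
        if score > best_score:
            best, best_score = master, score
    return best if best_score >= threshold else None

print(fuzzy_match("Acme Corp"))      # collapses a variant onto its master
print(fuzzy_match("Initech Ltd."))   # an unmatched record gets no master
```

The brittleness Stonebraker points to lives in that threshold: every new data source tends to need its own hand-tuned rules.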

However, neither the ETL nor the MDM approach will work with all data unification challenges, particularly at scale, according to Stonebraker. These limitations are what define his Seven Tenets of Data Unification.

Because of the volume of today’s big data sets and the huge demands they place on programmers, any scalable data unification project must be automated to a large degree. It can’t rely on hand-coded scripts. That leads to Stonebraker’s first tenet:

“Any scalable system must perform the vast majority of its operations automatically.”

The variety of today’s data also presents a problem. When the drug company Novartis sought a way to unify the notes of 10,000 scientists doing “wet lab” work, it was faced with a global schema problem. In short, a flexible schema-on-read approach is the only way to handle the variety problem. That leads to Stonebraker’s second tenet:

“Schema-first products will never scale. The only option is to run a schema-last product.”
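A minimal sketch of what schema-last means in practice: records are ingested in whatever shape they arrive, and a logical field is resolved only at query time. The record shapes and field names below are invented for illustration, not taken from the Novartis case:

```python
# Schema-on-read sketch: heterogeneous lab-note records are ingested
# as-is; a schema is imposed only when a question is asked of the data.
import json

raw_notes = [
    '{"compound": "XJ-42", "scientist": "A. Rivera"}',
    '{"chem_name": "XJ-42", "author": "B. Chen", "temp_c": 21.5}',
    '{"notes": "control run, no compound"}',
]

# Ingest without a global schema: every record is just a dict.
records = [json.loads(line) for line in raw_notes]

def compound_of(rec):
    """Resolve a logical field from whichever physical key holds it."""
    for key in ("compound", "chem_name"):
        if key in rec:
            return rec[key]
    return None

print([compound_of(r) for r in records])  # ['XJ-42', 'XJ-42', None]
```

A schema-first system would have rejected two of those three records at load time; a schema-last system keeps them all and reconciles the keys later.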

While automation is a critical factor in data unification, there’s simply no way to remove human experts from the loop. In the case of Novartis, only the scientist herself is capable of confirming that a particular piece of data, such as the name of a new chemical compound, is correct and not a misspelling. That leads to the third tenet:

“Only collaborative systems can scale when domain specific operations are required.”

Scalability is a must in big data unification. When the data sets routinely exceed 10 million individual files, they are simply too big for a single computing core, let alone a single chip or a single computer, to process alone. That brings us to the fourth tenet:

“To scale, any unification computation must be run on multiple cores and multiple processors.”
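One simple way to picture this tenet: partition the records and let a pool of worker processes handle each chunk on a separate core. The `normalize` step below is a stand-in of my own for whatever heavier per-record cleansing a real unification system performs:

```python
# Minimal sketch of spreading unification work across cores: a process
# pool normalizes chunks of records in parallel. normalize() stands in
# for a heavier per-record cleansing step.
from multiprocessing import Pool

def normalize(name):
    """Lowercase and collapse whitespace in a record's name field."""
    return " ".join(name.lower().split())

if __name__ == "__main__":
    names = ["  Acme   Corp ", "GLOBEX Inc", " Initech  Ltd "]
    with Pool(processes=2) as pool:
        cleaned = pool.map(normalize, names, chunksize=1)
    print(cleaned)  # ['acme corp', 'globex inc', 'initech ltd']
```

The per-record work parallelizes trivially; the hard part, as the next tenet notes, is the pairwise matching that follows.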

Clustering algorithms lie at the heart of modern data unification and data cleansing tools. And while these algorithms are parallelized, if they’re too complex, they’ll take too long to run. That leads Stonebraker to his fifth tenet:

“Even with Tenet 4, a parallel algorithm with lower complexity than N ** 2 is required for truly scalable applications.”
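“Blocking” is one common way to get below the all-pairs N ** 2 bound, and it makes the tenet easy to see in miniature. The blocking key below (the first token of each name) is my own illustrative choice, not something prescribed by the white paper:

```python
# Illustrative "blocking" pass: group records by a cheap key in a single
# O(N) scan, then compare candidates only within each group, instead of
# comparing every record against every other record.
from collections import defaultdict
from itertools import combinations

names = ["Acme Corp", "Acme Corporation", "Globex Inc",
         "Globex Industries", "Initech", "Initech Ltd"]

blocks = defaultdict(list)
for name in names:
    blocks[name.split()[0].lower()].append(name)  # one O(N) pass

candidate_pairs = [pair for group in blocks.values()
                   for pair in combinations(group, 2)]

# 6 records would need 15 all-pairs comparisons; blocking leaves only 3.
print(len(candidate_pairs))  # 3
```

On six records the savings look trivial; on 10 million files, the gap between N ** 2 comparisons and a near-linear pass is the difference between feasible and not.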

Many MDM products use a rules-based approach to define the transformations. But with the volume and variety of data that enterprises are trying to unify these days, those approaches won’t work. That leads to Stonebraker’s sixth tenet:

“A rule system implementation will not scale. Only machine learning systems can scale to the level required by large enterprises.”
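A toy contrast with hand-written rules helps show what “machine learning” buys here: instead of a programmer guessing a match threshold, the system learns one from labeled examples. The training pairs below are invented, and real systems use far richer features and models than a single cutoff:

```python
# Toy learned-threshold matcher: given (similarity score, human label)
# pairs, pick the cutoff that classifies the labeled examples best,
# rather than hard-coding a rule. Training data is invented.
training = [
    (0.95, True), (0.88, True), (0.81, True),
    (0.70, False), (0.55, False), (0.40, False),
]

def learn_threshold(examples):
    """Return the score cutoff with the highest accuracy on the labels."""
    best_t, best_acc = 0.0, -1
    for t, _ in examples:
        acc = sum((score >= t) == label for score, label in examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

print(learn_threshold(training))  # 0.81
```

When a new data source shifts what “a match” looks like, a rule system needs a programmer; a learning system needs more labels, which scales far better.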

Finally, a data unification system must be adaptable to how the customer works. While it may be technically possible to take a “brute force” approach to updating each record as it changes, if the data is large or changes frequently, it would create a mess. That leads Stonebraker to his seventh and final tenet of data unification:

“Incremental unification in real time must be supported.”
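A rough sketch of what incremental means: when one record changes, re-run matching only against its blocking group rather than recomputing every pairwise comparison from scratch. The blocking key and record shapes are my own illustrative assumptions:

```python
# Hedged sketch of incremental unification: an upsert touches only the
# changed record's block and returns just the pairs that need rechecking,
# instead of reprocessing the whole corpus.
from collections import defaultdict

blocks = defaultdict(set)

def block_key(record):
    return record["name"].split()[0].lower()

def upsert(record, old_record=None):
    """Apply one change; return only the pairs that must be re-matched."""
    if old_record is not None:
        blocks[block_key(old_record)].discard(old_record["name"])
    key = block_key(record)
    candidates = set(blocks[key])   # touch one block, not the corpus
    blocks[key].add(record["name"])
    return [(record["name"], other) for other in candidates]

upsert({"name": "Acme Corp"})
upsert({"name": "Globex Inc"})
print(upsert({"name": "Acme Corporation"}))  # [('Acme Corporation', 'Acme Corp')]
```

The brute-force alternative the article mentions would rerun the full unification on every change, which is exactly what falls over when the data is large or changes frequently.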

According to Stonebraker, the ETL approach will fail to abide by tenets one, two, and three, while the MDM approach will fail tenets one, two, and six. The self-service data preparation approach, which has been popular in the big data arena, will fail “at least” tenets one and three. He also says all the products – ETL, MDM, and self-service data prep – are likely to fail tenets five and seven.

We’ll leave you to guess which data unification vendor’s product Stonebraker says will satisfy all seven tenets. (Hint: it starts with a “T”).

Related Items:

GE Invests in Data Prepper Tamr

Why Self-Service Prep Is a Killer App for Big Data





