How ML Helps Solve the Big Data Transform/Mastering Problem
Despite the astounding technological progress in big data analytics, we largely have yet to move past manual techniques for important tasks, such as data transformation and master data management. As data volumes grow, the productivity gap posed by manual methods grows wider, putting the dreams of AI- and machine learning-powered automation further out of reach. Can ML itself help to close that gap? Mike Stonebraker sure thinks so.
Stonebraker — who had a direct hand in creating the Postgres and Vertica databases, and is currently behind startups like Tamr, VoltDB, and Paradigm4 — is quite bullish on the potential of machine learning. “The general thinking is machine learning is the answer. Now what was the question?” he says.
In all seriousness, the data transformation and data mastering problems are quite challenging, in Stonebraker’s view. Companies across multiple industries are eager to use machine learning with their stockpiles of data to gain a competitive advantage. But the old snafus of dirty, unintegrated, incomparable, and mismatched data keep cropping up, putting a crimp in companies’ big data plans.
“I talk to a lot of data scientists who do machine learning, and they all report they spend 90% of their time finding, integrating, fixing, and cleaning their input data,” Stonebraker says. “People don’t seem to realize that a data scientist is not a data scientist. He or she is a data integrator.”
The good news is that machine learning itself can help with the data preparation that machine learning requires. The idea is to use the predictive power of algorithms to mimic some of the tasks of the human data integrator. It’s not going to be a 100% solution, but it can help to ease the work blockage and get data scientists moving towards the truly innovative work they’re being paid so well to do.
“The answer is buy ML wherever you can,” Stonebraker says. “You can apply it to help you with the transform piece of ETL, by using ML to guess the transform.”
Transforming and Mastering Data
While they are similar in certain respects, there are important differences between data mastering and data transformation, which Stonebraker touches on in a recent Tamr blog piece titled “Data Mastering at Scale.”
Data transformation is the first step in the data integration process, and the goal is to translate disparate data into a common global schema, which the organization lays out ahead of time. Automated scripts are typically employed to convert US dollars into Euros, for example, or pounds into kilograms.
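A hand-written transform script of the kind described above might look like the following minimal sketch. The field names (`price_usd`, `weight_lb`) and the exchange rate are illustrative assumptions, not part of any real schema:

```python
# Hypothetical transform script mapping a source record into a
# pre-agreed global schema. Field names and the USD->EUR rate are
# assumptions for illustration only.

USD_TO_EUR = 0.92        # assumed fixed rate for the example
LB_TO_KG = 0.45359237    # exact conversion factor

def to_global_schema(record: dict) -> dict:
    """Convert one source record into the global schema's units."""
    return {
        "price_eur": round(record["price_usd"] * USD_TO_EUR, 2),
        "weight_kg": round(record["weight_lb"] * LB_TO_KG, 3),
    }

print(to_global_schema({"price_usd": 100.0, "weight_lb": 10.0}))
# {'price_eur': 92.0, 'weight_kg': 4.536}
```

The catch, as the article goes on to explain, is that someone has to write a script like this for every source and every field — which is exactly what stops scaling.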
After the transformation phase, the analyst proceeds to data mastering. The first step often involves running a “match/merge” function to create clusters of records that correspond to the same entity, such as grouping together different but similar spellings of a name. Concepts like “edit distance” may be employed to determine how close or far apart two different entities are.
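To make the match/merge idea concrete, here is a hedged sketch: cluster name spellings whose edit (Levenshtein) distance falls under a threshold. The threshold and the greedy single-link clustering are simplifying assumptions for illustration, not how any production mastering tool actually works:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_merge(names, threshold=2):
    """Greedy match/merge: put each name into the first cluster whose
    representative (first member) is within `threshold` edits."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if edit_distance(name.lower(), cluster[0].lower()) <= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(match_merge(["Jon Smith", "John Smith", "Jane Doe", "Jon Smyth"]))
# [['Jon Smith', 'John Smith', 'Jon Smyth'], ['Jane Doe']]
```

Even this toy version shows why a fixed threshold needs human judgment: set it too high and “Jane Doe” starts merging with strangers, too low and legitimate variants stay separate.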
More rules are then used to compare the various entities to identify the best value for a given record. The company may declare that the last entry is the best, or that the most common value among a group of values is the one to use. In this way, the title of “golden record” is bestowed upon the best piece of data.
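The “most common value wins” rule mentioned above can be sketched in a few lines. The field names and the majority rule are assumptions for the sketch; real survivorship rules vary (most recent entry wins, most trusted source wins, and so on):

```python
# Illustrative "golden record" rule: for each field in a cluster of
# matched records, keep the most common non-missing value.
from collections import Counter

def golden_record(cluster):
    """Merge a cluster of matched records into one golden record."""
    fields = {k for rec in cluster for k in rec}
    golden = {}
    for field in fields:
        values = [rec[field] for rec in cluster
                  if rec.get(field) not in (None, "")]
        if values:  # skip fields with no usable value at all
            golden[field] = Counter(values).most_common(1)[0][0]
    return golden

cluster = [
    {"name": "Jon Smith",  "city": "Boston"},
    {"name": "John Smith", "city": "Boston"},
    {"name": "John Smith", "city": ""},
]
print(golden_record(cluster))
# {'name': 'John Smith', 'city': 'Boston'}
```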
This general two-step process has been used in many data warehouse implementations over the decades, and it continues to be used in the modern era of data lakes. However, ETL and data mastering have largely failed to keep up with today’s data volumes and the scale of the challenges confronting businesses, Stonebraker argues.
For instance, the need to have a global schema defined upfront stymies many ETL efforts that seek to integrate dozens or more data sources. At a certain point, programmers just can’t keep up with the volume of data transformation rules that must be set.
“If you have 10 data sources, you can imagine somebody doing it,” Stonebraker tells Datanami. “If you have 10,000, that’s out of the question.”
Obviously, a different approach is needed.
In a small organization, you may be able to get away with creating a global data schema in advance and enforcing its use throughout the organization, eliminating the need for painful, expensive ETL and data mastering projects to piece everything back together in a data warehouse. But in larger organizations, that top-down method inevitably fails, Stonebraker says.
Even if the business units in a larger organization closely resemble each other, there will be small differences in the way they record data. Those small differences will need to be accounted for and collapsed into a single trusted entity before meaningful analytics can be performed upon it. This is simply a reflection of the nature of enterprise data, Stonebraker says.
“What happens is that enterprises decompose into independent businesses, and the reason they do that is to get stuff done, because otherwise you have to ask God every time you want to do anything,” he says. “And so business agility demands a level of independence, which means that every business unit builds its own silo.”
For example, consider Toyota Motor Europe, which has separate customer support organizations in each country where it does business. The company wanted to create a master record of all the entities existing across 250 databases, which contain 30 million records in 40 different languages.
The problem for Toyota Motor Europe is that an ETL and data mastering project of this magnitude is enormous, and would consume a great deal of resources if approached the traditional way. Instead of manually mapping data transformations and applying conventional data mastering processes, the company decided to engage Tamr to help address the challenge with machine learning.
“ETL’s biggest problem is it says you have to define the global schema upfront. I don’t know how to do that at scale,” Stonebraker says. “Tamr does bottom-up matching and constructs the target schema bottom up, basically using machine learning. At scale, that’s the only possible way it’s going to work.”
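One way to get an intuition for bottom-up matching: instead of defining the target schema first, guess which columns from two sources line up by comparing the overlap of their values. The Jaccard similarity measure, the threshold, and the source/column names below are all illustrative assumptions, not Tamr’s actual method:

```python
# Hedged sketch of bottom-up schema matching: pair columns from two
# independently built sources by value overlap, rather than mapping
# them to a predefined global schema.

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def match_columns(src_a, src_b, threshold=0.5):
    """Pair up columns whose value overlap meets the threshold."""
    matches = []
    for col_a, vals_a in src_a.items():
        for col_b, vals_b in src_b.items():
            if jaccard(vals_a, vals_b) >= threshold:
                matches.append((col_a, col_b))
    return matches

# Two hypothetical sources with different column names for the same data
crm = {"cust_name": ["acme", "globex", "initech"],
       "country":   ["us", "de", "us"]}
erp = {"client": ["acme", "globex", "umbrella"],
       "nation": ["us", "de", "fr"]}

print(match_columns(crm, erp))
# [('cust_name', 'client'), ('country', 'nation')]
```

A real system would combine many such signals (column names, value distributions, learned models) and, as Stonebraker notes next, still route the ambiguous cases to a human expert.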
That doesn’t mean that machine learning provides an “easy button” to solve these hairy data integration issues. It still requires lots of data and processing power, and you typically need one of your smartest and most highly paid employees helping to guide the software to get the right answers.
“This is not cheap in processing time, but generally speaking, that’s not the high pole in the tent,” Stonebraker says. “If you have a database of German suppliers and I have a database of US suppliers and I want to do supplier mastering, then an obvious issue that will certainly come up is, is Merck in Germany the same as Merck in US? Well, only your CFO or some smart person has any chance of knowing the answer.”
These data questions can’t be outsourced to other firms, so forget about using Mechanical Turk to clean up your own data. A human must be in the loop. With that said, the human ideally will have a productivity aid. For Stonebraker, that aid is Tamr.
“That’s what Tamr does. It puts together independently constructed databases solving all of these problems,” he says. “The good news is that it used to be prohibitively expensive. But you apply ML and statistics, and it’s getting a lot cheaper. Now it’s becoming doable whereas 20 years ago, you wouldn’t even contemplate trying to do it.”