Tamr’s Data Prep Platform Gains U.S. Patent
A new approach for integrating large numbers of data sources that combines machine learning techniques with human expertise has earned U.S. patent protection.
Data preparation specialist Tamr Inc. said Thursday (Feb. 9) the U.S. Patent and Trademark Office has awarded a patent (US9,542,412) covering its “data unification” platform. The company’s machine learning approach is used to prepare data from multiple sources by “normalizing, cleaning, integrating, and de-duplicating” data sources.
“Our goal was to build an end-to-end system for enterprise-scale data curation that leveraged modern machine learning techniques to radically reduce the time and cost of producing clean, unified data sets,” explained Mike Stonebraker, Tamr’s co-founder and CTO.
The patent describes new features implemented in the company’s software. These include the techniques used to obtain training data for machine learning algorithms along with a methodology for linking attributes and database records. It also describes various methods for “pruning the large space of candidate matches for scalability and high data volume considerations,” the company said.
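To illustrate what pruning the space of candidate matches can look like, the following is a minimal sketch of "blocking," a common record-linkage technique in which only records sharing a cheap key are compared. It is not drawn from the patent or from Tamr's software; the `blocking_key` function, the field names and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name plus the ZIP code.
    name = record["name"].strip().lower()
    return (name[:3], record["zip"])


def candidate_pairs(records):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative records, not real data.
records = [
    {"name": "Acme Corp", "zip": "02139"},
    {"name": "ACME Corporation", "zip": "02139"},
    {"name": "Globex", "zip": "10001"},
    {"name": "Acme Corp", "zip": "94105"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pair(s) instead of {all_pairs}")
```

With four records, an exhaustive comparison would examine six pairs; the sketch keeps only the pair that shares a key. The savings grow quickly with data volume, though a poorly chosen key can also discard true matches.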
The data unification system features “data cleaning” of raw data that is both “dirty” and “noisy,” and makes extensive use of automation algorithms, with human intervention as needed, to scale the platform.
Other features include incremental data integration and curation. “New data sources must be integrated incrementally as they are uncovered,” the company noted. “There is never a notion of the data integration task being finished.”
The startup, based in Cambridge, Mass., was spun out of the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory in 2014. It differentiates itself from a growing number of data prep specialists that apply rules to combine a limited number of data sources. By contrast, Tamr said its approach combines machine-learning techniques with human experts. That, the startup asserts, allows it to scour data for correlations and duplications in hundreds of source files.
The U.S. patent award comes as the data preparation market is booming. Market researcher Gartner predicted last year that the self-service data preparation software sector could reach $1 billion by 2019, and that the current adoption rate of 5 percent would grow to 10 percent by 2020.
Tamr’s machine learning approach seeks to exploit the dirty data problem in pursuit of software license and maintenance revenue. The startup, which was launched by Vertica founders Stonebraker and Andy Palmer, uses a combination of machine learning algorithms and crowd-sourced human oversight to automate much of the work that goes into combining and integrating siloed, semi-structured data so that it can be more effectively utilized in analytic systems.
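One common way to pair automated matching with human oversight is to act automatically on confident model scores and route uncertain cases to reviewers. The sketch below assumes that pattern purely for illustration; the `triage` function, its thresholds and the sample scores are hypothetical and do not describe Tamr's actual implementation.

```python
def triage(scored_pairs, accept_at=0.9, reject_at=0.1):
    """Split scored candidate pairs into merged, rejected and needs-review lists.

    scored_pairs: iterable of (pair, score), where score is a model's
    estimated probability that the two records refer to the same entity.
    The thresholds are illustrative assumptions.
    """
    merged, rejected, needs_review = [], [], []
    for pair, score in scored_pairs:
        if score >= accept_at:
            merged.append(pair)        # confident match: merge automatically
        elif score <= reject_at:
            rejected.append(pair)      # confident non-match: drop automatically
        else:
            needs_review.append(pair)  # uncertain: send to a human expert
    return merged, rejected, needs_review


# Illustrative scores, not real model output.
scored = [(("a", "b"), 0.97), (("a", "c"), 0.55), (("b", "d"), 0.03)]
merged, rejected, needs_review = triage(scored)
print(merged, rejected, needs_review)
```

In a setup like this, reviewer decisions on the uncertain pairs could also be fed back as additional training data, so that fewer cases need human attention over time.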
Along with the patent award, Tamr has raised $41.2 million in two funding rounds, including a $25.2 million Series B round that closed in June 2015. Among Tamr’s early investors are Google Ventures (NASDAQ: GOOGL) and New Enterprise Associates.