July 12, 2018

How the Machine Learning Catalogs Stack Up

You can’t do anything with data – let alone use it for machine learning – if you don’t know where it is. In the age of big data, this is not a trivial matter. It is also the main driver behind the rise of machine learning data catalogs, which the analysts at Forrester recently ranked and sorted. Just a word of warning: the name at the top of the list might surprise you.

According to Michelle Goetz’s June 21 Forrester Wave report, the percentage of analytic decision makers managing more than 1 petabyte of data (either structured, semi-structured, or unstructured) essentially tripled from 2016 to 2017. That rapid growth has exposed all manner of problems in companies’ existing data management and analytic endeavors.

Two of the biggest challenges that companies face today, Goetz writes, are gathering and managing data in a governed manner on the one hand, and managing the business processes that surround the data analytics activities on the other.

“For EA [enterprise analytics] professionals, relying on people and manual processes to provision, manage, and govern data simply does not scale,” the Forrester analyst writes. “Enterprises are waking up to this fact and turning to data catalogs to democratize access to data, enable tribal data knowledge to curate information, apply data policies, and activate all data for business value quickly.”

In the 2Q 2018 Forrester Wave for Machine Learning Data Catalogs, Goetz and company identify 12 data catalog software providers that should be on your radar. The report, which you can download courtesy of Alation (which was, not coincidentally, featured in the report), ranked these vendors across 29 different factors. Based on the vendors’ scores across those factors, Forrester divided the vendors into three groups: Leaders, Strong Performers, and Contenders.

Here’s how Forrester ranked the various machine learning data catalog (MLDC) vendors across those categories:

MLDC Leaders

IBM came out on top in this particular analysis. Forrester says that Big Blue “reimagined data” with its various offerings, including the Watson Knowledge Catalog, which apparently included many of the features that Goetz was looking for. In particular, the Forrester analyst appreciated how IBM designed its user interface, which hasn’t always been a strong suit for IBM product design teams.

“The UI lets roles work the way they want to and not reorient their data sourcing, stewardship, or administrative processes to match another role’s workspace,” Goetz writes. The one caveat to IBM’s good showing is the relative newness of the Watson Knowledge Catalog, which was just launched in enterprise mode and didn’t yet have all the features needed, such as full lineage analysis (due out this summer).

Coming in right behind IBM was Reltio, which is best known for being a master data management (MDM) provider. However, Forrester says Reltio didn’t let its MDM origins prevent it from offering compelling value as a data catalog, although it does take some getting used to. Data engineers and data stewards should be comfortable in the cloud provider’s self-service setting, according to Forrester.

Unifi Software took home third place in the rankings thanks to the simplicity of its Unifi Data Platform, which elevates the user’s intent, according to Forrester. The product’s natural language interface, which allows users to ask questions about the data, drew favorable reviews from Forrester. The one drawback was the data science workbench, but the analyst group said that shouldn’t keep Unifi off customers’ shortlists, especially given the way that Unifi mixes data preparation and self-service in with the data catalog functionality.

Coming in fourth was Alation, which Forrester credits with kicking off the MLDC trend back in 2012. The analyst group was impressed with how the Alation Data Catalog provides “deep data introspection with its behavioral I/O analysis of data use and queries,” and said customer satisfaction is high with the product, which provides “strong MLDC” for making sense of vast and dispersed sources of data. However, while Forrester applauded how Alation partners with other vendors, such as Trifacta and Paxata, for data preparation functionality, it also stated that Alation would need to grow its product footprint in a maturing data catalog market.

Collibra rounds out the leaderboard with its MLDC solution. While the company is best known for providing data governance capabilities, Collibra has expanded its Data Governance Center offering to support MLDC functionality, including support for managing data models, schemas, classification, tagging, and certification. The company still has work to do to differentiate itself from strong competitors, but Forrester says it’s worth keeping Collibra on the shortlist.

MLDC Strong Performers

Informatica had a decent showing in Forrester’s review despite not taking any steps to participate in it. The company’s Enterprise Data Catalog offering has evolved from its original focus on metadata management and business glossary functionality to support the types of features that are expected in an MLDC. However, Forrester found governance and stewardship lacking.

Oracle was a bit of a surprise with its Oracle Enterprise Metadata Management (OEMM) offering. “Don’t be fooled by the legacy name,” writes Goetz, who was impressed with how OEMM incorporated metadata and models that existed in places like ETL tools and even Apache Kafka. Forrester expects Big Red to improve its showing in the future, thanks to the acquisition of


Waterline Data also impressed Goetz with its offering, which she wrote “keeps back the big data swamp monster.” The capability to automatically ingest raw data and deliver deep profiling of tracked data, while helping customers incorporate tribal knowledge that connects systems, helped to offset the one downside: three-to-four month deployments, which were longer than average. Companies that need to “de-swamp” their data lakes will appreciate the Waterline Data Catalog, Goetz writes.

Infogix is another data management vendor that’s grown its capabilities in the MLDC arena. Initially focused on data governance, Infogix has added data cataloging, quality, and stewardship capabilities, largely on the back of its acquisition of Lavastorm. Forrester reports high levels of satisfaction with its offering, Data3Sixty. While there’s room for improvement in providing a “marketplace experience” and machine learning functionality, Forrester sounds optimistic that the company will address those gaps.

Cambridge Semantics offers a range of machine learning and analytic capabilities in its Anzo Smart Data Lake offering, including text and graph analytics. You can also find ETL, data catalog, and data collaboration functionality built in that will help groups of users to interpret and standardize complex data. While customers sounded upbeat about the software, they also looked forward to refinements, according to Forrester.

Cloudera was the final player in the Strong Performer segment. Goetz was generally upbeat on the Hadoop distributor’s MLDC functionality in Cloudera Enterprise. “Cloudera offers advanced cataloging with sophisticated ML capabilities to understand, classify, and catalog data ingested into the data lake,” she writes. The one downside to Cloudera’s offering is that it assumes users have SQL and database expertise. Customers were generally appreciative of Cloudera’s offering but said there was room for improvement.

MLDC Contenders

Hortonworks was the lone entry in the Contender segment of this Forrester Wave. Goetz says the company’s Data Steward Studio offered lots of functionality for data stored within the Hortonworks ecosystem, thanks to “extensive metadata capture about data, data models, and schemas from source systems at the file, table, and column level.” However, Goetz knocked Hortonworks for the lack of visibility into data about assets, policies, and lineage.

The Forrester Wave: Machine Learning Data Catalogs, Q2 2018
