This Catalog Recommends Data with Machine Learning
Finding the right piece of data can be a big challenge. With the latest release of Collibra’s data governance and data catalog solution, machine learning algorithms help the product learn what types of data you use with the goal of surfacing and recommend new data sources that are appropriate for your job.
This use of machine learning is one of the features in Collibra 5.0, the latest release of the company’s flagship data governance solution that was formally announced yesterday. The Collibra Catalog is one of several applications built into the platform that hundreds of companies use to keep track of big data sitting in Hadoop, Hive, and other locations.
“We have a technology platform that has the capability to keep track of processes around data, the metadata and organizations and roles and who has responsibility for data,” says Daniel Sholler, director of product marketing for Collibra. “We keep track of all the technical connections in all the data because you need to know that stuff. But it turns out that stuff isn’t the interesting stuff.”
The interesting stuff for data scientists, Sholler says, is the data itself. While CIOs and auditors are interested in ensuring there’s security around data and the lineage is accurately tracked, data scientists want to get their hands on the right piece of data as quickly as they can, without going on an extensive hunting expedition.
“With big data, a lot of folks are familiar with the data at a conceptual level,” Sholler said during a briefing at the recent Strata + Hadoop World conference in New York City “They know there’s a customer churn data set out there. What they don’t know is which of the 37 data sets that are labeled ‘customer churn’ is the one they ought to use for their own purpose.”
Collibra is hoping to accelerate data scientists’ access to data by implementing an Amazon-style shopping experience with the latest release of the Collibra Catalog. The Java-based software, which can run on-premise or in the cloud, enables data analysts and scientists to search for data using an intuitive user interface, and select multiple data products to “purchase” through a check-out function.
As the users build up a search history, algorithms monitor the activity, with the goal of issuing recommendations to other data products that the user may be interested in—just like the real Amazon website.
The Collibra software does this while tracking all the data on the backend to ensure that tight controls are kept on the security, lineage, and quality of the data. In this sense, it’s all about putting guiderails around data tasks that are often neither automated nor centralized.
“In a typical data lake, there’s zero marginal cost to put more data into the data lake, so everybody just throws it in there,” Sholler says. “So now we’ve got the pit. Which [data set] is the official one? That’s a business process that we would automate…It means there’s a certain level of trust around the data set. You know the source, and where it came from.”
The idea is to provide a system of record for critical data-oriented functions similar to what corporations have in place for sales, HR, and finance functions, says Collibra co-founder and CEO Felix Van de Maele.
“Collibra 5.0 serves as that ‘system of record,’ providing a data governance backbone that helps organizations increase the value of their data and eliminate data silos,” Van de Maele says. “Our new Collibra Catalog helps eliminate one of the major pain points with which data scientists and business analysts grapple—namely, the time-intensive and tedious process of finding data—and enables them to work more quickly and strategically to solve critical business challenges.”
Collibra, which was founded in Belgium eight years ago, recently moved its headquarters to New York City. The company has customers in heavily regulated industries like healthcare and financial services, as well as higher education and high tech. In addition to a catalog, the Collibra platform includes business glossary, data dictionary, data helpdesk, policy manger, reference data, and stewardship applications.