January 15, 2019

A Personal Data Discovery Solution Powered by ML


Thanks to GDPR and other data regulations, companies are required to have a certain degree of insight into, and control over, the data they collect and store about individual people, or risk paying hefty fines. But getting the necessary level of insight and control is a big challenge, due in part to the way companies store data and the large number of data repositories they use. Now a startup called BigID is using machine learning to help companies identify everywhere an individual’s data resides and how it’s being used.

As a product category, data discovery is a well-trod area of the market. Many vendors pitch solutions that can find sensitive data, such as Social Security numbers or credit card numbers, hidden in relational databases, file systems, and the sundry other locations where companies end up stashing data.

But according to Dimitri Sirota, CEO of BigID, standard data discovery tools don’t go far enough in determining the context around the information that companies have stored about individuals. That extra granularity is necessary because of the General Data Protection Regulation (GDPR), which gives individuals new privacy controls, including the power to demand that companies tell them exactly how their data is collected and used.

“The problem with privacy is that you have to find not just PII [personally identifiable information] but PI, all personal information,” Sirota tells Datanami. “And what that means is you have to be able to figure out if it’s personal based on the context to a person.”

Standard data discovery tools can ferret out pieces of PII like names, addresses, phone numbers, and email addresses fairly easily. But under the new data privacy paradigm, individuals are empowered to know about other pieces of data, such as IP addresses, GPS coordinates, and cookies that companies often collect and increasingly use as part of their analytic workflow.
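To make the distinction concrete, here is a minimal Python sketch of the pattern-matching style of discovery described above. The patterns and the `find_patterns` helper are simplified assumptions for illustration, not taken from any vendor's product. The sketch shows that a pattern can spot an IP address as easily as an email address, but it cannot say which person that IP address belongs to; that requires context.

```python
import re

# Simplified, illustrative patterns (assumptions); real tools validate far more strictly.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_patterns(text):
    """Return a dict mapping each pattern name to the list of matches in text."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}

sample = "Contact jane@example.com, SSN 123-45-6789, last login from 203.0.113.7"
hits = find_patterns(sample)
for name, values in hits.items():
    print(name, values)
```

Each match is found in isolation: nothing in the output links the IP address to the person behind the email address.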

BigID came to market about 18 months ago with a solution designed to help companies get a handle on these types of data. Tackling this problem in a head-on, relational or SQL-esque manner would require creating and maintaining intricate tables that track each data field for each individual. Master data management (MDM) software vendors found that a nearly impossible task before the big data boom, and today, it’s simply out of the question.

BigID takes a different approach, which is to create a model of what personally identifiable data looks like, and then use algorithms to map out all instances of an individual’s PI. That map can then be used to comply with GDPR and 80 other data privacy and security regulations that are on the books or soon will be.

“We leverage machine learning to essentially get an imprint of the kind of data you care about finding,” Sirota says. “So you point us to examples or instances of information that’s representative of [what] you’re interested in finding… As part of the learning, we understand some of the relationships within the data, and then we have these algorithms to essentially go out and look in the neighborhood of the data for other data that could be relevant, that we score based on a variety of parameters, such as proximity and relevancy.”
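The article does not describe BigID's actual algorithms, so the following Python sketch is only a hypothetical illustration of the general idea of scoring nearby data by proximity and relevancy. It assumes records are rows of field values and that a set of seed values is already known to identify a person. The `score_neighbors` function and its scoring rule (the inverse of the column distance to the nearest seed, summed over every co-occurrence) are assumptions made for this example.

```python
from collections import defaultdict

def score_neighbors(records, seed_values):
    """Rank values that co-occur with a seed value in the same record.

    records: list of rows, each a list of field values in column order.
    seed_values: set of values already known to identify a person.
    Each co-occurrence adds 1 / (column distance to the nearest seed), so
    values that sit closer to a seed, and appear with it more often, rank higher.
    """
    scores = defaultdict(float)
    for row in records:
        seed_positions = [i for i, v in enumerate(row) if v in seed_values]
        if not seed_positions:
            continue  # this row has no known link to the person
        for j, value in enumerate(row):
            if value in seed_values:
                continue
            nearest = min(abs(j - i) for i in seed_positions)
            scores[value] += 1.0 / nearest
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample data
records = [
    ["jane@example.com", "203.0.113.7", "cookie-abc"],
    ["order-1", "jane@example.com", "203.0.113.7"],
    ["bob@example.com", "198.51.100.2", "cookie-xyz"],
]
ranked = score_neighbors(records, {"jane@example.com"})
for value, score in ranked:
    print(value, score)
```

In this toy data set, the IP address that appears next to the seed email in two records ranks highest, while values from rows that never mention the seed are not scored at all.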


Just as Google’s PageRank algorithm determined the relevancy of pages on the World Wide Web to help users find information they’re interested in more quickly, BigID can find individual data elements that are relevant to a company’s effort to identify PI as part of its data governance program.

“We kind of build a relevancy map around your data [that says] ‘How connected is this piece of data to that piece of data, across your files, across your relational databases, across your data warehouses, Hadoop, or SAP,’” Sirota says. “The regulations don’t say ‘Look at this data and not that data.’ The regulations say ‘Find Dimitri’s data,’ so you have to be able to look everywhere, including Splunk and wherever else it may be.”

BigID installs wherever companies have PI data stored, in the cloud or on premises. The product works constantly in the background to maintain the map of connected data across dozens of supported data sources, including relational and NoSQL databases, SMB and NFS stores, data warehouses, Google Drive, Microsoft 365, HDFS, and SaaS applications.

Sirota, who co-founded Layer7 Technologies and continued at CA Technologies following its acquisition, says no other vendors are tackling the data privacy and governance challenge in quite the way BigID is doing.

“We have more support for data systems than any other tool, and it’s because we don’t do the traditional approach of classification first,” he says. “We developed this new identity-centric approach that allows us to figure out all the relevant data within a data set, and then that makes it easier to take a 600 PB problem and reduce it to a 1 PB problem. And then we dig in and classify it.”

BigID, which is headquartered in New York and Israel, has raised $46.1 million in venture capital funding in Series A and B rounds, and is growing rapidly. Today it announced plans to expand its sales presence in Europe, Asia, and Latin America in response to growing demand.
