February 5, 2015

Will Big Metadata Rat You Out?

Data scientists are usually taught to be cautious with personally identifiable information (PII) and to take pains to ensure that it’s properly anonymized and aggregated for authorized uses. But as MIT researchers recently showed, it’s quite possible to “reidentify” a person by analyzing credit card metadata even after it has been anonymized.

Writing in the journal Science, four MIT researchers demonstrated a method that can be used to identify individuals by finding links among relatively coarse-grained credit card metadata. While the researchers used credit card data, we’re told that just about any large data set with the appropriate level of granularity would be sufficient.

“We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90 percent of individuals,” the researchers write.

The likelihood of reidentification from previously anonymized data, which the researchers termed “unicity,” was determined mathematically by using semantic triples, or “tuples.” When enough tuples are gathered to positively identify a person with a certain level of accuracy, that person’s identity has effectively been teased out of the data.
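To make the idea of unicity concrete, here is a minimal sketch of how it might be estimated by random sampling. It is not the researchers’ code: the function name estimate_unicity, the (shop, date) representation of a point, and the sampling procedure are all assumptions made for illustration. The sketch repeatedly picks a person, draws p of that person’s points at random, and checks whether anybody else in the data set shares all of them.

```python
import random


def estimate_unicity(traces, p, trials=1000, seed=0):
    """Estimate the fraction of people uniquely pinned down by p known points.

    traces: dict mapping a user id to a set of (shop, date) tuples.
    This is an illustrative Monte Carlo estimate, not the paper's procedure.
    """
    rng = random.Random(seed)
    # Only people with at least p recorded points can be sampled from.
    eligible = [user for user, points in traces.items() if len(points) >= p]
    if not eligible:
        raise ValueError("no trace contains at least p points")

    unique_hits = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        # Draw p points an outside observer might happen to know.
        known = set(rng.sample(sorted(traces[user]), p))
        # Count every trace that is consistent with those points.
        matches = sum(1 for points in traces.values() if known <= points)
        if matches == 1:
            unique_hits += 1
    return unique_hits / trials
```

A value near 1.0 means almost everybody in the data set can be singled out with only p points, which is the situation the researchers report for p = 4.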

The researchers demonstrate this approach by searching for somebody known as “Scott” within a simply anonymized credit card data set. “Simply anonymized” means there are no obvious identifiers like names or account numbers or addresses, and the data hasn’t been scrambled beyond recognition.

“We know two points about Scott: he went to the bakery on 23 September and to the restaurant on 24 September,” the researchers write. “Searching through the data set reveals that there is one and only one person in the entire data set who went to these two places on these two days.” By connecting the dots between space and time, the researchers can identify not only these two transactions but all of Scott’s other transactions over the time period, including how much he spent.
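The “Scott” search can be sketched in a few lines. The records below are invented toy data, and the names RECORDS, match_known_points, and transactions_of are hypothetical, chosen only for this illustration. The point is that once the known points match a single pseudonymous ID, every other transaction under that ID is exposed too.

```python
from collections import defaultdict

# Invented toy data: (pseudonymous id, shop, date, price). No names or account numbers.
RECORDS = [
    ("u1", "bakery", "09-23", 5.10),
    ("u1", "restaurant", "09-24", 42.00),
    ("u1", "pharmacy", "09-25", 12.30),
    ("u2", "bakery", "09-23", 3.20),
    ("u2", "cinema", "09-24", 11.00),
    ("u3", "restaurant", "09-24", 35.50),
    ("u3", "bookstore", "09-26", 18.75),
]


def match_known_points(records, known_points):
    """Return the ids whose traces contain every known (shop, date) point."""
    traces = defaultdict(set)
    for user, shop, date, _price in records:
        traces[user].add((shop, date))
    known = set(known_points)
    return sorted(user for user, points in traces.items() if known <= points)


def transactions_of(records, user):
    """Return every (shop, date, price) recorded under one id."""
    return [(shop, date, price) for u, shop, date, price in records if u == user]


# What an observer knows about "Scott": two places on two days.
candidates = match_known_points(
    RECORDS, [("bakery", "09-23"), ("restaurant", "09-24")]
)
if len(candidates) == 1:
    # A single match reveals the whole trace, prices included.
    for row in transactions_of(RECORDS, candidates[0]):
        print(row)
```

In this toy data set, the bakery visit alone matches two people; adding the restaurant visit narrows it to one.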

Researchers showed that you can be “reidentified” from anonymized data by linking as few as four data points that include space and time metadata.

The value of a transaction makes a particularly good identifier under this approach. “We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average,” the researchers write. “Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata.”

The findings should be a wake-up call to regulators, credit card companies, data brokers, and individuals who are concerned about privacy in a rapidly evolving digital landscape. As they’re currently spelled out, US and UK regulations would fail to stop a rogue data scientist or cybercriminal organization from copying the MIT researchers’ techniques to obtain PII from publicly available data sets.

The researchers show that even if a data set contains nothing that falls under the legal definition of PII (things like names, addresses, and Social Security numbers), somebody with the right tools and training may still be able to reverse-engineer that information out of it.

This is potentially a blow to the “open data” initiatives that have cropped up over the past few years, such as the Boston transportation authority’s practice of publicly releasing the real-time position of all public rail vehicles, and the Orange Group’s releasing of large samples of mobile phone data through its Data for Development program, which the researchers cited in their article.

“Making these data sets broadly available, therefore, requires solid quantitative guarantees on the risk of reidentification,” the researchers write. “A data set’s lack of names, home addresses, phone numbers, or other obvious identifiers… does not make it anonymous nor safe to release to the public and to third parties.”
