Dude, Where’s My Database? And Other GDPR Questions
Is your organization ready for the General Data Protection Regulation (GDPR), which goes into effect in exactly one month? If you’re like most organizations, you still have a lot of work to do over the next 30 days – especially if you have entire databases on the lam.
Only 46% of global companies report that they will be compliant with GDPR when it goes into effect on May 25, according to a new survey of 183 business people sponsored by SAS. The survey, which the analytics giant released today, found that 53% of companies in the European Union reported being ready for the GDPR, while 30% of businesses in the United States reported being ready.
How close are companies cutting it? Well, the SAS survey says that only 7% of the companies would have been ready for the GDPR if it had gone into effect in February, which is when the survey was taken. Clearly, there’s quite a bit of work to do.
Identifying sources of stored personal data is the top challenge organizations face in preparing for GDPR, according to the SAS survey, followed by acquiring the skills required to be GDPR compliant. Each violation of the GDPR can be punished with fines of up to 4% of a company’s annual global revenue, which means companies have a big incentive to ensure they’re good stewards of personal data, even if they’re having trouble finding it.
The challenge is that the typical company today has data spread out across many different repositories, which makes figuring out exactly where that data resides a formidable obstacle. One company that’s helping with this challenge is Alation, which develops data cataloging software used by eBay, Square, and Pfizer.
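To put the size of that incentive in concrete terms, here is a minimal sketch of the arithmetic. It assumes the ceiling set out in Article 83(5) of the regulation, which is the higher of €20 million or 4% of worldwide annual turnover; the €20 million figure comes from the regulation itself, not from the SAS survey, and the function name and sample turnovers are illustrative only.

```python
# Ceiling for the most serious infringements under GDPR Article 83(5):
# the higher of EUR 20 million or 4% of worldwide annual turnover.
FLAT_CAP_EUR = 20_000_000
TURNOVER_PERCENT = 4


def max_gdpr_fine(annual_turnover_eur: float) -> float:
    """Return the maximum possible fine, in euros, for a given turnover."""
    if annual_turnover_eur < 0:
        raise ValueError("turnover must be non-negative")
    return max(FLAT_CAP_EUR, annual_turnover_eur * TURNOVER_PERCENT / 100)


if __name__ == "__main__":
    # Illustrative turnovers only, not figures from any real company.
    for turnover in (50_000_000, 2_000_000_000):
        cap = max_gdpr_fine(turnover)
        print(f"turnover {turnover:,} EUR -> maximum fine {cap:,.0f} EUR")
```

This is an upper bound, not a prediction: regulators set actual penalties case by case.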
Alation CEO Satyen Sangani says the company’s goal is eventually to tag and track every piece of data as it transits through an organization. That would have been a lot easier when most of a company’s data resided in a relational database and was accessed by professional ETL engineers and data analysts. But thanks to the big data and self-service phenomena, the source of data and the user are much more diffuse than they used to be.
“How do I actually discover this data first and then how do I tag it?” Sangani tells Datanami. “Sometimes it’s inside of a file system, sometimes it’s inside of a dashboard, sometimes it’s inside of a table… You have GDPR use cases where people are saying, ‘Look, I have a ton of PII [personally identifiable information]. I don’t even know what information exists.'”
Sometimes companies have PII spread across geographic regions, and there’s often duplicate data. Rationalizing all of this data can be extremely difficult, but it’s a step that’s required before the company can move on to the next stages of GDPR compliance: establishing a lineage of consents from everybody whose data it stores, and even creating customer 360 views.
The challenge is so great that some companies have resorted to forming data hunting teams. Alation recently worked with one such team in the insurance industry. “They call them data hunters,” Sangani says. “Their entire job is to figure out what data sets are going to allow them to solve analytical problems and price policies and price insurance better.”
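The discover-then-tag problem Sangani describes can be illustrated with a toy example. The sketch below is not how Alation's product works; it simply assumes a sample of values has already been pulled from each column, and it tags a column when most of those values match a deliberately simplistic pattern. The pattern set, the `tag_columns` helper, and the 80% threshold are all hypothetical.

```python
import re

# Hypothetical, deliberately simple patterns. Real PII detection needs
# many more categories and far more careful matching.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def tag_columns(samples, threshold=0.8):
    """Map each column name to the PII tags whose pattern matches at least
    `threshold` of that column's non-empty sampled values."""
    tags = {}
    for column, values in samples.items():
        cleaned = [v.strip() for v in values if v and v.strip()]
        if not cleaned:
            continue  # nothing to judge this column by
        for tag, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in cleaned if pattern.match(v))
            if hits / len(cleaned) >= threshold:
                tags.setdefault(column, []).append(tag)
    return tags


if __name__ == "__main__":
    # Made-up sample values standing in for rows read from a table or file.
    samples = {
        "contact": ["ann@example.com", "bob@example.org", ""],
        "mobile": ["+44 20 7946 0958", "555-867-5309"],
        "notes": ["called twice", "prefers email"],
    }
    print(tag_columns(samples))
```

Matching on sampled values rather than column names matters here, because PII often hides in fields whose names give no hint of what they contain.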
Crouching Tiger, Hidden Data
Another company actively involved in GDPR remediation projects is Waterline Data. The company’s data catalog is used by companies like Commerzbank and GlaxoSmithKline, and can track data wherever it sits, including Hadoop, relational databases, object file systems, and other locations.
Just figuring out where data exists is a major challenge for many organizations today, says Mohan Sadashiva, SVP Product at Waterline Data. However, it didn’t use to be that way.
“We used to have very effective data governance practice in the form of DBAs,” says Sadashiva, referring to database administrators. “The past few years with more open source and more Hadoop…going on, we lost those controls. We no longer have the same gatekeepers who were managing the storage infrastructure to impose order. And so order stopped happening, and the pendulum has swung to the other side.”
Not only have organizations lost track of data over the last few years, according to Sadashiva – they’ve lost track of entire databases.
Sadashiva says he has had conversations with customers who suddenly need to expand their Waterline Data licenses to cover more databases. One Waterline customer increased the number of databases it was cataloging from 4,000 to 40,000 over two years, he told Datanami during the recent Strata Data Conference.
“It wasn’t because 36,000 databases were created during that time,” Sadashiva says. “No, they found them. They were running packet sniffers on the network looking for SQL traffic and finding database servers. Someone went and created one under the desk and were running part of their business on it. Who knew?!”
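The discovery technique Sadashiva describes, watching the network for SQL traffic, can be approximated in a few lines. The sketch below is not the customer's tooling. It assumes the capture tool has already exported flow records as (source IP, destination IP, destination port) tuples, a hypothetical format, and it flags destinations listening on the default ports of four common SQL databases that are missing from a catalog. A port match is only a hint: databases on non-default ports are missed, and other services can occupy these ports.

```python
# Default listening ports of some widely used SQL databases.
SQL_PORTS = {
    1433: "Microsoft SQL Server",
    1521: "Oracle",
    3306: "MySQL",
    5432: "PostgreSQL",
}


def find_uncataloged_servers(flows, cataloged_hosts):
    """Return {host: sorted database names} for hosts that received traffic
    on a known SQL port but do not appear in `cataloged_hosts`.

    `flows` is an iterable of (src_ip, dst_ip, dst_port) tuples, a
    hypothetical export format from whatever capture tool is in use.
    """
    found = {}
    for _src, dst, port in flows:
        if port in SQL_PORTS and dst not in cataloged_hosts:
            found.setdefault(dst, set()).add(SQL_PORTS[port])
    return {host: sorted(names) for host, names in found.items()}


if __name__ == "__main__":
    # Made-up addresses for illustration only.
    flows = [
        ("10.0.0.5", "10.0.0.20", 5432),  # already cataloged
        ("10.0.0.6", "10.0.0.21", 3306),  # the server under the desk
        ("10.0.0.5", "10.0.0.21", 3306),
        ("10.0.0.7", "10.0.0.30", 443),   # not a SQL port
    ]
    print(find_uncataloged_servers(flows, cataloged_hosts={"10.0.0.20"}))
```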