Yes, You Can Do AI Without Sacrificing Privacy
In general, the more data you have, the better your machine learning model is going to be. But stockpiling vast amounts of data also carries certain privacy, security, and regulatory risks. With new privacy-preserving techniques, however, data scientists can move forward with their AI projects without putting privacy at risk.
To get the lowdown on privacy-preserving machine learning (PPML), we talked to Intel’s Casimir Wierzynski, a senior director in the office of the CTO in the company’s AI Platforms Group. Wierzynski leads Intel’s research efforts to “identify, synthesize, and incubate” emerging technologies for AI.
According to Wierzynski, Intel is offering several techniques that data science practitioners can use to preserve private data while still benefiting from machine learning. What’s more, data science teams don’t have to make major sacrifices in terms of performance or accuracy of the models, he said.
It sometimes sounds too good to be true, Wierzynski admits. “When I describe some of these new techniques that we’re making available to developers, on their face, they’re like, really? You can do that?” he said. “That sounds kind of magical.”
But it’s not magic. In fact, the three PPML techniques that Wierzynski explained to Datanami (federated learning, homomorphic encryption, and differential privacy) are all available today.
Data scientists have long known about the advantages of combining multiple data sets into one massive collection. By pooling the data together, it’s easier to spot new correlations, and machine learning models can be built to take advantage of the novel connections.
But pooling large amounts of data into a data lake carries its own risks, including the possibility of the data falling into the wrong hands. There are also the logistical hassles of ETL-ing large amounts of data around, which opens the data up to further security lapses. For those reasons, some organizations deem creating large pools of sensitive data too risky.
With federated learning, data scientists can build and train machine learning models on data that remains physically stored in separate silos, which eliminates the risk of bringing all the data together. This is an important breakthrough for data sets that organizations previously could not pool.
“One of the things that we’re trying to enable with these privacy-preserving ML techniques is to unlock these data silos, to make use of data sources that previously couldn’t be pooled together,” Wierzynski said. “Now it’s OK to do that, but still preserve the underlying privacy and security.”
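The core idea can be sketched in a few lines. The following is a minimal, illustrative sketch of federated averaging on a toy linear-regression problem, not Intel's implementation; the silo setup and all names are hypothetical. Each silo computes a gradient locally, and only those gradients (weighted by sample count) ever leave the silo, never the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical data silos that cannot be pooled. Each holds samples
# from the same underlying relationship, y = 2x + 1 plus noise.
def make_silo(n):
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + 1 + rng.normal(0, 0.1, size=n)
    return x, y

silos = [make_silo(100), make_silo(150)]

def local_gradient(w, b, x, y):
    """Gradient of the mean-squared error, computed entirely on-site."""
    err = w * x + b - y
    return err @ x / len(x), err.mean()

# Federated averaging: aggregate each silo's gradient, weighted by its
# sample count; the raw x and y values never leave their silo.
w, b, lr = 0.0, 0.0, 0.5
total = sum(len(x) for x, _ in silos)
for _ in range(200):
    grads = [local_gradient(w, b, x, y) for x, y in silos]
    gw = sum(g * len(x) for (g, _), (x, _) in zip(grads, silos)) / total
    gb = sum(g * len(x) for (_, g), (x, _) in zip(grads, silos)) / total
    w -= lr * gw
    b -= lr * gb

print(round(w, 1), round(b, 1))  # recovers roughly w=2.0, b=1.0
```

Real federated systems add secure aggregation and handle non-identically-distributed silos, but the structure is the same: ship model updates, not data.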
Intel is working with others in industry, government, and academia to develop homomorphic encryption techniques, which essentially allow sensitive data to be processed and statistical operations to be performed while it’s encrypted, thereby eliminating the need to expose the data in plain text.
“It means that you can move your sensitive data into this encrypted space, do the math in this encrypted space that you were hoping to do in the raw data space, and then when you bring the answer back to the raw data space, it’s actually the answer you would have gotten if you just stayed in that space the whole time,” he said.
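A toy example makes the property concrete. Unpadded ("textbook") RSA happens to be multiplicatively homomorphic: multiplying two ciphertexts yields the ciphertext of the product. This is emphatically not secure and is not what production homomorphic encryption schemes use; it only illustrates the "do the math in encrypted space" idea Wierzynski describes.

```python
# Textbook RSA with tiny primes -- for illustration only, never for real use.
p, q = 61, 53                        # small demo primes
n = p * q                            # public modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 6
# Multiply in "encrypted space"...
c = (encrypt(a) * encrypt(b)) % n
# ...and decrypting gives the same answer we'd get in "raw data space".
print(decrypt(c))  # 42, i.e. 7 * 6
```

Modern schemes (the kind behind libraries such as HE Transformer) support both addition and multiplication on encrypted values, which is what makes encrypted neural-network inference possible.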
Homomorphic encryption isn’t new. According to Wierzynski, the cryptographic schemes that support homomorphic encryption have been around for 15 to 20 years. But there have been a number of improvements in the last five years that enable this technique to run faster, and so it’s increasingly one of the tools that data scientists can turn to when handling sensitive data.
“One of the things my team has done specifically around homomorphic encryption is to provide open source libraries,” Wierzynski said. “One is called HE Transformer, which lets data scientists use their usual tools like TensorFlow and PyTorch and deploy their models under the hood using homomorphic encryption without having to change their code.”
There are no standards yet around homomorphic encryption, but progress is being made on that front, and Wierzynski anticipates a standard being established perhaps in the 2023-24 timeframe. The chipmaker is also working on hardware acceleration options for homomorphic encryption, which would further boost performance.
One of the stranger characteristics of machine learning models is that details of the data used to train a model can sometimes be extracted just by exercising the model itself. That’s not a big issue in some domains, but it certainly is a problem when the training set contains private information.
“You definitely want your machine learning system to learn the key trends and the core relationships,” Wierzynski said. “But you don’t want them to take that a step too far and now kind of overlearn in some sense and learn aspects of the data that are very idiosyncratic and specific to one person, which can then be teased out by a bad person later and violate privacy.”
For example, say a text prediction algorithm was developed to accelerate typing on a mobile phone. The system should be smart enough to be able to predict the next word with some level of accuracy, but it should not return a value when a phrase like “Bob’s Social Security number is….” is typed in. If it does that, then it’s not only learned the rules of English, “but it’s learned very specific things about individuals in the data set, and that’s too far,” Wierzynski said.
The most common way to implement differential privacy is to add some noise to the training process, or to “fuzz” the data in some way, Wierzynski said. “And if you do that in the right amount, then you are still able to extract the key relationships and obscure the idiosyncratic information, the individual data,” he continued. “You can imagine if you add a lot of noise, if you take it too far, you’ll end up obscuring the key relationships too, so the trick with these use cases is to find that sweet spot.”
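That sweet spot has a precise form in the classic Laplace mechanism, one standard way to implement differential privacy. The sketch below is illustrative, with a hypothetical data set and untuned parameters: clip each value to bound any one individual's influence on a mean, then add Laplace noise scaled to that influence divided by the privacy budget epsilon.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive data: 10,000 individual incomes.
incomes = rng.uniform(20_000, 200_000, size=10_000)

def dp_mean(data, lo, hi, epsilon):
    """Differentially private mean via the Laplace mechanism.

    Clipping each value to [lo, hi] means one individual can shift the
    mean by at most (hi - lo) / n -- the query's sensitivity. Laplace
    noise scaled to sensitivity / epsilon then obscures any single
    person's contribution while preserving the aggregate."""
    clipped = np.clip(data, lo, hi)
    sensitivity = (hi - lo) / len(data)
    return clipped.mean() + rng.laplace(0, sensitivity / epsilon)

true_mean = incomes.mean()
private_mean = dp_mean(incomes, 20_000, 200_000, epsilon=0.5)
print(abs(private_mean - true_mean))  # small relative to the ~110,000 mean
```

Smaller epsilon means more noise and stronger privacy; with enough individuals, the per-person sensitivity shrinks and the aggregate stays accurate, which is exactly the "key relationships survive, idiosyncrasies don't" trade-off described above.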
ML Data Combos
Every organization is different, and chief data officers should be ready to explore multiple privacy-preserving techniques to fit their specific use cases. “There’s no single technology that’s a silver bullet for privacy,” Wierzynski said. “It’s usually a combination of techniques.”
For example, you might want to fuzz the data a bit when utilizing federated learning techniques, Wierzynski said. “When you decentralize the learning, the machine learning model usually needs additional privacy protection just because the intermediate calculations that go between users in federated learning can actually reveal something about the model or reveal something about the underlying data,” he said.
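One common way that combination looks in practice, sketched below with illustrative (untuned) parameters: before a silo shares its gradient in a federated round, it clips the gradient's norm to bound its influence and adds noise, so the intermediate calculation reveals less about the local data. This is a generic sketch of noisy gradient sharing, not any particular vendor's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_update(local_gradient, clip=1.0, noise_scale=0.1):
    """Fuzz a silo's gradient before it leaves the silo."""
    # Clip the norm so no single silo (or record) dominates the update...
    norm = np.linalg.norm(local_gradient)
    clipped = local_gradient * min(1.0, clip / norm)
    # ...then add noise to mask what the raw gradient would reveal.
    return clipped + rng.normal(0, noise_scale, size=clipped.shape)

# Three hypothetical silos report fuzzed gradients; only the noisy,
# clipped versions are ever transmitted and averaged.
silo_gradients = [np.array([0.4, -0.2]),
                  np.array([3.0, 1.0]),
                  np.array([0.1, 0.5])]
shared = [noisy_update(g) for g in silo_gradients]
aggregate = np.mean(shared, axis=0)
print(aggregate.shape)  # (2,)
```

With noise turned off, the clipping alone guarantees every shared update has norm at most `clip`; the added noise is what delivers the extra privacy protection for the intermediate calculations.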
As data privacy laws like CCPA and GDPR proliferate, organizations will be forced to account for the privacy of their customers’ data. The threat of steep fines and public shaming for mishandling sensitive data is a strong motivator for organizations to enact strong data privacy and security standards.
But these laws also potentially have a dampening effect on advanced analytics and AI use cases. With PPML, organizations can continue to explore these powerful AI techniques while working to minimize some of the security risks associated with handling large amounts of sensitive data.