October 24, 2017

Keeping Your Models on the Straight and Narrow

Alex Woodie

Organizations of all stripes have adopted predictive models to help drive decision-making. But an increased reliance on machine learning can quickly turn into an expensive liability if bias, privacy, and other concerns aren’t adequately addressed. With laws like GDPR looming, it’s becoming critical to keep predictive models on the straight and narrow.

Big data analytics is a double-edged sword. It’s tough to argue against that. On the one hand, data illuminates hard-to-see patterns that help businesses be more profitable, and it helps people live happier lives. We explore the possibilities of data analytics every day here in the virtual pages of Datanami.

But there’s a dark side to big data analytics too, one that finds consumers’ sensitive data leaking out of databases, and people’s hidden biases creeping into models. With GDPR set to take effect in less than eight months, the anything-goes, Wild-West days of firms doing whatever they can with data are about to end. Regulators are sharpening their pencils in anticipation of the need to police extensive violations of Europeans’ data by American firms. The time to get ahead of the curve was probably 2016. But, as they say, better late than never.

One software company at the forefront of the coming GDPR big data smackdown is Immuta, a data governance startup based in College Park, Maryland. Immuta was founded by Matthew Carroll, who previously advised US intelligence leaders on data management and analytics issues while working as the CTO of Computer Sciences Corporation’s Defense and Intelligence group (CSC merged with HP Enterprise Services to create DXC Technology in April).

Carroll will be the first to tell you that striving to completely eliminate bias is a fool’s errand. “Bias is inherent in all data science,” Carroll says at the start of an interview with Datanami. “You can’t remove pure bias out of data, otherwise you probably wouldn’t have any statistically cogent decisions to make.”

But just as the perfect is the enemy of the good, there are definitely improvements that data-driven organizations can make to reduce the impact of bias in their models. Instead of closing one’s eyes and hoping it all turns out well, Carroll suggests we approach the problem with our eyes wide open.

“The broader concern here is understanding when we’re at risk,” he says. “It’s understanding when you’re going to enter too much bias to the point where you’re going to make unethical or illegal decisions on data.”

Immuta this week will announce a partnership with machine learning automation firm DataRobot that will make it easier for customers to combat creeping bias while simultaneously stemming the seemingly uninterrupted flow of people’s private data into hackers’ hands.

DataRobot already provides some protections against bias by automating many of the tasks that data scientists would normally handle in the process of building, testing, and deploying machine learning models into production.

“We’re paranoid about overfitting models,” says DataRobot COO Chris Devaney. “We’ve built best practices into the framework so you’re not relying on an individual to detect something. The automation can help do that.”

Matthew Carroll, CEO and co-founder of Immuta

DataRobot decided to work with Immuta to take those protections to the next level. The two companies are set to announce that they’ve integrated their respective products, with Immuta’s software essentially filtering data before it gets into the DataRobot environment. The goal is to assure users that potentially biased or regulated data is stopped in its tracks before it gets into algorithms and models.

Andrew Gilman, Immuta’s chief customer officer, says the integration has the potential to save DataRobot customers months’ worth of time while also lowering the risk of models going awry or running afoul of regulations due to undetected use of biased data.

“Prior to Immuta coming in, they had to copy, move, and prepare each dataset differently based on the user and their entity, making sure that the internal policies were applied and the external laws related to the data were being enforced,” Gilman says. “It could take five months on a per analyst basis to get the data to the data scientists and connect to a tool to be able to go and run analytics.”

That timeframe will shrink considerably by leveraging the Immuta tool, which was designed to be lawyer-friendly. As opposed to configuring data rules in esoteric terms, Immuta’s software lets lawyers set restrictions in a language they can understand, and then automatically executes them at a granular level. “The gating factor now,” Gilman says, “is connecting to the data, controlling the data, and making sure that the compliance and regulation and everything is being audited as it goes through the system.”

In addition to masking and filtering data, Immuta can be used to provide “differential privacy” of data that guarantees anonymity, even as the data and the models change over time. “We offer all these techniques to lawyers so they don’t have to implement any code and they can enforce the policies as they see fit, rather than asking a data scientist or a data engineer to do it,” Carroll says. “The policies can actually be directly enforced on the metadata.”
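To make the idea of differential privacy concrete, here is a minimal sketch of one textbook technique, the Laplace mechanism, applied to a simple count. It is an illustration only and says nothing about how Immuta implements the feature; the function name `private_count` and the `epsilon` parameter are assumptions made for this example.

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching predicate.

    Illustrative Laplace mechanism (not Immuta's implementation).
    A smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()

    true_count = sum(1 for record in records if predicate(record))

    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    scale = sensitivity / epsilon

    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    people = [{"age": age} for age in range(100)]
    print(private_count(people, lambda p: p["age"] >= 50, epsilon=0.5))
```

Each call returns a slightly different answer, so no single query reveals whether any one person is in the data, while averages over many people stay useful.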

The Immuta solution can also deploy automated rules that prevent bias from creeping into the data in the first place. For example, a data scientist working on an insurance pricing model would mask the field that describes somebody’s race so that it never enters the model. There are federal laws forbidding insurers from factoring race into insurance offers. (People’s ages, however, are allowed.)
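In code, masking a protected field before training can be as simple as the sketch below. The `mask_fields` helper and the sample column names are hypothetical, chosen for illustration; they are not part of Immuta’s or DataRobot’s products.

```python
def mask_fields(rows, protected):
    """Return copies of rows with the protected fields removed.

    Hypothetical helper: the input rows are left untouched so the
    unmasked data can stay behind whatever access controls apply.
    """
    protected = set(protected)
    return [
        {key: value for key, value in row.items() if key not in protected}
        for row in rows
    ]


if __name__ == "__main__":
    applicants = [
        {"age": 34, "zip": "20740", "race": "A", "claims": 1},
        {"age": 51, "zip": "20742", "race": "B", "claims": 0},
    ]
    # Age is kept because it is a permitted pricing factor; race is dropped.
    print(mask_fields(applicants, protected=["race"]))
```

Dropping the column only removes the explicit field. Other fields, such as a ZIP code, can still act as a stand-in for it.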

However, the model may begin behaving in such a manner that it’s trying to infer race anyway. In that situation, Immuta can inject enough random noise into the model to steer it away from trying to make decisions based on race.

“As it gets closer and closer to potentially getting to the point where it’s inferring those things, we’ll add more and more noise,” Carroll says. “So the answer gets further and further statistically away from the actual correct answer, and so we can actually act like an adversary to the model in that case.”
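Carroll does not describe the mechanics, but the general idea of adding more noise as a feature gets closer to revealing a protected attribute can be sketched as follows. Everything here is an assumption made for illustration: Pearson correlation stands in for “how close the model is to inferring race,” and the `threshold` and `max_scale` parameters are invented for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def noise_scale(correlation, threshold=0.3, max_scale=1.0):
    """No noise below the threshold, then a linear ramp up to max_scale."""
    risk = abs(correlation)
    if risk <= threshold:
        return 0.0
    return max_scale * (risk - threshold) / (1.0 - threshold)


def add_adaptive_noise(feature, protected, rng=None, threshold=0.3, max_scale=1.0):
    """Blur a feature more heavily the more it tracks the protected attribute."""
    rng = rng or random.Random()
    scale = noise_scale(pearson(feature, protected), threshold, max_scale)
    mean = sum(feature) / len(feature)
    std = math.sqrt(sum((x - mean) ** 2 for x in feature) / len(feature))
    # Gaussian noise sized relative to the feature's own spread.
    return [x + rng.gauss(0.0, scale * std) for x in feature]


if __name__ == "__main__":
    proxy = [0.0, 0.1, 0.9, 1.0]  # closely tracks the protected attribute
    group = [0, 0, 1, 1]          # protected attribute, numerically encoded
    print(add_adaptive_noise(proxy, group, rng=random.Random(1)))
```

A feature that is unrelated to the protected attribute passes through unchanged, while a near-perfect proxy is blurred heavily, which costs accuracy in exchange for making the inference harder.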

Preventing such “link attacks” is a common task for data scientists with deep mathematical and statistical backgrounds. But knowing what to look for, and how to prevent them from occurring, are not necessarily things that data analysts and other less-technical DataRobot customers are expected to know.

“Before we had all these techniques, you had to manually implement it yourself, which would have required a very strong math background,” Carroll says. “What we’ve done is made it like Photoshop. We now have you click the filter button and it just works. We basically made clickable filters, depending [on] the feature, to prevent people from misusing the data or prevent link attacks, where they could link to another aspect of the data to create bias.”

DataRobot’s Devaney says the Immuta solution will give DataRobot customers in healthcare, financial services, and federal agencies the confidence to continue to innovate with data and to explore new ways analytics can benefit their organizations.

“It’s what’s being demanded now in our highly secured customers in federal government, banking, healthcare, insurance,” he says. “This is the new bar. It’s not just authentication and authorization. It’s all of these other things … to give them confidence as they apply these [models] in the most sensitive areas of the business.”

Related Items:

Who Controls Our Algorithmic Future?

DataRobot Delivers an ML Automation Boost for Evariant

Don’t Be a Big Data Snooper

