AI Bias Problem Needs More Academic Rigor, Less Hype
Let’s face it: We’re infatuated with AI. From smart chatbots to image recognition to self-driving cars, we’re absolutely enamored with the superpower-like abilities it gives us. But unless we incorporate stronger processes to identify and remediate biased data and biased algorithms, experts say, we run the risk of automating bad decisions at a truly ghoulish scale.
We’re getting way out over our skis with AI, according to Patrick Hall, the principal scientist at the AI-focused law firm bnh.ai and a visiting professor at George Washington University.
“Machine learning has been used in banking and national security and these different narrow sectors since before the dawn of personal computing,” Hall says. “But what’s new is it’s being deployed like bubblegum machines, and people just aren’t testing it properly. That includes specifically testing for bias, but also does the thing work?”
Hall cited the Gender Shades project as an example of the harm that poorly implemented machine learning can cause. The project, which was spearheaded by MIT Media Lab’s Joy Buolamwini and former Google data scientist Timnit Gebru (a 2021 Datanami Person to Watch), identified differences in how well facial recognition systems used by law enforcement worked for different groups of people.
“The accuracy disparity between white males and women of color was 40%,” Hall tells Datanami. “[That’s] superhuman accuracy recognition of white males, and very poor recognition accuracy of women of color.”
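The basic technique behind an audit like Gender Shades is disaggregated evaluation: instead of reporting a single overall accuracy figure, you compute accuracy separately for each subgroup and look at the gap between the best and worst. The Python sketch below assumes predictions are already available as (subgroup, predicted label, true label) tuples; the function names, group labels, and toy data are hypothetical and are not taken from the project itself.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy separately for each subgroup.

    records: iterable of (group, predicted_label, true_label) tuples.
    Returns a dict mapping group -> fraction of correct predictions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(records):
    """Return (best-minus-worst accuracy, per-group accuracies)."""
    per_group = accuracy_by_group(records)
    return max(per_group.values()) - min(per_group.values()), per_group


# Hypothetical toy data: group_a is always classified correctly,
# group_b only half the time.
records = (
    [("group_a", "label_x", "label_x")] * 4
    + [("group_b", "label_y", "label_y")] * 2
    + [("group_b", "label_x", "label_y")] * 2
)

gap, per_group = accuracy_gap(records)
print(per_group)  # {'group_a': 1.0, 'group_b': 0.5}
print(gap)        # 0.5
```

A single aggregate accuracy for this toy data would be 75%, which hides the fact that one group fares far worse than the other.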
That leads to all sorts of bad outcomes, such as arresting the wrong person, which has occurred several times with these automated systems, Hall says. Other places where biased algorithms and biased data can cause poor outcomes include employment, housing, and credit.
Business leaders in historically regulated industries, like finance, are aware of the problems with AI. But many outside of that space are clueless, Hall says.
“I would say in those regulated areas people are being thoughtful,” he says. “Outside of that, it’s the Wild West.”
As the CTO of the digital advertising company Quantcast, Peter Day oversees systems that use machine learning to decide how millions of ad dollars are spent. The predictive capabilities work quite well, he says.
“Machine learning is really nothing but ruthless optimization,” says Day, who has a PhD in machine learning from the University of Liverpool. “It will optimize far more efficiently than a human ever would, given the data it’s got.”
Machine learning can drive amazing results, and is even a requirement for business today in some ways, Day says. But many business practitioners suffer blind spots when it comes to what machine learning technology can do, and what it can’t.
“Machine learning is often seen as a magical black box, but it’s nothing more than computational statistics,” Day says. “As with any other method of statistics, you need to be aware of your data set. You need to be really quite careful of the questions you’re asking, because it is ruthless and naïve optimization.”
At Quantcast, Day makes sure the machine learning models are continually tested to confirm that they’re working correctly. He leans heavily on his academic background, as well as his experience in quantitative analysis at UBS, to keep bias at bay.
However, he doesn’t see the same attention to detail in the wider world, particularly in sectors where biased models and data carry more danger than they do in advertising. Simplified access to machine learning technology has led to it being applied more broadly.
“We’ve gone from the realms of statisticians who have a deep understanding of how these algorithms work and therefore the pitfall of them being seen as a black box, to it just works,” he says. “That worries me.”
One way out of the AI bias trap is to employ the same level of rigor that data scientists and statisticians have been trained to bring to their craft in academia and the highly regulated financial services industry. In addition to having a deep understanding of how the algorithms work, AI practitioners need to rigorously test their algorithms to ensure they are free of known bias.
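One way to make “test for known bias” concrete is to put a bias check in the same automated test suite as any other regression test. The sketch below is an illustration, not a regulatory procedure: it computes per-group approval rates and fails if the lowest rate falls below 80% of the highest, a threshold borrowed from the “four-fifths rule” in U.S. employment-selection guidelines. The function names, the BiasCheckFailed exception, and the (group, approved) data format are all assumptions made for this example.

```python
from collections import defaultdict


class BiasCheckFailed(Exception):
    """Raised when a model's decisions fail this hypothetical bias gate."""


def selection_rates(decisions):
    """decisions: iterable of (group, approved) pairs, approved being a bool."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, ok in decisions:
        total[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / total[group] for group in total}


def adverse_impact_ratio(decisions):
    """Lowest group selection rate divided by the highest."""
    rates = selection_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so there is no disparity in rates
    return min(rates.values()) / highest


def check_four_fifths(decisions, threshold=0.8):
    """Fail loudly if the adverse impact ratio is below the threshold."""
    ratio = adverse_impact_ratio(decisions)
    if ratio < threshold:
        raise BiasCheckFailed(f"adverse impact ratio {ratio:.2f} < {threshold}")
    return ratio


# Hypothetical decisions: 50% of group_a and 45% of group_b approved.
decisions = (
    [("group_a", True)] * 50 + [("group_a", False)] * 50
    + [("group_b", True)] * 45 + [("group_b", False)] * 55
)
print(round(check_four_fifths(decisions), 2))  # 0.9
```

Run in continuous integration, a check like this would fail the build if a retrained model drifted below the threshold. A ratio under 0.8 is a signal to investigate, not proof of discrimination on its own.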
“I’m an old fashioned guy. I like statistical rigor to things,” Day says. “I come from a banking background so I’m very familiar with how it’s done in banks. I actually quite like some of that approach, which came out of regulation. But some things that came out of regulation actually led to much better behavior.”
Some of the model evaluation methods that banks and other financial services companies are required to run are over a decade old, and aren’t necessarily a great fit for the deep learning approaches that are used today, according to Hall.
“If you’re in a highly regulated space, you probably need to be doing those older, more conservative tests, which is confusing because it’s all about P values, and data scientists don’t work with P values because we have a million rows of data,” he says. “So it’s a little bit confusing for data scientists. But there are existing tests with years of case law and regulation and regulatory commentary behind them.”
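Hall’s point about P values and million-row data sets can be seen with a standard two-proportion z-test. In the hypothetical example below, two groups have approval rates of 50% and 49%. With 1,000 applicants per group the one-point gap is not statistically significant, but with 1,000,000 per group the same gap yields a vanishingly small P value. That is one reason practitioners tend to look at effect sizes, such as the ratio of approval rates, alongside significance tests. This is a minimal sketch using only Python’s standard library; the counts are illustrative.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Uses the pooled proportion for the standard error and returns (z, p_value).
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same 50% vs. 49% approval rates at two hypothetical sample sizes.
for n in (1_000, 1_000_000):
    z, p = two_proportion_z_test(n // 2, n, 49 * n // 100, n)
    print(f"n={n:>9,} per group  z={z:6.2f}  p={p:.3g}")
```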
Academia is currently generating a slew of new bias-checking techniques that will work well with the new deep learning methods, Hall says, including tools like Aequitas. He also cited IBM’s AI Fairness 360 package as a good tool. However, companies in regulated areas, like banking, may run into trouble using some of these newer approaches, he says.
Another big believer in bringing academic rigor back to the field of AI is Jacopo Tagliabue, the lead scientist with Coveo, a provider of enterprise search and recommendation software.
“There’s a bunch of things that I feel academia can bring immediately to data science. One is the attention to bias,” including racial bias, Tagliabue tells Datanami. “But honestly, there’s other discussions in academia that are fairly well understood about the ethics of running experiments.”
Tagliabue helps Coveo ruthlessly optimize the algorithms that drive better search results and recommendations for e-commerce customers. A big part of that optimization effort is the use of A/B testing. Without A/B testing, it can be impossible to know exactly how a probabilistic system, such as AI, measures up against the real world.
“This is the fundamental assumption of randomized trial and experiment, and it’s valid everywhere,” Tagliabue says. “It’s the reason we can have drugs, because we test drugs with placebo and real effect. And we measure if people actually get better. So this is the standard of causal discovery, and it works for very simple mechanisms, like recommendation and clicks, or much more complex ones.”
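The randomized-trial logic Tagliabue describes can be sketched with a permutation test, one common way to analyze an A/B test. Users are randomly assigned to a control or treatment variant; if the variant truly has no effect, shuffling the variant labels should produce a difference as large as the observed one fairly often. The click data below is hypothetical, and this sketch is not a description of Coveo’s own tooling.

```python
import random


def permutation_test(control, treatment, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in mean outcome.

    control, treatment: lists of 0/1 outcomes (e.g. clicked or not) from users
    randomly assigned to each variant. Returns (observed_diff, p_value).
    """
    rng = random.Random(seed)  # seeded so the result is reproducible

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(treatment) - mean(control)
    pooled = control + treatment  # new list; the inputs are not modified
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign variant labels at random
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical experiment: 200 users per arm, 10% vs. 20% click-through.
control = [1] * 20 + [0] * 180
treatment = [1] * 40 + [0] * 160
diff, p = permutation_test(control, treatment)
print(f"lift={diff:.2f}  p={p:.4f}")
```

The test makes no assumption about the shape of the outcome distribution; it relies only on the random assignment, which is the property Tagliabue highlights.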
If a company makes changes to a machine learning model, and those changes are leading to biased outcomes for a particular group, it can be detected by running experiments against a known sample data set. Without this testing (A/B or otherwise), it can be difficult to detect the bias before it’s too late.
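A check of that kind might compare a baseline model and a candidate model on the same fixed evaluation set, slice by slice, and flag any group whose accuracy drops by more than a tolerance. In the hypothetical example below, the candidate model looks better in aggregate (80% versus 60% accuracy) while getting worse for one group. The threshold models, the data, and the 0.02 tolerance are illustrative assumptions.

```python
from collections import defaultdict


def per_group_accuracy(rows, predict):
    """rows: iterable of (group, features, label); predict: features -> label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, features, label in rows:
        total[group] += 1
        correct[group] += int(predict(features) == label)
    return {group: correct[group] / total[group] for group in total}


def overall_accuracy(rows, predict):
    return sum(predict(x) == y for _, x, y in rows) / len(rows)


def flag_group_regressions(rows, baseline, candidate, tolerance=0.02):
    """Return {group: (baseline_acc, candidate_acc)} where accuracy fell by > tolerance."""
    before = per_group_accuracy(rows, baseline)
    after = per_group_accuracy(rows, candidate)
    return {
        group: (before[group], after[group])
        for group in before
        if before[group] - after[group] > tolerance
    }


# Hypothetical models: each predicts True when one numeric feature clears a threshold.
def baseline_model(x):
    return x >= 5


def candidate_model(x):
    return x >= 3


# Hypothetical fixed evaluation set: (group, feature, true label)
rows = [
    ("group_a", 3, True), ("group_a", 3, True), ("group_a", 3, True),
    ("group_a", 4, True), ("group_a", 6, True), ("group_a", 1, False),
    ("group_b", 3, False), ("group_b", 4, False),
    ("group_b", 6, True), ("group_b", 1, False),
]

print(overall_accuracy(rows, baseline_model), overall_accuracy(rows, candidate_model))  # 0.6 0.8
print(flag_group_regressions(rows, baseline_model, candidate_model))  # {'group_b': (1.0, 0.5)}
```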
“The entire point of having an experimental culture is, since you’re doing something that’s hard (hard meaning there are unknowns), the only way to get rid of unknowns is to test in a controlled way,” Tagliabue says. “Testing is a foundational part of any AI, of any modern software that incorporates some sort of predictions. There’s no way around that.”
However, A/B tests alone are not enough to ensure that bias stays out of the models. According to Tagliabue, the data scientist must have the knowledge of how the algorithm is working and the experience to know if it’s working the way it should. In other words, putting powerful AI tools into the hands of unqualified technologists is asking for trouble.
“Running the A/B test does not dispense the data scientist from actually knowing the domain well enough to make sure that known biases and known things aren’t creeping in, because it may be there is a spurious causal relationship,” he says. “So what you interpret as a causal correlation is actually a correlation that just happens, because of something you’re not controlling for in some way. The responsibility is still part of our job, and so is understanding the different slices of data. A/B tests are the last test of the experiments.”
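The trap Tagliabue warns about, a relationship that disappears or even reverses once you control for another variable, is often illustrated with Simpson’s paradox. In the hypothetical counts below, variant A converts better than variant B within both the mobile and desktop segments, yet B looks better overall because most of B’s traffic happened to come from desktop users, who convert at a higher rate. The variants, segments, and numbers are invented for illustration.

```python
# Hypothetical counts: (variant, segment) -> (conversions, users)
counts = {
    ("A", "mobile"): (80, 800),
    ("A", "desktop"): (60, 200),
    ("B", "mobile"): (18, 200),
    ("B", "desktop"): (224, 800),
}


def aggregate_rate(counts, variant):
    """Conversion rate for a variant with all segments pooled together."""
    cells = [cell for (v, _segment), cell in counts.items() if v == variant]
    conversions = sum(c for c, _ in cells)
    users = sum(u for _, u in cells)
    return conversions / users


def segment_rate(counts, variant, segment):
    conversions, users = counts[(variant, segment)]
    return conversions / users


def simpsons_reversal(counts, a, b, segments):
    """True if `a` beats `b` in every segment but loses once segments are pooled."""
    wins_every_segment = all(
        segment_rate(counts, a, s) > segment_rate(counts, b, s) for s in segments
    )
    return wins_every_segment and aggregate_rate(counts, a) < aggregate_rate(counts, b)


segments = ("mobile", "desktop")
for s in segments:
    print(s, segment_rate(counts, "A", s), segment_rate(counts, "B", s))
print("pooled", aggregate_rate(counts, "A"), aggregate_rate(counts, "B"))
print(simpsons_reversal(counts, "A", "B", segments))  # True
```

Slicing the data by segment is what exposes the reversal. Proper randomization should balance segments across variants, but a bug in assignment or an uneven rollout can quietly break that assumption.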
The lack of AI testing has already led to tragic consequences. Take, for example, the autonomous car developed by Uber Technologies that killed an Arizona woman in March 2018. The National Transportation Safety Board concluded that the car failed to detect the woman because it wasn’t trained to identify jaywalkers, or people walking outside of crosswalks.
Had Uber tested for that scenario before road-testing the self-driving car, the death likely would not have happened, Hall says. “People are rushing to market without doing the testing and hardening, in my opinion,” he says. “I think a lot of companies are replacing rigor with hype.”