Three Ways Biased Data Can Ruin Your ML Models
Machine learning provides a powerful way to automate decision making, but the algorithms don’t always get it right. When things go wrong, it’s often the machine learning model that gets the blame. But more often than not, it’s the data itself that’s biased, not the algorithm or the model.
That’s been the experience of Cheryl Martin, Ph.D., who worked as an applied research scientist at the University of Texas, Austin and NASA for 14 years before joining the AI crowdsourcing outfit Alegion as its chief data scientist earlier this year.
“You often hear that the algorithm is biased, or the machine learning is algorithmically biased,” Martin tells Datanami. “Although that can be true, and there is bias in algorithms, most of the problems you have with bias…are related to the data and not the model or the algorithm.”
Here are the three types of data-based bias in machine learning that Martin says data scientists should be most worried about:
1. Sample Bias
Sample bias occurs when the distribution of one’s training data doesn’t reflect the actual environment that the machine learning model will be running in. Martin uses the training of the machine learning models controlling a self-driving car as an example of sample bias.
“If you’re trying to build a self-driving car, and you want it to drive at all times of day — at day and at night — and you’re only building training data based on daylight video, then that’s a bias that’s in your data,” she says. “The humans helping to build the training data for the algorithm can be completely correct, and have no bias. And yet the data is still biased because you didn’t include any nighttime examples. It’s bias to the day time.”
It’s a data scientist’s job to make sure the sample they’re building on matches the environment it’s going to be deployed in. Doing that well takes time in the saddle. “We use our experience in building lots of data sets to work with the customers to help identify potential source of that kind of sample bias,” she says.
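One rough way to check for the kind of mismatch Martin describes is to compare how often each condition appears in the training data against how often it is expected in deployment. The sketch below is illustrative only: the function names, the 10-point tolerance, and the 60/40 day/night split are assumptions made for the example, not figures from Martin or Alegion.

```python
from collections import Counter


def category_shares(values):
    """Return the fraction of examples that fall in each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def coverage_gaps(train_values, target_shares, tolerance=0.10):
    """Find categories whose training share is far from the expected share.

    A category is reported when the absolute difference between its share
    of the training data and its expected share in deployment exceeds
    `tolerance`. Returns {category: (observed_share, expected_share)}.
    """
    train_shares = category_shares(train_values)
    gaps = {}
    for category, expected in target_shares.items():
        observed = train_shares.get(category, 0.0)
        if abs(observed - expected) > tolerance:
            gaps[category] = (observed, expected)
    return gaps


# Hypothetical training set: every clip was recorded in daylight.
train_time_of_day = ["day"] * 1000

# Hypothetical estimate of the conditions the car will actually drive in.
expected_in_deployment = {"day": 0.6, "night": 0.4}

gaps = coverage_gaps(train_time_of_day, expected_in_deployment)
for category, (observed, expected) in gaps.items():
    print(f"{category}: {observed:.0%} of training data, {expected:.0%} expected")
```

A check like this only catches gaps along dimensions someone thought to record, which is why the experience Martin mentions still matters.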
2. Prejudice or Stereotype Bias
Even if a data scientist gets a good representative sample of data to train her model, she can still be bitten by the second type of bias lurking in the weeds, which is prejudicial or stereotyping bias. This bias can be tough to account for, but that doesn’t minimize its potential for unwanted distortion in predictive models.
To illustrate this bias, Martin uses the real-world example of a machine learning model that’s designed to differentiate between men and women in pictures. When the training data contains more pictures of women in kitchens than men in kitchens, or more pictures of men writing computer code than women writing computer code, then the algorithm is trained to make incorrect inferences about the gender of people engaged in those activities.
“That’s not because you sampled the data wrong or took a subset that’s incorrect of the data. It’s just that’s the causal conclusion that might be interpreted,” Martin says. “This shows that, as you build a machine learning model, it is simply a mathematical model of what is similar about the things that you’re trying to group and what is dissimilar about the things you’re trying to distinguish.”
Data scientists must control for this type of bias. There are a variety of ways to do this, either on the front-end of the project or on the back-end. Martin says the data scientist could choose to undersample the number of pictures of women in the kitchen, or oversample the number of men in the kitchen.
“You can also control for that by creating other features in the data or having a secondary [filter],” she says. “There’s lots of ways to control the input data, or do post-processing on the output. What technique you use may work equally well, but the trick is understanding that your distribution of your sample reflects something that you don’t want in your output.”
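The undersampling and oversampling options Martin mentions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of Alegion's tooling: the `rebalance` helper, the dictionary record format, and the 80/20 split of kitchen photos are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def rebalance(examples, key, mode="undersample", seed=0):
    """Return a copy of `examples` with equal-sized groups.

    mode="undersample": randomly drop examples until every group is as
                        small as the smallest group.
    mode="oversample":  randomly repeat examples until every group is as
                        large as the largest group.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for example in examples:
        groups[key(example)].append(example)

    sizes = [len(members) for members in groups.values()]
    target = min(sizes) if mode == "undersample" else max(sizes)

    balanced = []
    for members in groups.values():
        if len(members) >= target:
            # Draw `target` examples without replacement.
            balanced.extend(rng.sample(members, target))
        else:
            # Keep every original, then pad with random repeats.
            balanced.extend(members)
            balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical kitchen photos: 80 show women, 20 show men.
kitchen_photos = (
    [{"setting": "kitchen", "gender": "woman"} for _ in range(80)]
    + [{"setting": "kitchen", "gender": "man"} for _ in range(20)]
)

under = rebalance(kitchen_photos, key=lambda p: p["gender"], mode="undersample")
over = rebalance(kitchen_photos, key=lambda p: p["gender"], mode="oversample")

print(Counter(p["gender"] for p in under))
print(Counter(p["gender"] for p in over))
```

Undersampling throws data away and oversampling repeats the same few images, so neither is free. Which one fits depends on how much data there is to begin with.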
3. Systematic Value Distortion
Another source of bias in the data is systematic value distortion, which most often occurs when there’s a problem with the device making a measurement or an observation. This type of bias can skew the machine learning results in a particular direction.
“Imagine if your training data’s camera had a property that filters colors in some way. But the other cameras that you might have in your environment…are more accurate,” Martin says. “So if you have a systematic distortion of your color scheme from your measurement device, that can cause a bias in your data that will affect your output.”
If the problem is a general lack of precision in the data-gathering device, and an abundance of noise in the data, then it might average out over time, Martin says. But if the measurements are consistently skewed in one direction all the time, then it can wreak havoc with the data used for training the model, and ultimately generate a bad result.
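The difference between noise that averages out and a consistent skew can be seen in a small simulation. The numbers here are invented for illustration: a true value of 100, zero-mean Gaussian noise with a standard deviation of 5, and a hypothetical device that reads 3 units high.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

true_value = 100.0
n_readings = 10_000

# Device A: imprecise but unbiased. Errors are zero-mean random noise.
noisy = [true_value + rng.gauss(0, 5) for _ in range(n_readings)]

# Device B: just as noisy, but also reads 3 units high on every reading.
skewed = [true_value + 3.0 + rng.gauss(0, 5) for _ in range(n_readings)]

noisy_error = mean(noisy) - true_value
skewed_error = mean(skewed) - true_value

print(f"average error, noisy device:  {noisy_error:+.2f}")
print(f"average error, skewed device: {skewed_error:+.2f}")
```

With enough readings, the first average lands close to zero while the second stays close to the 3-unit offset. Collecting more data from the skewed device does not help, because the error points the same way every time.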
Tackling Bias in Data
Bias often results from the selection of the data itself, rather than an error with labeling the data, Martin says. A data scientist must take extra care to handle these real-world data sets to ensure that the bias doesn’t skew the results of the machine learning model.
“The way we address bias,” she says, “is by looking at the data and understanding how an algorithm might be deployed and what the target environment is, and doing a match between looking at the characteristics of that environment and the data that we might be labeling.”
That’s no easy task, and it’s not something they teach in school. Rather, it’s something that data scientists must learn through experience, which can come through working on real-world data problems at a university (as Martin did with contract work at the U-T) or through learning on the job.
Recognizing and adapting to these three types of data bias is something that data scientists and machine learning practitioners have a hard time dealing with until they have a certain amount of real-world experience, Martin says. That’s because schools often only teach their students about bias in the models and the math, and not about bias that may be present in the data itself.
“When you take a machine learning class or you’re learning about it for the first time in school, you learn that bias in machine learning is a property of the algorithm, and it’s related to how tightly or how flexibly the math can fit the model,” she says. “The purpose of getting the degree is to learn the skills, and then finding the practical experience is critically important.”
But biased data can bite even the most experienced data scientist in the bum. The only way to thwart biased data is through constant vigilance, Martin says. “You can be aware of these types of bias but you don’t necessarily know the dimensions, how much you have to characterize your data, to cover the entire scope that would be relevant to your environment.”
Experience is critical, but it’s not enough. “It does take some experience and it takes intuition and insight into the domain,” she continues. “However, it’s a moving target. Often the production environment is the real world, and that rarely stays the same. You can always miss something – even a domain expert misses things occasionally and gets a surprising result.”
That’s why it’s so important to take an iterative approach to data science, to always test one’s models, and ultimately have a human in the loop.
“You should always continue to push on them with different types of data and see how they behave,” Martin says. “And you should watch them in production, and have an exception handling piece where, if you have a low-confidence result or are in new territory that hasn’t been addressed in the past, you can route those to human decision-makers.”
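The exception-handling piece Martin describes could look something like the following. It is a minimal sketch: the `route_prediction` function, the 0.8 confidence threshold, and the `seen_before` flag are assumptions made for the example, and a real system would need its own way of scoring confidence and detecting unfamiliar inputs.

```python
def route_prediction(label, confidence, seen_before=True, threshold=0.8):
    """Decide whether a model output can be used automatically.

    The prediction goes to a human reviewer when the model's confidence is
    below `threshold`, or when the input is flagged as unlike anything the
    model has handled before (`seen_before=False`).
    """
    if confidence < threshold or not seen_before:
        return ("human_review", label)
    return ("automatic", label)


# Hypothetical model outputs: (label, confidence, seen_before)
outputs = [
    ("pedestrian", 0.97, True),   # confident and familiar
    ("pedestrian", 0.55, True),   # low confidence
    ("cyclist", 0.91, False),     # confident, but new territory
]

for label, confidence, seen_before in outputs:
    route, _ = route_prediction(label, confidence, seen_before)
    print(f"{label} ({confidence:.2f}) -> {route}")
```

The cases sent to human reviewers are also natural candidates for new training data, which ties back to the iterative approach described above.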