Why Relying on Data Hygienists Is Not a Realistic Way to Ensure Bias-Free AI
There has been a lot of buzz lately around the new career paths that will crop up as artificial intelligence (AI) systems become more commonplace. In fact, Gartner predicts that, though AI will automate the jobs of 1.8 million people by 2020, AI will create 2.3 million jobs by that same year—a net increase of 500,000 jobs.
This is to be expected; AI will introduce numerous new priorities and requirements that we can only begin to speculate about now. These positions could include personality trainers for AI and algorithms forensics analysts, but one in particular that intrigues me is a data hygienist.
This is still a relatively new term—even Google wants to redirect you to results for dental hygienists. So what is it? Data hygienists will ensure that the data used to train AI systems is “free from any slanted perspective.” It requires someone to monitor and adjust the data that the machine is fed, so it processes only the purest, most accurate data possible. This sounds promising, but there is frankly no way that humans are capable of erasing their own biases to be useful to a machine.
At first glance, this seems like a reasonable role to rely on as the use of AI systems spreads within organizations. By nature, machines are objective. They take data in, process it, and determine the best course of action. In this respect, data quality is important, because the machine is only as good as the data that it has been provided. It makes sense to want to introduce a human element that can confirm that the data being used is the best available and is free of errors.
My point arises when we start to consider the issue of bias. The chances that a human—or even a machine for that matter—could operate with no bias whatsoever are basically zero.
In my office, I have a poster that displays the 188 known cognitive biases. I refer to this poster often to remind myself that we all may react a certain way or have motivations that are driven by biases that we may not even be aware that we have. These go beyond the racial, gender, or age biases that we are all so familiar with. They range from the backfire effect—where someone doubles down on their beliefs when they’re confronted with information that contradicts them—to anchoring, which is the tendency we all have to rely too heavily on one piece of information—usually the first piece that we encounter—when making a decision. What are the chances that a data hygienist will assess data without using even one of those 188 biases?
First, there’s no way a data hygienist will ever be completely free of human biases because there are simply too many that are so ingrained in the way we make decisions or process information. Second, it’s impossible for humans to interact with an AI system in a way that doesn’t pass some of our own biases on to the machine, regardless of the models we train or algorithms we build, test, and retest. The algorithm was created by someone who, by definition, has their own biases, which are present in models—even if they were never intended to be there.
Instead of saying that a data hygienist will help us ensure that AI is free of bias or that AI is free of bias in its own right, we need to instead be aware of the sheer number of biases that exist. We need to recognize that AI systems are built and tested by people who unintentionally teach them to respond to something in the same way that our own biases lead us to respond.
In the end, data is data. It’s tempting to say that it is untarnished by the biases that dictate how we interact with each other and how we process information, but this ignores the subtle ways in which biases work. With so many, it is impossible to get rid of them all. Rather, we all need to work to minimize the effects of our biases and to understand that, however careful we may be, our AI systems and our data will bear the mark of our human inclinations.
About the author: Roman Stanek is the CEO of GoodData, which he founded in 2007 with the mission to disrupt the business intelligence space and monetize big data. Prior to GoodData, Roman was founder and CEO of NetBeans, the leading Java development environment (acquired by Sun Microsystems in 1999) and Systinet, a leading SOA governance platform (acquired by Mercury Interactive, later Hewlett Packard, in 2006).