August 3, 2015

The Role of Bias In Big Data: A Slippery Slope

Ted Dunning

When most people hear the word “bias” they think of gender or racial discrimination or other situations where preconceptions lead to bad decisions.

These kinds of biases and bad decisions are common. Tests have shown that blinding judges to gender makes a big difference in orchestra auditions [1], inserting perceptions of gender affects hiring for university jobs [2] and having a name that sounds African American costs an applicant the equivalent of 8 years of experience [3].

People are often not aware of the extent to which their decisions are affected by these biases.  In the case of orchestra auditions, judges claim to be making decisions based purely on ability, but they make different decisions when they cannot see a musician. In the case of university hiring, female professors showed as much bias as male professors although they are presumably aware of and opposed to discrimination against women in the workplace. Clearly, bias can cause problems.

The bad news is that some level of bias appears to be unavoidable in human decisions.

But the surprising thing is that some form of bias has been shown to be absolutely necessary for machine learning to work at all [4]. Similarly, some forms of bias appear innate and in many cases very useful in human reasoning. Why is this apparent contradiction possible?

Based on results like these, it seems like the right course of action is to work to expunge all kinds of bias. The real problem, however, is not bias per se. Instead, the problem is uncontrolled and unrecognized bias that does not match reality. A bias against hiring flatworms as violinists is probably just fine (though some cellists might disagree). So the real question is how we can make our bias (our assumptions about reality) more explicit and, importantly, how we can make sure that we are able to change those assumptions.

Bias in Machine Learning

You have to have bias to train a model. If you start with no assumptions, then all possible universes are equally likely, including an infinite number of completely nonsensical ones. It’s not feasible to start that way and successfully train a model. Instead, you give your model a smaller collection of possibilities from which to choose: possibilities that are reasonably likely to occur in reality. In other words, you give your model hints about what is possible and what is plausible. A simple example is that an internet ad, no matter how well designed or targeted, will not get a 90% click-through rate, as much as we might wish it could. It’s technically not impossible, but realistically, it just isn’t going to happen. Injecting a bias into the model that says that the click-through rate on an ad is likely to be in the range from 0.1% to 5% is a way to teach the model what your general experience is. Without that, the model wastes too much time and too much training data learning to eliminate model parameters that are far outside what is reasonable.
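A minimal sketch of how such a hint might be injected, assuming a Beta prior over the click-through rate. The specific prior, Beta(2, 198) with a mean of 1%, and the function names are illustrative assumptions, not something the example above prescribes. The point is only that a handful of observations no longer produces an absurd estimate.

```python
# Hypothetical prior: Beta(2, 198) has a mean of 1% and puts most of its
# mass on click-through rates of a few percent or less.
PRIOR_CLICKS = 2.0
PRIOR_NON_CLICKS = 198.0


def raw_ctr(clicks, impressions):
    """Estimate with no assumptions: the observed frequency."""
    return clicks / impressions


def posterior_mean_ctr(clicks, impressions, a=PRIOR_CLICKS, b=PRIOR_NON_CLICKS):
    """Posterior mean of the rate under a Beta(a, b) prior and binomial data."""
    return (a + clicks) / (a + b + impressions)


if __name__ == "__main__":
    # One click in three impressions: the raw estimate is about 33%,
    # while the prior keeps the estimate near 1.5%.
    print(raw_ctr(1, 3), posterior_mean_ctr(1, 3))
    # With plenty of data the prior matters little.
    print(raw_ctr(300, 10_000), posterior_mean_ctr(300, 10_000))
```

Because the prior is just two numbers, it can be inspected, argued about, and replaced if experience says otherwise.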

So some assumptions (bias) are required to do anything at all. The question then becomes one of whether you should have your assumptions baked into the learning algorithm or have a way of expressing them mathematically and explicitly.

Make Bias Explicit and Controlled

There are many ways to build machine learning models, and one way in which these methods differ is by how they inject assumptions.

In some techniques, the assumptions are implicit and difficult to modify. The data scientist may not even realize what types of assumptions or bias are inherent in the technique, or be able to adjust the assumptions in light of new evidence. This type of bias is one to avoid.

In contrast, Bayesian inference is the branch of the theory of probability that deals with how to incorporate explicit assumptions into the process of learning from data. Using Bayesian techniques allows you to express what you know about the problem you are solving, that is, to be explicit and controlled in injecting bias. You know what bias you are dealing with, and if you realize your assumptions are incorrect, you can change them and start learning again.
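For readers who prefer symbols, Bayes' rule can be written as follows. The notation is assumed here for illustration: \theta stands for the model parameters and D for the observed data.

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}
```

The prior p(\theta) is the place where assumptions are written down explicitly. Changing an assumption means changing p(\theta) and recomputing the posterior p(\theta \mid D) from the same data.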

When You Have Eliminated the Impossible, Never Say Never

One way to avoid misleading outcomes in machine learning due to mishandling of bias is to put soft limits on the assumptions you inject. All of the encoded assumptions in your model should express your experience but also allow for the possibility that a particular case is extremely rare yet still very slightly possible.  Instead of using an absolute statement of impossibility, allow for surprises in the assumptions you make. Thus, if you want to say that a parameter should be between 1 and 5%, it is good practice to allow for the possibility that the parameter is more extreme than expected and could lie outside that expected range. We might instead say that the parameter will very probably be in the range from 1 to 5%, but that there is a small chance that it is outside that range. Making that small chance non-zero helps avoid a meltdown in the learning algorithm when your assumptions turn out to be wrong.
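The difference between a hard and a soft limit can be sketched with a simple grid approximation. Everything specific here is an assumption made for illustration: the 1% to 5% range, the 1% of prior mass reserved for surprises, the grid size, and the simulated counts of 800 clicks in 10,000 impressions (a true rate near 8%, outside the expected range).

```python
import math


def hard_prior(p, lo=0.01, hi=0.05):
    """Absolute statement: rates outside [lo, hi] are impossible."""
    return 1.0 if lo <= p <= hi else 0.0


def soft_prior(p, lo=0.01, hi=0.05, surprise=0.01):
    """Most mass inside [lo, hi], plus a small uniform share over all of (0, 1)."""
    inside = (1.0 - surprise) / (hi - lo) if lo <= p <= hi else 0.0
    return inside + surprise


def posterior_mean(prior, clicks, impressions, n=2000):
    """Posterior mean of the rate on a midpoint grid, computed in log space."""
    grid = [(i + 0.5) / n for i in range(n)]
    log_w = []
    for p in grid:
        density = prior(p)
        if density <= 0.0:
            log_w.append(-math.inf)  # ruled out by the prior
        else:
            log_w.append(
                math.log(density)
                + clicks * math.log(p)
                + (impressions - clicks) * math.log(1.0 - p)
            )
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


if __name__ == "__main__":
    print("hard:", posterior_mean(hard_prior, 800, 10_000))
    print("soft:", posterior_mean(soft_prior, 800, 10_000))
```

With the hard prior the estimate can never leave the allowed range, no matter how much evidence accumulates. With the soft prior the small non-zero mass outside the range lets the data pull the estimate toward the observed rate.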

Reserve a touch of skepticism

A key best practice in dealing with bias is to admit to yourself that your model could be radically wrong or the world may have changed. Keep watching for that possibility and be ready to change your assumptions, and maybe your model, if you find evidence that your results are substantially inaccurate.

How would you recognize that situation? By monitoring how well reality (events as they occur) actually matches what your model predicts, you can continually check on the correctness of your assumptions and thus how well you’ve chosen and controlled bias. In other words, don’t just give your model (and yourself) a passing grade when you deploy it and then never look back. When you see performance change from previous levels or relative to other similar models, you should start to suspect a systematic error in your assumptions and start experimenting with weaker or different assumptions.
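One simple way such a check might look, assuming the model predicts a single click-through rate and that clicks in a monitoring window are roughly binomial. The function names, the normal approximation, and the threshold of 3 standard deviations are all illustrative choices, not a prescribed method.

```python
import math


def drift_z_score(predicted_rate, clicks, impressions):
    """How many standard deviations the observed clicks sit from the prediction.

    Uses the normal approximation to the binomial, which is an assumption
    that works best when the expected click count is not tiny.
    """
    expected = predicted_rate * impressions
    std = math.sqrt(impressions * predicted_rate * (1.0 - predicted_rate))
    return (clicks - expected) / std


def needs_review(predicted_rate, clicks, impressions, threshold=3.0):
    """Flag a window whose outcome is surprisingly far from the prediction."""
    return abs(drift_z_score(predicted_rate, clicks, impressions)) > threshold


if __name__ == "__main__":
    # Model predicts a 2% rate; compare two hypothetical monitoring windows.
    print(needs_review(0.02, 205, 10_000))  # close to the 200 expected clicks
    print(needs_review(0.02, 290, 10_000))  # far above the 200 expected clicks
```

A flagged window is a prompt to look at the assumptions, not proof that they are wrong. Run over many windows, occasional flags are expected by chance.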

Data scientists need to have a healthy dose of skepticism. This doesn’t mean losing confidence in the value of their analyses or the outcomes of their models, but it means staying alert and being ready to adjust. Just because a model appeared to work once does not mean that it will continue working as the context changes. It must be monitored. Most of the time when a model degrades over time, small changes in the model assumptions can restore previous performance. Occasionally larger structural changes in the way the world works will require that a model be rethought at a deeper level.

Knowing What You Don’t Know

Even when your model does appear to be behaving well, you should still be aware of what you don’t know and put accurate error bounds on your results. Good work is not perfect, so the truly capable data scientist is always aware of the limits of what is known. This practice of being careful about what you do not know is an important step in preventing yourself (and your models) from overreaching in a way that would undermine the value of your conclusions.
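As one illustration of putting error bounds on a result, the sketch below computes a Wilson score interval for an observed rate. This is a standard frequentist interval chosen here for simplicity, not a technique the article names. The function name and the default of roughly 95% coverage (z = 1.96) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
        / denom
    )
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # The same observed 3% rate, with ten times more data in the second case.
    print(wilson_interval(30, 1_000))
    print(wilson_interval(300, 10_000))
```

Reporting the interval alongside the point estimate makes it visible how much of a result rests on data and how much is still uncertain.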

The Human Element

Don’t expect data scientists to be superhumanly unbiased. Data scientists come with biases baked in, just like any human does. Data scientists will express the negative aspects of their biases by clinging to favorite models or techniques longer than is warranted or by inserting possibly erroneous assumptions into algorithms.

This means that data scientists need to be on the alert to detect cases where their bias migrates from wisdom born of experience into oblivious pig-headedness. There is surprisingly little distance between these two, so it is very important to keep a close eye on the possibility that things have come unglued. External reviews are helpful, as is continuous checking of predictions against actual observations.

The Value of Prediction

Reality should always have the last word. The proof of any model is how well it predicts events that are happening right now. This process, often called now-casting or retro-diction, can allow the performance of a model to be continuously assessed and re-assessed.


People and mathematical models all suffer from (and benefit from) bias that comes in many forms. This bias can be the cause of great problems or make it possible to achieve great things. Being unaware of bias is one of the main ways it can cause serious problems. This danger applies to the bias that we as humans carry into our own decision making or the bias that may be hidden in some of the algorithms we choose. On the other hand, conscious and controlled bias in the form of intentional assumptions is a necessary and desirable part of effective machine learning. It’s necessary to limit the options a machine learning model works with, and to do this well, you should inject assumptions that are carefully considered. Furthermore, in order to represent reality as accurately as possible, you should continue to monitor and evaluate the outcome of machine learning models over time. And in all cases, be aware of what you don’t know as well as what you do.


[1] Claudia Goldin and Cecilia Rouse. 1997. Orchestrating Impartiality: The Effect of “Blind” Auditions on Female Musicians.  NBER Working Paper #5903. 

[2] Corinne A. Moss-Racusin, John F. Dovidio, Victoria L. Brescoll, Mark J. Graham and Jo Handelsman. 2012. Science faculty’s subtle gender biases favor male students. PNAS 2012 109 (41) 16474-16479.

[3] Bertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, 94(4): 991-1013.

[4] Wolpert, D.H. and Macready, W.G. 1997. “No free lunch theorems for optimization.” IEEE Transactions on Evolutionary Computation, 1(1): 67-82.


About the author: Ted Dunning is Chief Application Architect at MapR Technologies and committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock) and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield.

Related Items:

Big Data’s Small Lie – The Limitation of Sampling and Approximation in Big Data Analysis

How Machine Learning Is Eating the Software World

Big Data Outliers: Friend or Foe?