Googlers Speak Out on the Scourge of ML Underspecification
A few days ago, 40 authors (all but a handful hailing from Google) published a 59-page paper. The topic at hand: why so many machine learning models that perform well in internal testing go on to fail spectacularly in real-world applications. The answer, the Googlers say, is underspecification, a blight on machine learning that, they stress, requires substantive solutions.
“An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain,” they write. In plain language: an underspecified pipeline can arrive at many different, comparably accurate explanations for why a dataset looks the way it does. The problem arises when researchers assume that all of those explanations are equally valid based solely on held-out results from the training domain, without accounting for real-world factors that the training process never captured. In those situations, the authors say, “ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains.”
By way of illustration, the Googlers highlight examples spanning “computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics.” In epidemiology, for instance, they discuss how early data from an epidemic (such as the COVID-19 pandemic) is easily explained by a variety of models that do not substantively account for major factors – such as the gradually diminishing number of susceptible people in an area as an epidemic infects (and then renders immune) larger and larger portions of the populace.
“Importantly, during the early stages of an epidemic … the parameters of the model are underspecified by this training task,” they write. “This is because, at this stage, the number of susceptible is approximately constant at the total population size (N), and the number of infections grows approximately exponentially.”
As a result, they say, “arbitrary choices in the learning process” determine which parameters are deemed most predictive by the model, despite different models predicting “peak infection numbers, for example, that are orders of magnitude apart.”
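To make the epidemiology example concrete, the sketch below simulates a standard SIR (susceptible, infected, recovered) model under two parameter settings that share the same early growth rate, beta minus gamma. It is an illustration under assumed values, not the paper's code: the function name simulate_sir, the population size, the step size and both (beta, gamma) pairs are hypothetical choices made for this example.

```python
def simulate_sir(beta, gamma, population=1_000_000, initial_infected=10,
                 days=400, dt=0.1):
    """Integrate the SIR equations with a simple forward-Euler scheme.

    Returns a list with the number of infected people at the start of each day.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    steps_per_day = round(1 / dt)
    daily_infected = []
    for _ in range(days):
        daily_infected.append(infected)
        for _ in range(steps_per_day):
            new_infections = beta * susceptible * infected / population * dt
            recoveries = gamma * infected * dt
            susceptible -= new_infections
            infected += new_infections - recoveries
    return daily_infected


# Hypothetical parameter pairs: both have beta - gamma = 0.1, so early
# exponential growth is nearly identical, but beta / gamma differs.
infected_a = simulate_sir(beta=0.2, gamma=0.1)
infected_b = simulate_sir(beta=1.1, gamma=1.0)

for day in (0, 7, 14):
    print(f"day {day:2d}: A={infected_a[day]:.1f}  B={infected_b[day]:.1f}")
print(f"peak infected: A={max(infected_a):.0f}  B={max(infected_b):.0f}")
```

While almost everyone is still susceptible, the early curve pins down only the difference between the infection and recovery rates, so both settings track each other closely for the first two weeks. Their peaks, which depend on the ratio of the two rates, differ by more than a factor of ten.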
“We have seen that underspecification is ubiquitous in practical machine learning pipelines across many domains,” the researchers write. “Indeed, thanks to underspecification, substantively important aspects of the decisions are determined by arbitrary choices such as the random seed used for parameter initialization.”
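The role of the random seed can be seen in a deliberately tiny setting. The sketch below is not taken from the paper; it assumes a toy linear model with two input features that are identical in the training domain, so many weight combinations fit the training data equally well. The helper names make_data, train and mse, the seeds and the data sizes are all hypothetical.

```python
import random


def make_data(n, rng, correlated):
    """The target depends only on x1. In the training domain x2 copies x1;
    under the shift, x2 is drawn independently."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = x1 if correlated else rng.gauss(0, 1)
        rows.append((x1, x2, x1))  # (feature 1, feature 2, target)
    return rows


def train(rows, seed, lr=0.05, epochs=200):
    """Full-batch gradient descent on squared error from a seeded random start."""
    rng = random.Random(seed)
    w1, w2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        g1 = g2 = 0.0
        for x1, x2, y in rows:
            err = w1 * x1 + w2 * x2 - y
            g1 += err * x1
            g2 += err * x2
        w1 -= lr * g1 / len(rows)
        w2 -= lr * g2 / len(rows)
    return w1, w2


def mse(weights, rows):
    w1, w2 = weights
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for x1, x2, y in rows) / len(rows)


data_rng = random.Random(123)
train_rows = make_data(200, data_rng, correlated=True)
held_out_rows = make_data(200, data_rng, correlated=True)
shifted_rows = make_data(200, data_rng, correlated=False)

results = {}
for seed in range(5):
    weights = train(train_rows, seed)
    results[seed] = (mse(weights, held_out_rows), mse(weights, shifted_rows))
    print(f"seed={seed}  held-out MSE={results[seed][0]:.2e}  "
          f"shifted MSE={results[seed][1]:.3f}")
```

Every seed reaches essentially zero error on held-out data from the training domain, because only the sum of the two weights matters there. How that sum is split between the features is left over from initialization, and it decides how badly each predictor fails once the two features stop moving together.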
So, the question remains: how should researchers address underspecification in the model design process?
“Our findings underscore the need to thoroughly test models on application-specific tasks, and in particular to check that the performance on these tasks is stable,” they write. In fact, they say, the “extreme complexity” of modern ML models makes it more or less certain that most models will be underspecified, and researchers must ensure that the inevitable underspecification “does not jeopardize the inductive biases that are required by an application.”
The authors say that the best approach to addressing widespread underspecification will involve designing domain-specific stress tests that accurately represent the challenges a model will face in the real world.
“For example, within the medical risk prediction domain, the dimensions that a model is required to generalize across (e.g., temporal, demographic, operational, etc.) will depend on the details of the deployment and the goals of the practitioners,” they elaborate. “For this reason, developing best practices for building stress tests that crisply represent requirements, rather than standardizing on specific benchmarks, may be an effective approach.”
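One way to picture a stress test that "crisply represents requirements" is as a named set of examples paired with explicit pass criteria, evaluated across several predictors that came out of the same pipeline. The sketch below is an assumption about how such a harness might look, not something the paper prescribes: the StressTest class, the run_stress_tests function, the thresholds and the toy predictors are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StressTest:
    name: str                  # condition being probed, e.g. a demographic slice
    examples: Sequence[tuple]  # (features, label) pairs drawn from that condition
    min_accuracy: float        # worst acceptable accuracy for any predictor
    max_spread: float          # largest acceptable gap between predictors


def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)


def run_stress_tests(predictors, tests):
    """Score every predictor on every test and report worst case and spread."""
    report = {}
    for test in tests:
        scores = [accuracy(p, test.examples) for p in predictors]
        worst, spread = min(scores), max(scores) - min(scores)
        report[test.name] = {
            "worst": worst,
            "spread": spread,
            "passed": worst >= test.min_accuracy and spread <= test.max_spread,
        }
    return report


# Two toy predictors that agree whenever both features share a sign.
predictors = [lambda x: x[0] > 0, lambda x: x[1] > 0]

in_domain = [((1, 1), True), ((-1, -1), False), ((2, 3), True), ((-2, -1), False)]
shifted = [((1, -1), True), ((-1, 1), False), ((2, -3), True), ((-2, 1), False)]

report = run_stress_tests(predictors, [
    StressTest("in-domain", in_domain, min_accuracy=0.9, max_spread=0.05),
    StressTest("shifted", shifted, min_accuracy=0.9, max_spread=0.05),
])
for name, outcome in report.items():
    print(name, outcome)
```

Checking the spread between predictors, and not only the average score, is the part that targets underspecification: two predictors can be indistinguishable in the training domain and still disagree completely on a condition the application cares about.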
About the paper
The paper, titled “Underspecification Presents Challenges for Credibility in Modern Machine Learning,” is publicly accessible.