Reproducibility in Data Analytics Under Fire in Stanford Report
Armed with the same data and told to test the same hypotheses, dozens of independent research teams came to widely different conclusions using a variety of analytics techniques, according to a new report from Stanford University that pushes the reproducibility crisis in science into a new realm.
The study involved 70 independent research teams from around the world, who were all presented with the same data: functional magnetic resonance imaging (fMRI) scans of volunteers’ brains while they performed a monetary decision-making task.
The teams were told to test nine different hypotheses, but weren’t told how to do it. So each team devised its own methods for preparing the fMRI data for analysis, in addition to performing the actual analysis, which demanded a yes/no answer for whether the brain was activated for specific tasks.
“Right out of the gate, teams modeled the hypothesis tests in differing ways,” writes Adam Hadhazy in a story that appeared last week in Stanford News. “The teams also used different kinds of software packages for data analysis. Preprocessing steps and techniques likewise varied from team to team.”
The groups also set different thresholds for when parts of the brain were “activated” or not, which was a very important piece of data for the analysis, Hadhazy writes. “The teams could not even always agree on how to define anatomical regions of interest in the brain when applying statistical analysis,” he writes.
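To see how a threshold choice alone can flip a yes/no answer, consider the following minimal sketch. It is purely illustrative and is not the method used by any team in the study: the z-scores, the team names, the thresholds, and the `region_activated` helper are all hypothetical assumptions invented for this example.

```python
# Hypothetical z-scores for ten voxels in one brain region.
# These numbers are made up for illustration; they are not study data.
z_scores = [1.8, 2.1, 2.4, 2.7, 2.9, 3.0, 3.2, 3.4, 1.2, 0.9]


def region_activated(scores, z_threshold, min_fraction):
    """Return True if at least min_fraction of voxels exceed z_threshold.

    This decision rule is an assumption chosen for simplicity, not a
    standard fMRI procedure.
    """
    above = sum(1 for z in scores if z > z_threshold)
    return above / len(scores) >= min_fraction


# Hypothetical analysis choices for three imaginary teams.
team_choices = {
    "team_a": {"z_threshold": 2.3, "min_fraction": 0.5},
    "team_b": {"z_threshold": 3.1, "min_fraction": 0.5},
    "team_c": {"z_threshold": 2.3, "min_fraction": 0.7},
}

# Same data, different choices, one yes/no answer per team.
answers = {
    team: region_activated(z_scores, **choice)
    for team, choice in team_choices.items()
}

for team, activated in answers.items():
    print(team, "yes" if activated else "no")
```

In this toy setup, one imaginary team reports the region as activated and the other two do not, even though all three start from identical numbers. Real fMRI pipelines involve many more choices than these two parameters, so the sketch understates the room for divergence.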
The researchers ultimately came up with different answers for five out of the nine hypotheses. That’s a significant result that casts doubt on the ability of researchers to reproduce experimental results, a key tenet of the scientific method.
“The processing you have to go through from raw data to a result with fMRI is really complicated,” said paper co-senior author Russell Poldrack, according to Stanford News. “There are a lot of choices you have to make at each place in the analysis workflow.”
This, of course, is the same type of challenge that analytics and AI teams face in non-academic commercial settings. Defining terms and metrics, and agreeing to a “single version of the truth” for each fact or variable, have been serious challenges since the earliest days of data warehousing and business intelligence, and they remain a significant issue today.
In non-scientific analytic settings, the data preparation phase often consumes 70% or more of the data scientist’s time. Instead of devising novel models or algorithmic approaches, the data scientist is playing data engineer and spending her time writing extract, transform, and load (ETL) scripts.
Scientific rigor is usually considered to demand impartiality and empirical thinking. But as the Stanford study shows, human judgement, with all its biases, still plays an outsize role in the process. The proof, as they say, is in the pudding.
“The main concerning takeaway from our study is that, given exactly the same data and the same hypotheses, different teams of researchers came to very different conclusions,” Poldrack told Stanford News. “We think that any field with similarly complex data and methods would show similar variability in analyses done side-by-side of the same dataset.”