May 27, 2020

Reproducibility in Data Analytics Under Fire in Stanford Report

Alex Woodie


Armed with the same data and told to test the same hypotheses, dozens of independent research teams came to widely different conclusions using a variety of analytics techniques, according to a new report from Stanford University that pushes the reproducibility crisis in science into a new realm.

The study involved 70 independent research teams from around the world, who were all presented with the same data: functional magnetic resonance imaging (fMRI) scans of volunteers’ brains while they performed a monetary decision-making task.

The teams were told to test nine different hypotheses, but weren’t told how to do it. So each team devised its own methods for preparing the fMRI data for analysis, in addition to performing the actual analysis, which demanded a yes/no answer for whether the brain was activated for specific tasks.

“Right out of the gate, teams modeled the hypothesis tests in differing ways,” writes Adam Hadhazy in a story that appeared last week in Stanford News. “The teams also used different kinds of software packages for data analysis. Preprocessing steps and techniques likewise varied from team to team.”

The groups also set different thresholds for when parts of the brain were “activated” or not, which was a very important piece of data for the analysis, Hadhazy writes. “The teams could not even always agree on how to define anatomical regions of interest in the brain when applying statistical analysis,” he writes.
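To see how a threshold choice alone can flip a yes/no answer, consider the minimal sketch below. It is purely illustrative: the z-scores are made up, and the `region_activated` helper, the team names, the cutoff values, and the 30% rule are all assumptions, not the method used by any team in the study.

```python
# Hypothetical illustration only: made-up z-scores for ten voxels in one
# brain region, not data from the Stanford study.
z_scores = [0.5, 1.2, 1.8, 2.0, 2.4, 2.6, 2.9, 3.3, 0.9, 1.1]


def region_activated(scores, threshold, min_fraction=0.3):
    """Answer yes/no: do at least min_fraction of voxels exceed threshold?

    Both threshold and min_fraction are analyst choices (assumed here).
    """
    above = sum(1 for z in scores if z > threshold)
    return above / len(scores) >= min_fraction


# Two hypothetical teams analyze the same data with different cutoffs.
team_thresholds = {"Team A": 2.3, "Team B": 3.1}
answers = {
    team: region_activated(z_scores, cutoff)
    for team, cutoff in team_thresholds.items()
}

for team, activated in answers.items():
    print(f"{team}: activated = {activated}")
```

With these made-up numbers, the team using the looser cutoff reports activation and the team using the stricter cutoff does not, even though both started from identical data.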

The researchers ultimately came up with different answers for five of the nine hypotheses. That’s a significant result that casts doubt on the ability of researchers to reproduce experimental results, a key tenet of the scientific method.

Bigger datasets and increasingly complex workflows are making it harder for researchers to reproduce experimental results – a key part of the scientific process. (Image credit: Getty Images)

“The processing you have to go through from raw data to a result with fMRI is really complicated,” said paper co-senior author Russell Poldrack, according to Stanford News. “There are a lot of choices you have to make at each place in the analysis workflow.”

This, of course, is the same type of challenge that analytics and AI teams face in commercial settings. Defining terms and metrics, and agreeing on a “single version of the truth” for each fact or variable, have been serious challenges since the earliest days of data warehousing and business intelligence, and they remain a significant issue today.

In non-scientific analytic settings, the data preparation phase often consumes 70% or more of the data scientist’s time. Instead of devising novel models or algorithmic approaches, the data scientist is playing data engineer and spending her time writing extract, transform, and load (ETL) scripts.

Scientific rigor is usually considered to demand impartiality and empirical thinking. But as the Stanford study shows, human judgement, with all its biases, still plays an outsize role in the process. The proof, as they say, is in the pudding.

“The main concerning takeaway from our study is that, given exactly the same data and the same hypotheses, different teams of researchers came to very different conclusions,” Poldrack told Stanford News. “We think that any field with similarly complex data and methods would show similar variability in analyses done side-by-side of the same dataset.”

