February 19, 2019

Machine Learning for Science Proving Problematic

Scientists are raising red flags about the accuracy and reproducibility of conclusions drawn by machine learning frameworks. Among the remedies are developing new ML systems that can question their own predictions, show their work and reproduce results.

Concerns about the efficacy of early machine learning predictions in disciplines ranging from astronomy to medicine were noted by Rice University statistician Genevera Allen during last week’s annual meeting of the American Association for the Advancement of Science.

Allen said brittle ML frameworks are often flawed because they are designed to always produce some kind of prediction, without accounting for scientific uncertainties. “A lot of these techniques are designed to always make a prediction,” Allen said. “They never come back with ‘I don’t know,’ or ‘I didn’t discover anything,’ because they aren’t made to.”
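To make the idea of a model that can answer “I don’t know” concrete, here is a minimal Python sketch of a prediction step with a reject option. It is an illustration only, not a method Allen described: the function name `predict_or_abstain`, the example labels and the 0.8 confidence threshold are all assumptions chosen for the example.

```python
from typing import Optional, Sequence


def predict_or_abstain(
    probabilities: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.8,  # assumed cutoff; a real study would calibrate this
) -> Optional[str]:
    """Return the most probable label, or None when the model is not confident."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("probabilities and labels must be non-empty and equal in length")

    # Find the index of the highest class probability.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])

    # Abstain instead of forcing a prediction when confidence is too low.
    if probabilities[best] < threshold:
        return None
    return labels[best]


# Hypothetical usage: a confident case and an ambiguous case.
confident = predict_or_abstain([0.93, 0.07], ["subtype A", "subtype B"])
ambiguous = predict_or_abstain([0.55, 0.45], ["subtype A", "subtype B"])
print(confident)                 # a label
print(ambiguous or "I don't know")
```

The threshold only helps if the probabilities themselves are well calibrated, which is a separate question this sketch does not address.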

Allen, an associate professor of statistics, computer science and electrical and computer engineering at Rice, questions whether scientific discoveries based on the application of machine learning techniques to large data sets can be trusted. “The answer in many situations is probably, ‘Not without checking,’” Allen said.

Her conclusions underscore a fundamental problem facing machine intelligence: data scientists still do not understand the mechanisms by which machines learn. Developing next-generation systems whose ML models ground their predictions in an understanding of the analyzed data could go a long way toward addressing current uncertainties.

Allen, who is also affiliated with the Baylor College of Medicine, said the application of machine learning to precision medicine illustrates the problem of “uncorroborated data-driven discoveries.” In one application, machine learning was applied to genomic data to find groups of patients with similar profiles. That information can then be used, for example, to develop drugs targeted to the specific genome of a disease.

“But there are cases where discoveries aren’t reproducible,” Allen noted. “The clusters discovered in one study are completely different than the clusters found in another.”

One reason is that rigid ML frameworks are designed to spot all possible groups. “Sometimes, it would be far more useful if they said, ‘I think some of these are really grouped together, but I’m uncertain about these others,’” Allen said.
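One simple way to check whether the clusters from two studies agree is to ask, for every pair of patients, whether both studies put the pair in the same group or both put them in different groups. The Python sketch below computes that fraction, known as the Rand index. It is offered as an illustration rather than as the check Allen’s group uses, and the patient groupings in the usage lines are made up.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Fraction of item pairs on which two clusterings agree (1.0 = identical grouping)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items to compare clusterings")

    # A pair counts as agreement when both clusterings group it together,
    # or both keep it apart. Cluster names themselves do not matter.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster assignments for four patients in two studies.
study_1 = [0, 0, 1, 1]
same_groups_renamed = [1, 1, 0, 0]   # same partition, different cluster names
different_groups = [0, 1, 0, 1]      # a conflicting partition

print(rand_index(study_1, same_groups_renamed))
print(rand_index(study_1, different_groups))
```

A low score between studies is a signal that the discovered groups may not be reproducible. In practice a chance-corrected variant such as the adjusted Rand index is often preferred, because the raw index can look high even for unrelated clusterings.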

Allen is the founding director of Rice University’s Center for Transforming Data to Knowledge. Her research focuses on multivariate analysis, graphical models, statistical machine learning and data integration, with a specialization in statistical methods for big data derived from genomics and neuroimaging.

“There is general recognition of a reproducibility crisis in science right now,” Allen told the BBC. “I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.”
