Berkeley Lab Makes Strides in Autonomous Discovery to Tackle the Data Deluge
Data production is outpacing the human capacity to process it. Whether the source is a giant radio telescope, a new particle accelerator or the lidar sensors on autonomous cars, the sheer scale of the data generated is increasingly leaving massive stores untapped as researchers scramble to acquire computational resources and develop algorithms to exploit the treasure troves of information. Now, researchers at Lawrence Berkeley National Laboratory have made strides in a field called “autonomous discovery,” which uses algorithms to decide what to investigate in a dataset with little human involvement.
Autonomous discovery has grown more prevalent over the past few years. One of the more prominent approaches relies on Gaussian process regression, a Bayesian method well suited to small datasets: it fits a probabilistic model to a small portion of the data and uses that model’s predictions, together with their uncertainties, to decide what to examine next. “In contrast to deep learning, stochastic processes can be used to make decisions based on relatively small datasets, and they provide uncertainty estimates which can optimize the learning process,” said Marcus Noack, a research scientist at CAMERA and lead author of the new paper, in an interview with Berkeley Lab’s Kathy Kincade.
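To make the idea concrete, here is a minimal sketch of uncertainty-driven measurement selection with Gaussian process regression, written in plain NumPy. It is not gpCAM's API or the authors' implementation. The squared-exponential kernel, the fixed length scale and noise level, the `instrument` function standing in for a real experiment, and the rule of always measuring where the posterior variance is largest are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2, signal_var=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    # Solve (K + noise * I) alpha = y using the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = rbf_kernel(x_query, x_query).diagonal() - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off


def next_measurement(x_train, y_train, candidates):
    """Pick the candidate where the model is least certain."""
    _, var = gp_posterior(x_train, y_train, candidates)
    return candidates[np.argmax(var)]


def instrument(x):
    """Hypothetical stand-in for a real measurement."""
    return np.sin(2 * np.pi * x)


def run_demo(n_steps=10):
    """Start from two measurements and add n_steps more, one at a time."""
    candidates = np.linspace(0.0, 1.0, 101)
    x = np.array([0.0, 1.0])
    y = instrument(x)
    for _ in range(n_steps):
        x_new = next_measurement(x, y, candidates)
        x = np.append(x, x_new)
        y = np.append(y, instrument(x_new))
    return x, y
```

Calling `run_demo()` returns the measured positions and values in the order they were chosen. Each new point is placed where the model's uncertainty is highest, which is one simple way uncertainty estimates can guide data acquisition; real tools typically also tune the kernel parameters and may use other selection criteria.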
Berkeley Lab researchers in the Center for Advanced Mathematics for Energy Research Applications (CAMERA) applied Gaussian process regression to develop a tool called gpCAM. At CAMERA, researchers have been using gpCAM for synchrotron beamline experiments, but lately its use has been expanding into other areas. “More and more experimental fields are taking advantage of this new optimal and autonomous data acquisition because, when it comes down to it, it’s always about approximating some function, given noisy data,” Noack said.
One of those new areas is materials science; gpCAM is being used by researchers in Berkeley Lab’s Molecular Foundry to help understand the properties of thin-film semiconductors. “Nanoscale applications that make use of artificial intelligence and machine learning algorithms, specifically for scanning probe systems, have been an interest … for some time,” said John Thomas, a postdoctoral research fellow at the Foundry. “We became interested in using Gaussian processes toward autonomous discovery in the summer of 2020.”
Elsewhere, researchers are using gpCAM to investigate DNA self-assembly. “DNA nanotechnology in the pursuit of self-assembling functional material often suffers from a limited ability to sample the large parameter space for synthesis,” explained Aaron Michelson, a graduate researcher at Columbia University. “Either this requires a large volume of data to be collected or a more efficient solution to experimentation. Autonomous discovery can be directly incorporated in both mining large datasets and guiding new experiments. This allows the researcher to steer away from mindlessly making more samples and puts us in the driver’s seat to make decisions.”
And, the researchers say, this is just the beginning: gpCAM has applications ranging from environmental studies to drug discovery.
“Noack’s work and leadership have brought together a broad, interdisciplinary co-design community,” said James Sethian, director of CAMERA and a co-author on the paper. “This sort of scientific community building is at the heart of what CAMERA tries to do.”