Big data. Love it or hate it, solving the world’s most intractable problems requires the ability to make sense of huge and complex sets of data, and to do so quickly. Speeding up the process – from hours to minutes, or from weeks to days – is key to our success.
One major source of such big data is physical experiments. As many will know, physical experiments are commonly used to address challenges in fields such as energy security, manufacturing, medicine, pharmacology, environmental protection and national security.
Experiments use different instruments and sensor types to investigate the validity of new drugs, the root causes of diseases, more efficient energy sources, new materials for everyday goods, effective methods for environmental cleanup, the optimal combination of ingredients for chocolate, or how best to preserve valuable antiques. This is done by experimentally determining the structure, properties and processes that govern biological systems, chemical processes and materials.
The speed and quality with which we can acquire new insights from experiments directly influence the rate of scientific progress, industrial innovation and competitiveness. And gaining groundbreaking new insights, faster, is key to the economic success of our nations.
Recent years have seen incredible advances in sensor technologies, from house-size detector systems in large experiments such as the Large Hadron Collider and the ‘Eye of Gaia’ billion-pixel camera detector, to high-throughput genome sequencing. These developments have led to an exponential increase in data volumes, rates and variety produced by instruments used for experimental work.
This increase coincides with a need to analyze experimental results at the time they are collected. Faster speeds are required to optimize data collection and quality, and also to enable new adaptive experiments, in which the sample is manipulated as it is observed. For example, a substance is injected into a tissue sample and its gradual effect is observed as more of the substance is injected, providing better insights into the natural processes that are occurring. Rapid analysis also enables results-driven sampling adjustment to capture particularly interesting features as they emerge.
The Department of Energy’s Pacific Northwest National Laboratory (PNNL) is recognized for its expertise in developing new measurement techniques and applying them to challenges of national importance. Addressing the need for in-situ analysis of large-scale experimental data was therefore a natural fit for us.
PNNL has a wide range of experimental instruments on site, in facilities such as DOE’s national scientific user facility, the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL). Commonly, scientists would create an individual analysis pipeline for each of those instruments; even instruments of the same type would not necessarily share the same analysis tools.
With the rapid increase in data volumes and rates, we faced two key challenges: how to bring a wider set of capabilities to bear to achieve in-situ analysis, and how to do so across a wide range of heterogeneous instruments at affordable cost and in a reasonable timeframe. We decided to take an unconventional approach to the problem: rather than developing customized, one-off solutions for specific instruments, we wanted to explore whether a more general solution could be found that would go beyond shared, basic infrastructure such as data movement and workflow engines.
The result is REXAN, the Rapid Experimental Analysis Framework and Component Library, developed as part of PNNL’s Chemical Imaging Initiative. With REXAN, scientists have been able to create analysis tools that deliver the answers they need in near real time, and to build those tools at a much faster pace, because they can reuse core analytical components specifically designed for high data rates and volumes.
What is REXAN?
REXAN’s core is a library of components for the essential analytical functionality we scientists commonly need, such as compression, reduction, reconstruction, segmentation, registration, statistics, feature detection and visualization. In addition, the framework offers deployment capabilities such as a data-intensive workflow system (MeDICI) and a data management and analysis execution environment (Velo). REXAN’s “framework” approach enabled our team to create tools for very different experimental instruments within a short time frame.
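To make the component-library idea concrete, here is a minimal sketch of how small, reusable analysis stages can be composed into instrument-specific pipelines. The function and variable names below are purely illustrative assumptions for this sketch; they are not the actual REXAN API.

```python
# Hypothetical illustration of a component-library approach: small,
# reusable analysis stages are chained into instrument-specific tools.
# None of these names come from REXAN itself.
from typing import Callable, List

Component = Callable[[List[float]], List[float]]

def reduce_noise(data: List[float]) -> List[float]:
    """Toy reduction stage: drop readings below a fixed noise floor."""
    return [x for x in data if x >= 0.1]

def normalize(data: List[float]) -> List[float]:
    """Toy preprocessing stage: scale readings relative to the peak value."""
    peak = max(data)
    return [x / peak for x in data]

def build_pipeline(stages: List[Component]) -> Component:
    """Compose reusable components into one analysis tool."""
    def run(data: List[float]) -> List[float]:
        for stage in stages:
            data = stage(data)
        return data
    return run

# Two different "instruments" can reuse the same components,
# combined in whatever order their analysis requires.
msi_tool = build_pipeline([reduce_noise, normalize])
print(msi_tool([0.05, 0.2, 0.4, 0.8]))  # noise-filtered, then peak-scaled
```

The point of the sketch is the design choice, not the toy math: once each stage is a well-defined, self-contained component, a new instrument’s tool is mostly a matter of selecting and ordering existing stages.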
At the same time, the approach made it easy to engage a multi-disciplinary team of domain scientists, instrument experts, mathematicians and computer scientists (HPC, data management and data-intensive science) to define, modify and optimize the necessary components: each could focus on a well-defined challenge, such as optimized feature detection, segmentation or principal component analysis, rather than having to become familiar with the complete analysis process.
Because of this approach, we are not only able to analyze high-resolution mass spectrometry data in seconds, but we can also build the tools to do so within days or a few weeks.
REXAN achieves the necessary speed and scalability through a mixture of new mathematical approaches that lend themselves to parallelization and high-throughput optimization, together with optimized data structures, data streaming and high performance computing. In addition to developing new analytical components and optimizing existing ones, we wanted to allow the integration of existing, optimized solutions into the REXAN library. We did this extensively for visualization components, collaborating closely with the University of Utah’s Scientific Computing and Imaging Institute. We incorporated their advanced data streaming and visualization capabilities for tera-scale data to allow in-situ visual exploration of large-scale experimental results as they are collected. We also brought in well-known tools such as MATLAB, ImageJ and ParaView.
The value of REXAN in solving biological problems, in near real-time
Our first challenge was the creation of a real-time analysis pipeline for a new mass spectrometry imaging (MSI) technology called ‘nano-DESI’ (nanospray desorption electrospray ionization), which is under further development at PNNL. The technology enables the imaging of completely hydrated biological samples with high spatial resolution and sensitivity. Nano-DESI is particularly helpful in clinical diagnosis, drug discovery, biochemistry and molecular biology. But it produces up to 40 GB of data per experiment, and our existing analysis tools could only cope with 100 KB at a time, making it impossible to analyze data across the complete result set.
To cope with these inadequate tools our users would spend many hours, after the experiments were complete, on manual data transformation and integration tasks, analyzing and combining the small individual results into one complete picture. With increasing data volumes and data rates, the post-experimental analysis became prohibitive for their scientific work. What they needed was the ability to quickly visualize the data over the whole image as it was emerging during the experiment, to optimize data capture and provide immediate feedback.
The chemical imaging initiative team at PNNL developed a new integrated analytical tool for them, called MSI QuickView, using REXAN components. MSI QuickView enables the scientists to automate data processing, carry out additional interactive statistical analysis on specific results and adapt the visualization to their needs while the experiment progresses. Scientists also can use MSI QuickView offline for post-experimental analysis. See Figure 1.
Figure 1: MSI QuickView built with REXAN components
MSI QuickView converts the data from the proprietary instrument format to MAT (MATLAB binary format). The tool then measures the intensity value for each scan to obtain an ‘intensity vs. time’ spectrum for the line, and generates a heat map line of the 2-D intensity image from it.
The process is repeated for each line until a heat map for the full sample is obtained. During the line-by-line scan of the sample the user can change aspect ratios of the heat map, set and adjust filtering processes, create secondary comparative statistics and optimize the visualization contrasts. Due to the large dimensions of the data sets, the scientists need automated support in identifying patterns as well as features of interest. The tool offers a range of automated feature detection and classification methods as well as access to 3-D visualizations using ParaView for the exploration of the complete dataset as it evolves. MSI QuickView is used in production and has not only enabled improved data taking, but in combination with the automated sample position control, it allows the scientists to carry out many more experiments in a shorter time span.
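The line-by-line assembly described above can be sketched in a few lines of code. This is a toy illustration under stated assumptions: real nano-DESI data arrives in a proprietary format and is converted to MAT files first, and all function names here are hypothetical, not the MSI QuickView implementation.

```python
# Minimal sketch of line-by-line heat-map assembly: each scan line yields
# an intensity spectrum, and lines are appended to a growing 2-D image so
# it can be visualized while the experiment is still in progress.
# All names are illustrative; this is not the MSI QuickView code.
from typing import List

def scan_to_intensities(raw_scan: List[float]) -> List[float]:
    """Stand-in for the per-scan intensity measurement (toy version)."""
    return [abs(v) for v in raw_scan]

def append_line(heat_map: List[List[float]],
                line: List[float]) -> List[List[float]]:
    """Add one completed scan line to the growing 2-D heat map."""
    heat_map.append(line)
    return heat_map

heat_map: List[List[float]] = []
for raw_line in [[1, -2, 3], [4, 5, -6]]:   # two toy scan lines
    heat_map = append_line(heat_map, scan_to_intensities(raw_line))

print(heat_map)  # rows = scan lines, columns = positions along each line
```

Because the image grows one row at a time, filtering, contrast adjustment and feature detection can run on the partial image after every appended line, which is what makes the in-experiment feedback described above possible.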
From days to minutes: using REXAN to evaluate microbial biofilms
Since MSI QuickView, the team has developed an additional set of tools based upon REXAN, such as an analysis tool for X-ray micro tomography, where we not only improved the analysis itself through new capabilities, but also reduced the analysis time from days to minutes. See Figure 2.
Figure 2: TomographyView built with REXAN components
The tool was tested in the domain of microbial biofilms. Biofilms present a new way to investigate problems relevant to medicine and the health industries (e.g. periodontitis, cystic fibrosis, pneumonia, and infections of catheters and prosthetic joints). To gain a better understanding of the function and interaction of a biofilm within its natural system, the scientists needed new capabilities for visualization, compositional analysis and functional characterization. We developed BiofilmView, which offers scientists the required tools for quantitative analysis, segmentation, feature detection and qualitative visualization. Other tools created as part of BiofilmView included support functions for the integrated analysis of scanning transmission X-ray microscopy and transmission electron microscopy results, as well as more applications for tomography.
REXAN: Reduce, reuse and recycle
With the increasing number of tools, the REXAN team found that it could at times reuse a significant range of components initially developed for previous tools, drastically reducing development time and effort while providing our user scientists with new and unprecedented in-situ capabilities.
Path to next steps
The team’s next direct challenges will be to enable the remote use of REXAN-based analysis tools (using near real-time analytical capabilities at the home organization or in the cloud while carrying out experiments elsewhere in the country) and to run scalability tests, in particular creating near real-time analysis for a new dynamic transmission electron microscope (DTEM) system under development, which is expected to produce two terabytes of data per second and requires complex analytical processes, including comparative analysis with modeling results. Thinking further ahead to the long-term use of REXAN, our team is currently working on a semantic repository system for REXAN’s library components that we can distribute, and to which others can contribute.
In the short term, we are interested in helping other scientists utilize REXAN’s capabilities to build new, near real-time analysis tools, as well as in expanding the range and efficiency of REXAN components. Furthermore, we are particularly interested in how REXAN could be used in an industrial setting and are actively looking for partners.
Gleaning future value
In the long term, I expect that we will have a REXAN repository that anyone can download. The repository will include both open-source and proprietary components that scientists and companies can use to create near real-time analysis tools for their needs. Any user will be able to add his or her own components to the knowledge store of their downloaded repository version. If appropriate, scientists can then contribute those new components back to the central repository (with adequate attribution) or keep them secure for their own use only.
We see REXAN as a new approach to enable a new class of experiments by offering an affordable, sustainable way of creating near real-time experimental analysis tools.
About Kerstin Kleese van Dam
Kerstin Kleese van Dam is currently associate division director and lead of the Scientific Data Management group at the U.S. Department of Energy’s Pacific Northwest National Laboratory.
Prior positions include director of computing at the Bio-Medical Faculty at University College, London; IT program manager and lead of the Scientific Data Management Group at the Science and Technology Facilities Council in the United Kingdom; HPC specialist at the German Climate Computing Center (DKRZ); and Software Developer at INPRO, a research institute of the German automotive industry.
Kleese van Dam has led collaborative data management and analysis efforts in scientific disciplines such as molecular science (e-minerals), materials (e-materials, materials grid), climate (PRIMA, NERC Data Services, U.S. Department of Energy’s climate science for a sustainable energy future), biology (DOE’s Bio Knowledgebase Prototype Project, integrative biology), and experimental facilities (ICAT, chemical imaging). Her research is focused on data management and analysis in extreme scale environments. She is the 2006 recipient of the British Female Innovators and Inventors Silver Award.