Wrangling Big Data to Fight Pediatric Cancer
High-performance computing (HPC) and the cloud are enabling vast improvements in scientists’ ability to simulate and analyze data, and genetic sequencing and research are accessible to more scientists, researchers and medical professionals than ever before.
But a new bottleneck has emerged: we are drowning in data. The trick is how to efficiently manage the volume and complexity of that data while making it secure yet accessible to many.
To address the big data bottleneck, Dell is building a unique cloud environment for a pediatric cancer trial in conjunction with the Translational Genomics Research Institute (TGen) and the Neuroblastoma and Medulloblastoma Translational Research Consortium (NMTRC).
The collaborative effort is creating a model for how to use HPC and cloud computing to simplify information access and sharing and bridge the information gaps between science and medicine. Through the trial, scientists and oncologists are identifying targeted and personalized treatments for children fighting neuroblastoma.
The cloud will provide the additional computing capacity to support the “real time” processing of patient tumors and prediction of the best drug therapy for a specific patient, based on the genetic makeup of that child’s tumor.
This clinical trial involves dozens of scientific and medical partners across the country. Providing information technology to analyze laboratory results and to support collaboration across a secure network of clinical sites is crucial to creating a knowledge base that supports clinical decision-making.
Because TGen’s research is so cutting edge, scientists and doctors require flexibility to follow their research as it evolves. The effort involves studying tumor samples from patients, getting the genomic sequencing data from lab instruments, analyzing that data, reporting the findings to a tumor board and ultimately using the results to make decisions about the best treatment for the patient.
One of the chief challenges was that the newest lab instruments generate raw data at a rate that is growing faster than Moore’s Law would predict. The quantity of data produced by a single instrument is doubling about every 12 months, while at the same time the cost to analyze it is falling by half.
The end result is that the total amount of genomic data being generated is doubling nearly every six months. Moreover, the data objects produced are complex files with important metadata properties about the samples they came from and the instruments that produced them. The files can also be extremely large, up to 3TB depending on the instrument. The data associated with a particular patient is currently about 200TB and growing. Because this is an active area of research, the data needs to be kept available to validate and compare analysis algorithms.
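One way to read the step from a 12-month doubling to a six-month doubling is as two growth factors compounding. The short Python sketch below works through that arithmetic. It assumes, which the article does not state, that the halving in cost happens over the same 12 months and is spent on twice as much instrument throughput. The function and variable names are illustrative only.

```python
import math


def doubling_time_months(growth_factor_per_year: float) -> float:
    """Months needed for a quantity to double, given its yearly growth factor."""
    return 12 * math.log(2) / math.log(growth_factor_per_year)


# From the article: output per instrument doubles about every 12 months.
per_instrument_growth = 2.0

# Assumption (not from the article): half the cost buys twice the throughput
# over the same 12 months.
capacity_growth = 2.0

# The two factors multiply, giving roughly 4x more data per year overall.
combined_growth = per_instrument_growth * capacity_growth

combined_doubling_months = doubling_time_months(combined_growth)
print(f"Combined doubling time: about {combined_doubling_months:.0f} months")
```

Under these assumptions, a fourfold yearly increase corresponds to a doubling roughly every six months, which matches the figure quoted above.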
Additionally, for this clinical trial there are 11 participating sites both generating and analyzing data. A hybrid approach was required to manage the data coming from the instruments and to keep large amounts of data accessible to all of the sites, facilitating collaboration in a secure, cost-effective manner. It was also important to localize data near HPC capacity, both in the cloud and on premises, to speed analysis and validation.
The cloud became the medium of exchange for data as well as analysis capabilities, allowing researchers to share their raw information as well as algorithms for analyzing that data. As a result, TGen and its collaborators can quickly turn data into knowledge, knowledge into diagnostics, diagnostics into therapies, and therapies into better quality of life for patients.
A colleague of mine coined the phrase “cloud-to-ground” to describe the architecture built to address these issues: an environment that could manage data and not just archive it. We needed to create a virtual library of data that could be accessed by researchers and allow data to be checked out and analyzed using HPC capabilities.
We are using Dell’s technology to integrate premises-based capabilities (the ground) with virtual capabilities (the cloud). This provides the framework to move the data smoothly through the research lifecycle, protect it, and make it available for future use. Data can be ingested at various sites, moved to the cloud and then made available for analysis either in premises-based HPC environments or in any HPC cloud environment.
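The "virtual library" idea described above can be pictured with a toy catalog. The Python sketch below is purely illustrative and is not Dell's or TGen's actual system: the class names, fields, site labels and in-memory storage are all hypothetical, chosen only to mirror the three steps named in the text (ingest at a site, move to the cloud, check out to an HPC environment).

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One file in the library, with the kind of metadata the article mentions."""
    object_id: str
    sample_id: str
    instrument: str
    size_tb: float
    location: str  # hypothetical labels: "ground:<site>" or "cloud"


@dataclass
class VirtualLibrary:
    """Hypothetical in-memory catalog; a real system would track real storage."""
    objects: dict = field(default_factory=dict)
    checkouts: dict = field(default_factory=dict)  # object_id -> set of HPC targets

    def ingest(self, obj: DataObject) -> None:
        """Register a data object produced at a site (the ground)."""
        self.objects[obj.object_id] = obj

    def move_to_cloud(self, object_id: str) -> None:
        """Record that the object now lives in shared cloud storage."""
        self.objects[object_id].location = "cloud"

    def check_out(self, object_id: str, hpc_target: str) -> DataObject:
        """Make a cloud-resident object available to an HPC environment."""
        obj = self.objects[object_id]
        if obj.location != "cloud":
            raise ValueError(f"{object_id} has not been moved to the cloud yet")
        self.checkouts.setdefault(object_id, set()).add(hpc_target)
        return obj


# Example flow with made-up identifiers.
library = VirtualLibrary()
library.ingest(DataObject("run-001", "sample-A", "sequencer-1", 3.0, "ground:site-1"))
library.move_to_cloud("run-001")
library.check_out("run-001", "ground:site-2-hpc")  # premises-based HPC
library.check_out("run-001", "cloud-hpc")          # cloud HPC
```

The point of the sketch is the design choice rather than the code: the catalog manages where data lives and who is using it, instead of simply archiving files.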
The unique challenges of personalized medicine require us to address data volume, complexity and locality, as well as collaboration. By creating integrated hybrid cloud environments, we can harness the power of Big Data and unleash the potential of personalized medicine.
About the Author
Anatol Blass, Ph.D., is a System Consultant with Dell Healthcare and Life Sciences. He has worked with leading academic, research and biotechnology organizations to integrate and analyze laboratory data and create knowledge. As the lead architect for Dell’s collaboration with TGen, he is working to address the technology challenges of the world’s first personalized medicine clinical trial for pediatric cancer.