NIH Effort Looks to Compress Big Genomics Data
Steady progress in “precision,” or genomic, medicine is bringing with it a growing need to manage soaring data volumes. For example, a single human genome sequence constitutes a roughly 140-gigabyte data file.
In an effort to get its arms around genomic and other data, the National Institutes of Health (NIH) recently awarded a $1.3 million contract to researchers at the University of Illinois and Stanford University to develop new data compression approaches. The award is one of several new software development efforts within NIH’s Big Data to Knowledge Initiative.
The focus of the NIH big data initiative is supporting research and development on new approaches and tools “to maximize and accelerate the integration of big data and data science into biomedical research,” the agency said.
Other federal agencies are also targeting big data for the biosciences. Earlier this week, the National Science Foundation (NSF) said its Division of Mathematical Sciences would collaborate with the NIH big data initiative to address biomedical data science projects. NSF and NIH said they are soliciting proposals through Aug. 6, 2015, for a joint program called Quantitative Approaches to Biomedical Big Data.
“One of the critical application areas at the interface of the biomedical and data sciences is precision (or personalized) medicine,” NSF noted in its solicitation. “Achieving the goal of precision medicine will require combining data across multiple formats and developing novel, sophisticated mathematical, statistical, and computational methods that facilitate high-confidence predictions for individuals.”
The agencies expect to award one-year “planning grants” of less than $100,000 per grant.
Along with the genomic big data compression study, NIH is also funding research on interoperable biomedical data repositories and “early stage development” of biomedical computing, informatics and big data science.
The university data compression effort will focus on more efficient ways of representing genomic information stored in a dataset. In one example, a long sequence of “As” could be represented as “A times 50,” researchers said.
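The “A times 50” idea is commonly known as run-length encoding. The following Python snippet is a minimal sketch of that general technique, not the researchers' software; the function names `rle_encode` and `rle_decode` are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def rle_encode(seq):
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]


def rle_decode(pairs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(base * count for base, count in pairs)


# Illustrative input: fifty "A"s followed by three other bases.
sequence = "A" * 50 + "CGT"
encoded = rle_encode(sequence)
print(encoded)
assert rle_decode(encoded) == sequence
```

Run-length encoding only pays off when long runs are common; for sequences with few repeats, the list of pairs can end up larger than the original text.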
Genomic data also lends itself to compression because sequences are written in a relatively small alphabet (DNA uses just four letters: A, C, G and T) and often contain much repetition. Similar techniques developed in the 1990s for video compression led directly to applications like high-definition television.
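One way to see why a small alphabet helps: a four-letter DNA alphabet needs only two bits per base, so four bases fit in one byte instead of four. The Python sketch below illustrates that general idea under stated assumptions; it is not the method the researchers describe, it handles only the letters A, C, G and T, and the names `pack` and `unpack` are hypothetical.

```python
# Assumed two-bit codes for the four DNA bases (an arbitrary but common choice).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads bases in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


sequence = "GATTACA"
packed = pack(sequence)
assert unpack(packed, len(sequence)) == sequence
print(len(sequence), "bases stored in", len(packed), "bytes")
```

The original length has to be stored alongside the packed bytes, because the final byte may contain padding that would otherwise decode as extra “A” bases.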
The university researchers said their primary goal is “development of a suite of data compression software that will handle several types of genomic data, including DNA sequence data, meta-genomic data, quality scores for sequences and data from gene functional analyses.”
“While compression of each data type requires a unique approach, the group hopes to identify aspects of compression strategies that are transferable across many types of genomic data,” the researchers added.
The effort also will deliver data compression algorithms, their analysis, software prototyping and “benchmarking on real data.”