November 12, 2014

Google Targets Big Genome Data

Google wants to leverage the infrastructure that runs its dominant search engine, Google Maps, and Gmail to help analyze and share big genomic data.

Google Genomics is being promoted as a way to store growing DNA data in the cloud. Users could load and export genomic data for free, then pay about $25 a year for storage and queries. Storage is calculated in terabytes per month, queries in millions of API calls.

The search giant joins other cloud giants in the growing competition to store the skyrocketing amount of data as more genomes are being shared, compared and linked. The results of those comparisons are expected to drive the emerging big genomic data sector. Google Genomics claims to allow comparisons “in seconds with SQL-like queries.”
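The "SQL-like queries" claim can be made concrete with a toy sketch. The snippet below uses an in-memory SQLite table as a stand-in; the table name, columns, sample IDs, and variant data are invented for illustration and are not Google Genomics' actual schema.

```python
# Toy illustration of comparing variant calls across genomes with plain SQL.
# All names and data here are hypothetical -- SQLite stands in for a cloud
# query engine purely to show the shape of the query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE variants (
        sample_id TEXT,    -- which genome the call came from
        chromosome TEXT,
        position INTEGER,
        reference TEXT,    -- reference allele
        alternate TEXT     -- observed alternate allele
    )
""")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("NA12878", "chr1", 10177, "A", "AC"),
        ("NA12878", "chr1", 10352, "T", "TA"),
        ("NA12891", "chr1", 10177, "A", "AC"),
        ("NA12891", "chr2", 10616, "C", "G"),
    ],
)

# Find variants shared by more than one sample -- the kind of cross-genome
# comparison the article describes.
shared = conn.execute("""
    SELECT chromosome, position, COUNT(DISTINCT sample_id) AS n_samples
    FROM variants
    GROUP BY chromosome, position, reference, alternate
    HAVING n_samples > 1
""").fetchall()
print(shared)  # [('chr1', 10177, 2)]
```

On a genuinely large variant store, the same GROUP BY/HAVING pattern is what a parallel query engine would fan out across many machines.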

Ultimately, Google says it is trying to converge data science and the life sciences to spur medical research.

The field is considered ripe for innovation. As Google points out, about 99.9 percent of human DNA is identical, but “in practice, the files start out much bigger because you need to do a lot of analysis to identify that zero point one percent that makes each of us unique.”

Since the human genome was first sequenced, the time and cost involved in gene sequencing have plummeted, according to Google, to about one day and $1,000. “With an exponential price drop like that, the volume of sequencing has exploded,” Google Genomics’ product manager Jonathan Bingham noted.

As the cost of DNA sequencing has dropped, data volumes have soared into the petabytes. Bingham estimates the size of each genome represents about 100 gigabytes of data. It is the mixing and matching of genomic data that is driving the need for storage that is linked to data analytics tools, cloud storage proponents insist.
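A quick back-of-envelope check shows how the article's two figures connect. The 100 GB-per-genome number comes from Bingham's estimate; the 10,000-genome cohort size below is an illustrative assumption, not a figure from the article.

```python
# Back-of-envelope: at roughly 100 GB of raw data per genome, sequencing
# volumes reach the petabyte scale quickly.
GB_PER_GENOME = 100       # Bingham's estimate, per the article
genomes = 10_000          # assumed cohort size for illustration

total_gb = GB_PER_GENOME * genomes
total_pb = total_gb / 1_000_000   # 1 PB = 1,000,000 GB (decimal units)
print(f"{genomes} genomes ≈ {total_pb:.0f} PB")  # 10000 genomes ≈ 1 PB
```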

Google’s pitch focuses on the capabilities of its current infrastructure that includes a search index of 100 petabytes and search query returns in about 0.25 seconds. The cloud provider is promising similar results for genetics researchers “without owning a datacenter,” Bingham asserted.

Google Genomics said during its I/O event in June it was working with the genomics community to define a standard API for working with big genomic data sets in the cloud. More recently, it announced it was implementing an API defined by the Global Alliance for Genomics and Health that covers data visualization and analysis.

“We are hosting public data that is available through the API and we’re building open-source software showing how to work with big genomic data using that API,” explained Bingham. Google also said its approach allows for analysis of genomic data via either interactive queries or through massively parallel processing.

Meanwhile, the Google unit said it is also offering data analytics tools such as App Engine, BigQuery, MapReduce and R on the Google Cloud Platform to sift through and share genomic data, said Bingham, who also heads Google’s efforts to merge cloud computing with life sciences.

The timing of Google Genomics appears propitious: The Global Alliance for Genomics and Health convened last month in San Diego to consider ways to “accelerate sharing of genomic and clinical data.”
