November 19, 2012

BioInformatics: A Data Deluge with Hadoop to the Rescue

Marty Lurie

The Data Deluge

Have you had your genome mapped today? A quick web search reveals that genome sequencing is available as a consumer product for only $299. How far we’ve come from the 13-year effort to map the human genome! (See http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml.)

Genome mapping is only one aspect of BioInformatics, however. Medical records, clinical trials, adverse drug reactions, and medical imaging are just a few of the many applications that generate the current deluge of medical data. 

BioInformatics as a discipline has been advancing by leaps and bounds. What started as a few computer-savvy researchers writing Perl scripts is now taught in hundreds of formal courses at leading universities. Numerous open source projects help researchers deal with all the new information sources coming their way.

Apache Hadoop and BioInformatics

Your genomic data is “big.” Reading DNA accurately is difficult, so the readers scan it multiple times and in both directions, which makes for larger files. The result is big enough that a single traditional computer struggles to process all the base-pair sequences that represent how you, as a human, are constructed. Apache Hadoop is an open source project for managing big data that uses commodity computers, lashed together in a cluster, to operate on massive files rapidly. You can take your genomic file, even if it is 300GB, and let Hadoop-based tools align and sort the reads and look for variants in your DNA to help doctors provide better medical care.
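To make the divide-and-conquer idea concrete, here is a minimal single-machine sketch of the map/reduce pattern that Hadoop distributes across a cluster. It is not Hadoop code. The function names (map_chunk, reduce_counts, count_bases), the chunk size, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(sequence, chunk_size):
    """Cut one long sequence into fixed-size pieces, much as a cluster splits a big file."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def map_chunk(chunk):
    """Map step: count the bases in one chunk, independently of every other chunk."""
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_bases(sequence, chunk_size=4):
    """Run the map step on every chunk, then fold the partial results together."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(sequence, chunk_size)]
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    toy_sequence = "GATTACAGATTACA"  # hypothetical data, not a real DNA read
    print(dict(count_bases(toy_sequence)))
```

Because each map step touches only its own chunk, the chunks could be handed to different machines. That independence is what lets a cluster of commodity computers work on one massive file at the same time.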

There are several BioInformatics software applications that run on Hadoop to help researchers analyze DNA, including Seal, Bowtie, CloudBurst, and Crossbow. If you go to pubmed.gov, a National Institutes of Health (NIH) site, and search for the term “hadoop”, you’ll see refereed publications on how to use this excellent parallel processing environment for medical research.

Cloudera and BioInformatics

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera, let’s work through an example with real BioInformatics data: FAERS, the FDA Adverse Event Reporting System.

FDA Adverse Drug Reaction Data, Cloudera, and Impala

Use case description: query the FDA data to determine which drugs listed in the incident reporting database were reported with an outcome of hospitalization. (The query used here also counts deaths and life-threatening events.)

We will make a number of simplifying assumptions in the data model and query. Please don’t debate whether we are double-counting outcomes in the result. The purpose of this example is to show you how rapidly you can get started. The fact that we can debate double counting illustrates the versatility of Hadoop: rather than struggle to get to an answer using existing database technology, we can load up all the data, start to work with it, and then debate what the right answer is based on real experience with the data.
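For readers curious about where double counting could come from, here is a small sketch under stated assumptions. It assumes one incident report (one isr value) can list several drugs and several outcomes, so a join on isr pairs every drug with every outcome of that report. The rows and drug names below are invented for illustration and are not FAERS data.

```python
from collections import Counter

# Hypothetical rows: (isr, drugname) and (isr, outcome). Not real FAERS data.
drugs = [("1001", "DRUG_A"), ("1001", "DRUG_B"), ("1002", "DRUG_A")]
outcomes = [("1001", "HO"), ("1001", "LT"), ("1002", "HO")]


def join_on_isr(drug_rows, outcome_rows):
    """Inner join on the report id: each drug row pairs with every outcome of the same report."""
    outcomes_by_isr = {}
    for isr, outcome in outcome_rows:
        outcomes_by_isr.setdefault(isr, []).append(outcome)
    return [
        (isr, drugname, outcome)
        for isr, drugname in drug_rows
        for outcome in outcomes_by_isr.get(isr, [])
    ]


def rows_per_report(joined_rows):
    """Count how many joined rows each single report contributes."""
    return Counter(isr for isr, _, _ in joined_rows)


if __name__ == "__main__":
    joined = join_on_isr(drugs, outcomes)
    print(len(joined), "joined rows from", len({isr for isr, _, _ in joined}), "reports")
    print(dict(rows_per_report(joined)))
```

In this toy data, report 1001 has two drugs and two outcomes, so it contributes four joined rows while report 1002 contributes one. Whether that is the right way to count is exactly the kind of question that is easier to settle once the data is loaded.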

Here are the steps:

  1. Download and run the Cloudera Impala VMware image
  2. Use wget to download the FDA incident reporting files
  3. Unzip and load the data into HDFS – the Hadoop file system
  4. Create the Hive metadata to access the files
  5. Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive
  6. Just for fun run the same query in Hive to compare performance

Here we go:

  • Download and run the Cloudera Impala VMware image

https://downloads.cloudera.com/demo_vm/vmware/cloudera-impala-demo-vm-cdh4.1.1-vmware.tar.gz

  • Use wget to download the FDA incident reporting files

wget http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/UCM319844.zip

 

  • Unzip and load the data into HDFS – the Hadoop file system

unzip UCM319844.zip

hadoop fs -mkdir outcomeDir

hadoop fs -mkdir drugDir

hadoop fs -put OUTC12Q2.TXT outcomeDir

hadoop fs -put DRUG12Q2.TXT drugDir
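Before or after loading, it can help to glance at how the files are laid out. The sketch below is one hedged way to do that in Python. It assumes the files are plain text with fields separated by the '$' character and that the first line is a header row naming the fields, which you should verify against your own download. The function name summarize_delimited and the sample lines are hypothetical.

```python
from collections import Counter


def summarize_delimited(lines, delimiter="$"):
    """Return the first line's fields and a histogram of field counts for the remaining lines."""
    rows = [line.rstrip("\r\n").split(delimiter) for line in lines if line.strip()]
    if not rows:
        return [], Counter()
    header, data = rows[0], rows[1:]
    return header, Counter(len(row) for row in data)


if __name__ == "__main__":
    # Hypothetical sample lines, assuming a header row followed by data rows.
    sample = ["ISR$OUTC_COD\n", "1001$HO\n", "1002$DE\n"]
    header, widths = summarize_delimited(sample)
    print(header, dict(widths))
```

To try it on a real file, pass an open file object, for example summarize_delimited(open("OUTC12Q2.TXT")). If the field-count histogram shows more than one width, some rows have missing or extra fields and deserve a closer look.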

  • Create the Hive metadata to access the files

hive

You are now at the Hive prompt and can enter the table definitions:

drop table outcomes;

create external table outcomes (
  isr string,
  outcome string)
row format delimited fields terminated by '$'
location '/user/cloudera/outcomeDir';

drop table drugs;

create external table drugs (
  isr string,
  drug_seq string,
  role_cod string,
  drugname string,
  val_vbm string,
  route string,
  dose_vbm string,
  dechal string,
  rechal string,
  lot_num string,
  exp_dt string,
  nda_num string)
row format delimited fields terminated by '$'
location '/user/cloudera/drugDir';
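The table definitions above do not move or convert any data. They only tell Hive how to interpret each line of the files when a query reads them. The sketch below approximates that idea in Python. It reflects my understanding of how Hive's delimited text format behaves (missing trailing fields read as NULL, extra fields ignored), which you should treat as an assumption. The parse_row helper and the sample line are hypothetical.

```python
# Column names copied from the "drugs" table definition.
DRUG_COLUMNS = [
    "isr", "drug_seq", "role_cod", "drugname", "val_vbm", "route",
    "dose_vbm", "dechal", "rechal", "lot_num", "exp_dt", "nda_num",
]


def parse_row(line, columns, delimiter="$"):
    """Split one line on the delimiter and label each field with its column name.

    Missing trailing fields become None (standing in for NULL);
    fields beyond the last declared column are dropped.
    """
    fields = line.rstrip("\r\n").split(delimiter)
    padded = fields[:len(columns)] + [None] * (len(columns) - len(fields))
    return dict(zip(columns, padded))


if __name__ == "__main__":
    # Hypothetical line with only the first four fields present.
    row = parse_row("1001$1$PS$DRUG_A\n", DRUG_COLUMNS)
    print(row["isr"], row["drugname"], row["nda_num"])
```

Since every column is declared as a string, no type conversion happens at read time, which keeps the example forgiving of messy input.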

Exit Hive by typing quit; or by pressing Control-C.

  • Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive

 impalascripts/start-impala-state-store.sh

 impalascripts/start-impalad.sh

 

time impala-shell --impalad=127.0.0.1:21000 --query_file=drugQuery.sql

 

  • Just for fun run the same query in Hive to compare performance

time hive -f drugQuery.sql

Oh, right, you need the contents of drugQuery.sql. Create the file before running the two timed commands above:

$ cat drugQuery.sql

select drugname, outcome, count(*) nmocc
from drugs
join outcomes on (drugs.isr = outcomes.isr)
where outcome in ('DE', 'LT', 'HO')
group by drugname, outcome
order by nmocc desc
limit 100;
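If SQL is not your first language, the query's logic can be sketched in a few lines of Python: join on isr, keep only the three outcome codes, count each (drugname, outcome) pair, and return the most frequent pairs. This is an illustration of the logic only, not how Impala or Hive execute it. The function name top_drug_outcomes and the toy rows are assumptions. The outcome codes are the ones used in the query (DE for death, LT for life-threatening, HO for hospitalization).

```python
from collections import Counter

# Outcome codes kept by the query's WHERE clause.
SERIOUS_OUTCOMES = {"DE", "LT", "HO"}


def top_drug_outcomes(drug_rows, outcome_rows, limit=100):
    """Mimic the query: join on isr, filter outcomes, count pairs, sort by count descending."""
    outcomes_by_isr = {}
    for isr, outcome in outcome_rows:
        if outcome in SERIOUS_OUTCOMES:
            outcomes_by_isr.setdefault(isr, []).append(outcome)
    counts = Counter(
        (drugname, outcome)
        for isr, drugname in drug_rows
        for outcome in outcomes_by_isr.get(isr, [])
    )
    return [(drugname, outcome, n) for (drugname, outcome), n in counts.most_common(limit)]


if __name__ == "__main__":
    # Hypothetical rows, not real FAERS data.
    toy_drugs = [("1", "A"), ("2", "A"), ("3", "A"), ("3", "B"), ("4", "B")]
    toy_outcomes = [("1", "HO"), ("2", "HO"), ("3", "HO"), ("3", "DE"), ("4", "OT")]
    for row in top_drug_outcomes(toy_drugs, toy_outcomes, limit=3):
        print(row)
```

On data that fits in memory this runs instantly. The point of Impala and Hive is to run the same logic when the files are far too large for one machine.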

Summary

Wow, if you made it this far you’ve had an exciting day! On my laptop the Impala query took less than 1 second. Your mileage may vary; be careful not to overcommit memory and end up paging. Hive is still a great query environment, but in this case it took 1 minute 37 seconds to get the same result.

Which drug is associated with the most hospitalizations reported in the adverse event database? You’ll have to run the query to find out.

We at Cloudera would be delighted to work with you on solving your BioInformatics challenges.

More information about Impala can be found here: http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/impala-real-time-queries-in-hadoop-video-recording.html
