November 19, 2012

BioInformatics: A Data Deluge with Hadoop to the Rescue

Nicole Hemsoth

Sponsored Content

The Data Deluge

Have you had your genome mapped today? A quick search on the web reveals that genome sequencing is available as a consumer product for only $299. How far we’ve come from a 13-year effort to map the human genome (see http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml).

Genome mapping is only one aspect of BioInformatics, however. Medical records, clinical trials, adverse drug reactions, and medical imaging are just a few of the many applications that generate the current deluge of medical data.

BioInformatics as a discipline has been advancing by leaps and bounds. What started as a few computer-savvy researchers writing Perl scripts is now taught in hundreds of formal courses at leading universities. Numerous open source projects assist researchers in dealing with all the new information sources coming their way.

Apache Hadoop and BioInformatics

Your genomic data is “big.” Reading DNA accurately is difficult, so the readers do bi-directional multi-scan operations, which makes for larger files. The data is big enough that traditional computers struggle to process all the base-pair sequences that represent how you, as a human, are constructed. Apache Hadoop is an open source project for managing big data that uses commodity computers, lashed together in a cluster, to operate on massive files rapidly. You can take your genomic file, even if it is 300GB, and let Hadoop-based tools align, sort, and look for variants in your DNA to help doctors provide better medical care.

There are several BioInformatics software applications that run on Hadoop to help researchers analyze DNA, including Seal, Bowtie, CloudBurst, and Crossbow. If you go to pubmed.gov, a National Institutes of Health (NIH) site, and search for the term “hadoop”, you’ll see refereed publications on how to use this excellent parallel processing environment for medical research.

Cloudera and BioInformatics

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera, let’s work through an example with BioInformatics data: the FDA Adverse Event Reporting System (FAERS).

FDA Adverse Drug Reaction Data, Cloudera, and Impala

Use case description: query the FDA data to determine which drugs listed in the incident reporting database resulted in hospitalization.

We will make a number of simplifying assumptions in the data model and query. Please don’t debate if we are double-counting outcomes in the result – the purpose of this example is to show you how rapidly you can get started. The fact that we can debate double counting illustrates the versatility of Hadoop: Rather than struggle to get to an answer using existing database technology we can load up all the data, start to work with it, and then debate what the right answer is based on real experience with the data.

Here are the steps:

Download and run the Cloudera Impala VMware image
Use wget to download the FDA incident reporting files
Unzip and load the data into HDFS – the Hadoop file system
Create the Hive metadata to access the files
Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive
Just for fun, run the same query in Hive to compare performance

Here we go:

Download and run the Cloudera Impala VMware image

https://downloads.cloudera.com/demo_vm/vmware/cloudera-impala-demo-vm-cdh4.1.1-vmware.tar.gz

Use wget to download the FDA incident reporting files

wget http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/UCM319844.zip

Unzip and load the data into HDFS – the Hadoop file system

unzip UCM319844.zip

hadoop fs -mkdir outcomeDir

hadoop fs -mkdir drugDir

hadoop fs -put OUTC12Q2.TXT outcomeDir

hadoop fs -put DRUG12Q2.TXT drugDir
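
Before defining the tables, it can be worth confirming that every record has the number of '$'-separated fields you expect; the drugs table in the next step declares twelve columns. The following is a minimal sketch that runs on a small made-up sample rather than the FDA download. The file name sample_drug.txt and its rows are hypothetical, and the real files may differ in details such as a header row, so inspect your own copy before relying on it.

```shell
# Build a tiny, made-up sample in the same '$'-delimited layout
# (hypothetical rows, not real FDA records).
cat > sample_drug.txt <<'EOF'
1001$1$PS$ASPIRIN$1$ORAL$$$$$$
1002$1$PS$WARFARIN$1$ORAL$$$$$$
1002$2$SS$ASPIRIN$1$ORAL$$$$$$
EOF

# Tally lines by field count: each output line is "<fields> <number of lines>".
# A single line reading "12 3" means all three rows have twelve fields.
field_counts=$(awk -F'$' '{ n[NF]++ } END { for (k in n) print k, n[k] }' sample_drug.txt | sort -n)
echo "$field_counts"
```

To try the same check on the real data, point awk at DRUG12Q2.TXT instead of the sample file.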

Create the Hive metadata to access the files

hive

You are now at the Hive prompt and can enter the table definitions.

drop table outcomes;

create external table outcomes (

isr string,

outcome string)

row format delimited fields terminated by '$'

location '/user/cloudera/outcomeDir'

;

drop table drugs;

create external table drugs (

isr string,

drug_seq string,

role_cod string,

drugname string,

val_vbm string,

route string,

dose_vbm string,

dechal string,

rechal string,

lot_num string,

exp_dt string,

nda_num string

)

row format delimited fields terminated by '$'

location '/user/cloudera/drugDir'

;

Exit Hive with Ctrl-C (or type quit;)

Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive

impalascripts/start-impala-state-store.sh

impalascripts/start-impalad.sh

time impala-shell --impalad=127.0.0.1:21000 --query_file=drugQuery.sql

Just for fun, run the same query in Hive to compare performance

time hive -f drugQuery.sql

Oh, right, you need the contents of drugQuery.sql:

$ cat drugQuery.sql

select

drugname, outcome, count(*) nmocc

FROM drugs

JOIN outcomes on (drugs.isr=outcomes.isr)

where

outcome in ('DE', 'LT', 'HO')

group by drugname, outcome

order by nmocc DESC

limit 100;
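
To see what the query computes without starting a cluster, here is a rough sketch of the same join, filter, and count in awk, run on a few made-up rows. The sample file names and records are hypothetical and carry only the first four drug columns. This illustrates the logic of drugQuery.sql and is not how Impala or Hive execute it.

```shell
# Tiny made-up samples in the '$'-delimited layout
# (hypothetical rows, not real FDA records).
cat > sample_outc_join.txt <<'EOF'
1001$HO
1002$HO
1002$DE
1003$OT
EOF
cat > sample_drug_join.txt <<'EOF'
1001$1$PS$ASPIRIN
1002$1$PS$WARFARIN
1002$2$SS$ASPIRIN
1003$1$PS$IBUPROFEN
EOF

# First file (outcomes): remember the DE/LT/HO outcomes seen for each isr.
# Second file (drugs): add one count per matching (drugname, outcome) pair.
# The final sort orders by count descending, then drug name, then outcome.
result=$(awk -F'$' '
  NR == FNR { if ($2 == "DE" || $2 == "LT" || $2 == "HO") out[$1] = out[$1] " " $2; next }
  ($1 in out) { n = split(out[$1], o, " "); for (i = 1; i <= n; i++) c[$4 "\t" o[i]]++ }
  END { for (k in c) print c[k] "\t" k }
' sample_outc_join.txt sample_drug_join.txt | sort -t "$(printf '\t')" -k1,1nr -k2,2 -k3,3)
echo "$result"
```

In this sample, report 1002 lists two drugs and two qualifying outcomes, so it contributes four counts. That is the double-counting question raised earlier.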

Summary

Wow, if you made it to here you’ve had an exciting day! On my laptop the Impala query took less than 1 second. Your mileage may vary; be careful not to overcommit memory and end up paging. Hive is still a great query environment, but in this case it took 1 minute 37 seconds to get the same result.

Which drug is associated with the most hospitalizations in the adverse event reporting database? You’ll have to run the query to find out.

We at Cloudera would be delighted to work with you on solving your BioInformatics challenges.

More information about Impala can be found here: http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/impala-real-time-queries-in-hadoop-video-recording.html

Vendors: Cloudera

Tags: apache, apache hadoop, bioinformatics, cloudera, Hadoop, hpc
