Genomics, a field of study where researchers try to pinpoint a relatively small number of important genes in a sea of DNA, is a perfect testing ground for big data. Further, the process of sequencing genes has become exponentially cheaper and faster over the last decade, deepening the sea in which genomics researchers have to fish.
A few months ago, we highlighted an algorithm presented by C. Titus Brown and Michigan State University which essentially produced a reduced map of the genomes they were studying.
Now, a new genomics-based formula is garnering national attention. Last year, Brown University computer science professors Eli Upfal, Ben Raphael, and Fabio Vandin developed an algorithm called HotNet which, according to its website, is “an algorithm for finding significantly altered subnetworks in a large gene interaction network.” The algorithm was ultimately used to find mutated genes in cancerous cells, attracting the attention of the medical industry.
As a result of this research, the National Science Foundation and the National Institutes of Health have awarded the professors $1.5 million in additional funding. With the funding, the Brown University team hopes to make the algorithm more accurate at determining which mutations are important.
After all, not all mutated genes will necessarily contribute to the development of cancer. Therein lies the challenge: not only finding the mutations but also establishing, with statistical confidence, that a particular mutation out of many is relevant.
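To make the idea of statistical confidence concrete, here is a minimal sketch of one simple way to ask whether a gene is mutated more often than chance would predict. It is not HotNet and does not reflect how the Brown team's algorithm works; it is a plain one-sided binomial test with a Bonferroni correction, and the gene names, sample count, and background mutation rate are all hypothetical values chosen for illustration.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def significant_genes(mutation_counts, n_samples, background_rate, alpha=0.05):
    """Return {gene: p_value} for genes mutated more often than expected.

    mutation_counts: number of samples (out of n_samples) in which each gene
    is mutated. background_rate: assumed per-sample chance that a gene is
    mutated by accident (a hypothetical, illustrative parameter).
    """
    # Bonferroni correction: testing many genes at once inflates false
    # positives, so the per-gene threshold is divided by the number of tests.
    threshold = alpha / len(mutation_counts)
    hits = {}
    for gene, count in mutation_counts.items():
        p_value = binomial_tail(count, n_samples, background_rate)
        if p_value <= threshold:
            hits[gene] = p_value
    return hits


# Hypothetical data: three made-up genes observed across 100 tumor samples.
counts = {"GENE_A": 30, "GENE_B": 4, "GENE_C": 2}
hits = significant_genes(counts, n_samples=100, background_rate=0.02)
print(sorted(hits))
```

In this toy data only the gene mutated in 30 of 100 samples clears the corrected threshold; the other two are mutated at rates that chance alone could plausibly produce. Real analyses have to contend with far more genes, uneven background rates, and interactions between genes, which is part of what makes the problem hard.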
Of course, while these algorithms would be remarkably useful to the healthcare industry, the team has higher aspirations. Upfal et al. hope eventually to extend these capabilities beyond cancer genomics to other large datasets.
“These datasets have all the good and bad properties of Big Data,” said Upfal. “They’re big, noisy, and require very complicated statistical analysis to obtain useful information.”
If that process of filtering through tons of worthless or irrelevant information to find the nuggets of insight sounds familiar, it may be because it describes almost every large dataset that a company working with big data has had to handle.