While several common threads wind through the life sciences and “omics” studies, the one parallel that emerges in nearly all biological disciplines is terabyte- and petabyte-scale data sets that need to be efficiently culled.
We tend to hear most often about the “big data” challenges of genomic research, but researchers in other areas, including microbiology, are finding that they require more advanced tools and approaches to contend with complex and sizable datasets.
Michigan State’s Dr. C. Titus Brown, assistant professor of Computer Science and Engineering and of Microbiology and Molecular Genetics, is one of many notable researchers in biology who know a thing or two about this type of massive data.
Brown heads MSU’s lab for Genomics, Evolution, and Development, which recently undertook the task of researching large collections of data centered around the genomes of soil-based microbes. The project was so complex, however, that it toppled the systems designed to crunch the big microbial data. So Brown’s team did what researchers do best—they fashioned a solution to solve a specific problem.
“We had no idea,” Brown told us, “what options there were when we started. When we started, we were trying to solve biological problems. We have all this data and we need to somehow do some sort of analysis on it. We weren’t experts in data assembly, we weren’t experts in data structures and algorithms and we had no idea what the options were.”
But eventually they found options that led to solutions, solutions that did without Hadoop’s big data capabilities. Brown ended up shrinking the data, likening the process to taking a map and scrunching up little bits of it so that it was smaller but still made sense. He calls the results probabilistic data structures, since they discard repeated or unnecessary data.
“These probabilistic data structures are sort of the flip side of the coin to random algorithms. Random algorithms are algorithms that rely on the average behavior being really good but sometimes the worst case behavior is really bad. When you use probabilistic data structures, you can take out problems where they’re guaranteed to give you absolutely horrible performance and you could also find problems which will give you really good performance.”
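The article does not name the exact structure Brown uses, but a Bloom filter is a classic example of the kind of probabilistic data structure he describes: it answers set-membership queries in a fixed amount of memory, trading exactness for space by allowing occasional false positives but never false negatives. A minimal, illustrative sketch (the class and its parameters are ours, not Brown’s implementation):

```python
import hashlib

class BloomFilter:
    """Fixed-memory set membership with a tunable false-positive rate.

    A toy illustration of a probabilistic data structure: items can be
    added and queried in constant memory, at the cost of occasional
    false positives (but never false negatives).
    """

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must be <= 8 with a SHA-256 digest
        self.bits = 0  # a Python int serves as an arbitrary-length bit array

    def _positions(self, item):
        # Derive several bit positions from one cryptographic hash.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter(num_bits=10_000, num_hashes=4)
bf.add("ATGGCA")            # store a short DNA k-mer
print("ATGGCA" in bf)       # True -- an inserted item is never missed
print("TTTTTT" in bf)       # False with overwhelming probability at this load
```

The “flip side of the coin” Brown mentions is visible here: the structure never gives catastrophic worst-case behavior, only a small, quantifiable chance of a wrong “yes.”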
The approach is well outlined in this piece written by Brown. According to Brown, the percolated data stays relatively accurate until there’s about an 18% chance of a missed connection; at that point, he notes, the structure goes sharply from being ‘basically accurate’ to ‘no longer even remotely accurate.’
It is important to note that the 18% is not a figure representing how much data was cut, but rather the likelihood of a failure resulting from the cutting. Brown and his team were able to filter out a significant portion of the data: he notes that they got workable results after whittling 60 gigabytes down to three or four.
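The article does not spell out how the filtering works, but Brown’s lab has published a method called digital normalization, which discards sequencing reads whose short subsequences (k-mers) have already been seen often enough, shrinking the data while preserving most of the underlying information. A toy sketch of that idea, with illustrative parameter choices (not the production implementation):

```python
from collections import Counter

def normalize(reads, k=4, coverage_cutoff=3):
    """Keep a read only while the median count of its k-mers is below
    the cutoff; redundant reads are discarded.

    A toy sketch loosely modeled on the digital-normalization idea;
    k and coverage_cutoff here are illustrative, not recommended values.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if median < coverage_cutoff:
            kept.append(read)       # novel enough: keep it
            counts.update(kmers)    # and remember its k-mers
    return kept

reads = ["ATGGCATG"] * 10 + ["TTACCGTA"]
print(len(normalize(reads)))  # 4 -- the repeated read is kept only 3 times
```

The repeated read stops contributing new information after a few copies, so the rest are dropped, while the single novel read always survives, which is the sense in which the shrunken map “still makes sense.”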
“What we did in the paper was we sort of analyzed the limits. We said, ‘For a given amount of memory, for 500 MB of memory, you can store a billion nodes in the memory.’ So it gives you a specific relationship between how much data you use and how much you can store.”
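If one assumes a simple Bloom-filter-style structure (an assumption on our part; the article doesn’t name the exact structure), the standard false-positive formula, p = (1 − e^(−kn/m))^k for n items, m bits, and k hash functions, lets us sanity-check the memory relationship Brown describes:

```python
import math

def false_positive_rate(num_bits, num_items, num_hashes):
    """Standard Bloom-filter estimate: p = (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

m = 8 * 500 * 1024**2  # 500 MB of memory, expressed in bits
k = 1                  # simplest case: a single hash function

# Largest n that keeps the error rate at the ~18% threshold Brown cites,
# found by inverting p = 1 - e^(-n/m) for k = 1.
n = int(-math.log(1 - 0.18) * m)
print(f"{n / 1e9:.2f} billion nodes fit in 500 MB at an 18% error rate")
```

The result lands on the order of a billion nodes in 500 MB, consistent with the relationship Brown describes between memory used and data stored.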
Of course, there are disadvantages: there is still a limit to how much data can be analyzed with this technique. But the advantage, according to Brown, is that the computers know their limits and no longer die trying to allocate memory that doesn’t exist.
While effective, the technique took about three years to develop, and as it exists now it applies only to microbial genomes. Brown is okay with that, however, considering those were exactly the problems his team was aiming to solve. “The fourth paper is really about the effects that theory and engineering have on actually doing biology. We’re looking at incredibly complicated microbial populations in soil and we’re able to take several hundred gigabytes of data and distill it down to three or four gigabytes of actual genomic sequence.”
Several hundred gigabytes is a significant improvement over the sixty previously mentioned. But the technology, according to Brown, is improving in not just volume but velocity as well. Six months ago, Brown notes, it generally took them six to eight weeks to analyze data that took a week to collect. Their benchmark is to analyze that same data in less time than it takes to collect it. “With the new release of our software, we’re pretty sure we can get that down to two to three days. We’re talking about 300 to 600 gigabytes of data, and not including transferring the data around, it’ll take under a week to analyze that data.”
Brown sees this technique being used to solve microbial problems relating to both the environment and biomedicine. On the environmental side, Brown is focusing on the striking finding that soil-based microbes in farming fields contribute a significant amount of carbon dioxide to the atmosphere, such that agriculture may actually be more responsible for global warming than we think.
“To understand how and why that happens, you need to understand what the microbes in the soil are actually doing. And that’s where the ability to take the vast amount of data, analyze it, and distill it into some kind of functional information is essentially impossible without the technique we developed.”
Brown also pointed to the Human Microbiome Project already under way, which is designed to study the effects of the billions of microbes that live in our bodies. While those microbes are not as diverse, Brown’s technique could still prove useful and efficient. “Our technique makes it much easier to use the stuff that you get when you gather a bunch of data about what microbes are present and what function they might be performing.”
Brown is helping to solve a significant big data problem: there is plenty of data out there, but it is difficult to teach a computer to discern which data is useful and which is not. Brown has taught his computers to do just that, even if only for his very specific problem. Perhaps his techniques can be carried over to other fields, which would be quite the exciting prospect.