Transitioning from Big Data to Discovery: Data Management as a Keystone Analytics Strategy
The past 10-15 years have seen a stark rise in the density, size, and diversity of data generated across every scientific discipline. Chief among the causes has been the explosion of laboratory technologies that generate large amounts of data in life sciences and healthcare research. That data is now being stored in very large storage namespaces with little to no organization, amid a general unease about how to approach analyzing it. Effective data management practices and implementations are key to enabling discovery under such a large data burden.
A few years ago, the promise and hype of Big Data, driven largely by a torrent of powerful marketing campaigns from organizations that stood to gain from the associated sales, transformed how research was done across many scientific disciplines. Suddenly, the practice of designing experiments to output only the most relevant data gave way to the general sentiment that researchers should collect all information, regardless of its direct relevance. Big Data promised to enable computer-aided discoveries that could not be anticipated by careful planning of experiments, suggesting that humans alone were not capable of making the discoveries of the 21st century. Well-designed algorithms, analytics platforms, and a large amount of computing power would yield new discoveries that weren’t part of the original hypotheses. Big Data drove the plausibility of this hypothesis-generating form of research into overdrive.
For one of the first times in human history, the promise of scientific computing and the ability to find clues in data that were otherwise unfindable created a revolution in how research was done. Collect as much data on a subject as possible, save it all, analyze it in bulk, find the needle in the haystack, wipe hands on pants, publish, profit, repeat. This new paradigm fueled the drive to develop and release instrumentation that could collect more data on a large variety of assays, and do it for the least amount of money possible. In the life sciences, this led to advancements in next-generation sequencing (NGS), more powerful and automated image capture systems on light-based microscopes, new detectors on MRIs and electron microscopes, and data generation rates of multiple TB per day per laboratory. When you add up all of the laboratories throughout a large research organization, data production on the order of 2 PB per week becomes a present-day reality. These same institutions have reported amassing upwards of 200 PB of data, a total that continues to grow.
While the sheer volume of data being produced has created a windfall for storage companies, it has also created an enormous barrier for scientists and IT departments. The cost of storing all of that information, either on-prem or in the cloud, is staggering, and the number of skilled employees needed to manage those systems adds considerably to that cost. Hiring and training staff to manage the data was never accounted for when acquiring instrumentation or funding experimentation, which led to unanticipated overhead in research programs. Scientists, meanwhile, are having a hard time deciding how to sift through all of the data, making their data journey highly tedious and unpredictable (see Figure 1 below).
Much of the data that exists today was collected without any data management strategy in place. It was likely just dumped into some file and folder structure that made sense at the time and recorded in a spreadsheet somewhere, so that the decoder ring for the meaning of the data wouldn’t be lost in the ether forever. Even where the data was stored in a more structured manner, many organizations don’t have the computational or storage resources to analyze the datasets. Either the datasets are too large and the problems too difficult for the available storage systems and HPC resources, or the cost of moving all of that data to a cloud and spinning up enough instances to analyze it in a reasonable amount of time is beyond any reasonable grant or research budget. As a result, there is a general state of panic across the industry, with organizations asking the relevant question: what is a long-term strategy for dealing with this problem? This data has value, and human knowledge could emerge from it, but how do we maintain the data and analyze it in a sustainable manner?
Read the full story at sister website HPCwire.com.