March 12, 2012

A Floating Solution for Data-Intensive Science

Nicole Hemsoth

A recent assertion from a team at Argonne National Laboratory proposed a simple yet still “fringe” answer to an increasingly pressing question for scientists in data-intensive fields.

Before we get there, however, it might go without saying that all scientific disciplines are data-intensive now, especially following the explosion in sensors and data collection gear sparked by the era of limitless mobility.

And speaking of disclaimers, it’s probably also worth pointing out that what’s “fringe” for the academic research community can sometimes be considered really out there by the average enterprise IT shop.

But not in this case…for once, it’s the scientific community that is catching up with what the rest of the business world already knows…that when it comes to infrastructure, sometimes the “easy” way (yes, even for potentially world-altering research endeavors) can be the best way. In other words, the fringe is hanging off the other side of the fence.

To explain all of that, let’s back up one last time and park beside the original idea posed by the Argonne team for a moment. As they imply, the problems of big data are really not so different from enterprise information troubles, just at a much larger scale at times.

As Dr. Ian Foster and his group write, “Data from specialized instrumentation, numerical simulations, and downstream manipulations must be collected, indexed, archived, shared, replicated and analyzed. These tasks are not new, but the complexities involved in performing them for terabyte or larger datasets (increasingly common across scientific disciplines) are quite different from those that applied when data volumes were measured in kilobytes.”

This leads to what Foster and comrades refer to as a “computational crisis” in the research community, one that continues to mount in intensity as data continues to flood in despite the absence of adequate data management tools. At a high level, this crisis is really not much different (other than in scale) from those at businesses of every size as they see the flood of data and try to make meaningful use of it in efficient ways.

On that note, it’s worth stating that it’s not just the tools themselves that might be in short supply for some institutions. Even if there were sufficient technologies in place to handle the deluge of scientific data from an ever-growing range of instruments, sensors and devices, the skills required to make use of such tools are in short supply and the learning curve is oftentimes steep. For national labs or research institutions this means valuable time is spent hands-on with the tools rather than the science that necessitated them.

According to Foster and colleagues, there could be an answer to these problems that are set to grow in the era of data-intensive science. The team says that the solution is to provide research data management capabilities to users as a hosted (yes, cloud) service.

For those who flit about in enterprise IT circles, this idea of outsourcing these challenges to the cloud is nothing short of, well, obvious–assuming there aren’t regulatory, security or other policy/paranoia reasons preventing it.

But for the academic community? Cloud, especially in a commodity, commercial sort of service like that provided for the plebes by companies like Amazon—now this is really something of a fringe concept. And if not as a concept, as a fringe use case. After all, when you have enormous clusters at your disposal and all the free brain-labor a PhD or fellowship program can provide to support newfangled data management approaches, why should this be a thought?

Well, the answer seems to go back to that issue of personnel and letting scientists focus on science versus wrapping their heads around the latest big data Hadoopla.

Foster and company say that by using the software as a service model (where applications are hosted remotely and accessed via the web on any number of devices), scientists in data-intensive fields can realize some of the benefits the cloud has brought to big web-based businesses and to smaller companies that required advanced software but lacked the resources (capital, human, and otherwise) to make use of it before simply “renting” it from services like those provided by Amazon, among others.

The researchers say that currently, the costs of research data lifecycle management are quickly mounting as data becomes larger and more complex. However, software as a service or cloud approaches are “a promising solution, outsourcing time-consuming research data management tasks to third-party services.”

From the paper again, “as demonstrated in many business and consumer tools, SaaS leverages intuitive Web 2.0 interfaces, deep domain knowledge, and economies of scale to deliver capabilities that are easier to use, more capable and/or more cost-effective than software accessed through other means….The opportunity for continuous improvement via dynamic deployment of new features and bug fixes is also significant, as is the potential for expert operators to intervene and troubleshoot on the user’s behalf.”

The problem with the idea of clouds for data-intensive science, however, is data movement. No matter which hosted service one is using, it is not cheap and often not even speedy to work in a cloud environment with massive data problems. The Argonne team points to a solution for that problem as well in the form of Globus Online, a data movement service for researchers. While that warrants its own discussion in the near term, the big question here is what else is preventing data-intensive science from finding its way to the commodity cloud?

If data movement alone is the reason, and tools like Globus Online fulfill their promise to the future of big data-driven science, then what will stand between this moment and a new era of science in the ether?
