March 12, 2012

A Floating Solution for Data-Intensive Science


A team at Argonne National Laboratory recently proposed a simple yet still “fringe” answer to an increasingly pressing question for scientists in data-intensive fields.

Before we get there, however, it might go without saying that all scientific disciplines are data-intensive now, especially following the explosion in sensors and data collection gear sparked by the era of limitless mobility.

And speaking of disclaimers, it’s probably also worth pointing out that what’s “fringe” for the academic research community can sometimes be considered really out there by the average enterprise IT shop.

But not in this case. For once, it’s the scientific community that is catching up with what the business world already knows: when it comes to infrastructure, sometimes the “easy” way (yes, even for potentially world-altering research endeavors) can be the best way. In other words, the fringe is hanging off the other side of the fence.

To explain all of that, let’s back up one last time and park beside the original idea posed by the Argonne team for a moment. As they imply, the problems of big data in science are not so different from enterprise information troubles, just at times at a much larger scale.

As Dr. Ian Foster and his group write, “Data from specialized instrumentation, numerical simulations, and downstream manipulations must be collected, indexed, archived, shared, replicated and analyzed. These tasks are not new, but the complexities involved in performing them for terabyte or larger datasets (increasingly common across scientific disciplines) are quite different from those that applied when data volumes were measured in kilobytes.”

This leads to what Foster and comrades refer to as a “computational crisis” in the research community, one that continues to mount in intensity as data continues to flood in despite the absence of adequate data management tools. At a high level, this crisis differs little, other than in scale, from those at businesses of every size as they watch data flood in and try to make meaningful, efficient use of it.

On that note, it’s worth stating that it’s not just the tools themselves that might be in short supply for some institutions. Even if there were sufficient technologies in place to handle the deluge of scientific data from an ever-growing range of instruments, sensors and devices, the skills required to make use of such tools are in short supply and the learning curve is oftentimes steep. For national labs or research institutions this means valuable time is spent hands-on with the tools rather than the science that necessitated them.

According to Foster and colleagues, there could be an answer to these problems that are set to grow in the era of data-intensive science. The team says that the solution is to provide research data management capabilities to users as a hosted (yes, cloud) service.

For those who flit about in enterprise IT circles, this idea of outsourcing these challenges to the cloud is nothing short of, well, obvious, assuming there aren’t regulatory, security or other policy/paranoia reasons preventing it.

But for the academic community? Cloud, especially as a commodity, commercial sort of service like that provided for the plebes by companies like Amazon, is really something of a fringe concept. And if not a fringe concept, then a fringe use case. After all, when you have enormous clusters at your disposal and all the free brain-labor a PhD or fellowship program can provide to support newfangled data management approaches, why should this even be a thought?

Well, the answer seems to go back to that issue of personnel and letting scientists focus on science versus wrapping their heads around the latest big data Hadoopla.

Foster and company say that by using the software as a service model (where applications are hosted remotely and accessed via the web on any number of devices), scientists in data-intensive fields can realize some of the benefits the cloud has brought to big web-based businesses, as well as to smaller companies that needed advanced software but lacked the resources (capital, human, and otherwise) to make use of it until they could simply “rent” it from services like those provided by Amazon, among others.

The researchers say that currently, the costs of research data lifecycle management are quickly mounting as data becomes larger and more complex. However, software as a service or cloud approaches are “a promising solution, outsourcing time-consuming research data management tasks to third-party services.”

From the paper again, “as demonstrated in many business and consumer tools, SaaS leverages intuitive Web 2.0 interfaces, deep domain knowledge, and economies of scale to deliver capabilities that are easier to use, more capable and/or more cost-effective than software accessed through other means….The opportunity for continuous improvement via dynamic deployment of new features and bug fixes is also significant, as is the potential for expert operators to intervene and troubleshoot on the user’s behalf.”

The problem with the idea of clouds for data-intensive science, however, is data movement. No matter which hosted service one is using, working in a cloud environment with massive datasets is not cheap and often not fast. The Argonne team points to a solution for that problem as well in the form of Globus Online, a data movement service for researchers. While that warrants its own discussion in the near term, the big question here is what else is preventing data-intensive science from finding its way to the commodity cloud?

If data movement alone is the reason, and tools like Globus Online fulfill their promise for the future of big data-driven science, then what will stand between this moment and a new era of science in the ether?
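
To put the data movement concern in rough numbers, here is a back-of-envelope sketch of how long it might take to push a terabyte-scale dataset to a hosted service. The link speeds are illustrative assumptions rather than figures from the Argonne paper, and `transfer_hours` is a hypothetical helper, not part of Globus Online or any cloud provider's tooling.

```python
def transfer_hours(dataset_bytes: float,
                   link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Estimate hours needed to move a dataset over a network link.

    efficiency is the assumed fraction of the nominal link rate that is
    actually sustained (1.0 means the full rate, which is optimistic).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


TB = 10**12  # one terabyte, decimal definition

# Assumed link speeds, chosen only for illustration.
ASSUMED_LINKS = [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]

if __name__ == "__main__":
    for label, rate in ASSUMED_LINKS:
        print(f"1 TB over {label}: {transfer_hours(TB, rate):.1f} hours")
```

Under these assumptions, a single terabyte takes roughly 22 hours at a sustained 100 Mbit/s and a little over 2 hours at 1 Gbit/s, before any protocol overhead or per-gigabyte transfer charges are counted.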


Discussion

There is 1 discussion item posted.

how long will it take?
Submitted by Richard Altmaier on Mar 13, 2012 @ 4:13 PM EDT


I'm wondering how long (years) it will take for a data analysis application to emerge, functioning on the cloud storage platform. This feels like the earliest days of cluster computing, where people had to start from scratch in redesigning analysis applications to be cluster friendly.
I am anxious to see the first proof point: including data upload timing, storage size and pricing(!!), and data analysis time.
Does Argonne have data?
Thanks! Rich
