October 28, 2013

PSC Receives Grant for Data Exacell

There is big data, and then there is really big data. While enterprises wrestle with terabytes, or even hundreds of terabytes of data, organizations like the European Bioinformatics Institute store as much as 20 petabytes of life sciences data.  Last week, the National Science Foundation (NSF) doled out a $7.6 million grant for an ongoing technology strategy to help alleviate the data deluge.

This deluge is a significant challenge for these multi-petabyte organizations because of the obvious costs associated with managing this amount of information. While storage is often thought of as a cheap, even free commodity, such is not the case when dealing with prodigious amounts of data that need to be run through analytical processes. In many cases, this data is stored in tape-based archives, which have plenty of advantages, though throughput speed is not one of them.

Addressing this issue, researchers at the Pittsburgh Supercomputing Center (PSC) developed what they call the “Data Supercell” (DSC). Deployed at PSC, the DSC is an amalgamation of technologies aimed at the heart of the conventional tape-based archive. Built around the SLASH2 File System, DSC technology stores massive amounts of data on a disk-based system at costs competitive with tape, while providing the lower latency and higher bandwidth needed for data-intensive activities, such as analytics.
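To make that tradeoff concrete, the hypothetical sketch below contrasts the two access patterns: with a disk-based, POSIX-mounted archive, analysis code can read data directly, whereas a tape archive typically requires a recall step before any reads can begin. The paths and the stage_from_tape command are illustrative placeholders only; they do not come from the article or from SLASH2 itself.

import subprocess
from pathlib import Path

# Illustrative placeholders only -- not paths or tools described in the article.
DISK_ARCHIVE_FILE = Path("/archive_disk/project/sample.dat")   # disk-based, POSIX-mounted archive
SCRATCH_FILE = Path("/scratch/sample.dat")                     # local staging area
TAPE_STAGE_CMD = ["stage_from_tape", "project/sample.dat", str(SCRATCH_FILE)]  # placeholder recall tool

def read_from_disk_archive() -> bytes:
    # Disk-based archive: the file is directly readable, so analysis starts immediately.
    return DISK_ARCHIVE_FILE.read_bytes()

def read_from_tape_archive() -> bytes:
    # Tape-based archive: the file must first be recalled from tape to disk,
    # adding latency before the same read can happen.
    subprocess.run(TAPE_STAGE_CMD, check=True)
    return SCRATCH_FILE.read_bytes()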

Last week, the center announced that it had landed a grant from the NSF to expand on the Data Supercell idea and begin developing a prototype of the next generation of the technology, one that adds collaborative analysis: the Data Exacell (DXC).

“What’s needed is a distributed, integrated system that allows researchers to collaboratively analyze cross-domain data without the performance roadblocks that are typically associated with big data,” said Nick Nystrom, director of strategic applications at PSC, in a statement. “One result of this effort will be a robust, multifunctional system for big data analytics that will be ready for expansion into a large production system.”

According to the NSF award abstract, PSC will be tasked to “implement and bring to production quality additional functionalities important to such work.”

Per the NSF:

“[This] includes improved local performance, additional abilities for remote data access and storage, enhanced data integrity, data tagging and improved manageability. PSC will work with partners in diverse fields of science, initially chosen from biology, astronomy and computer science, who will provide scientific and technology drivers and system validation. The project will leverage current NSF/CI investments in data analytics systems at PSC.”

The NSF investment will aim to leverage analytical data systems already in place at PSC, including Blacklight, an SGI UV1000 shared-memory system, and Sherlock, a YarcData Urika graph analytics appliance. Blacklight, in particular, will receive technology upgrades aimed at increasing its capacity to handle larger analytical workloads.

According to the NSF award, the organization hopes to leverage the investment well into the future. Once the new DXC system has been developed, a process expected to take at least four years, the NSF anticipates that PSC will pursue yet another iteration: a larger-scale deployment aimed at exascale capacity.

Related items:

Standing on the Shoulders of (Hadoop) Giants 

On Algorithm Wars and Predictive Apps 

Rometty: The Third Era and How to Win the Future 
