September 12, 2012

Pushing Parallel Barriers Skyward

Ian Armas Foster

As much data as there exists on the planet Earth, the stars and the planets that surround them contain astronomically more. As we discussed earlier, Peter Nugent and the Palomar Transient Factory are using a form of parallel processing to identify astronomical phenomena.

Some researchers believe that parallel processing will not be enough to meet the huge data requirements of future massive-scale astronomical surveys. Specifically, several researchers from the Korea Institute of Science and Technology Information, including Jaegyoon Hahm, along with Yonsei University’s Yong-Ik Byun and the University of Michigan’s Min-Su Shin, wrote a paper arguing that the future of astronomical big data research is brighter with cloud computing than with parallel processing.

Parallel processing is holding its own at the moment. However, when these sky-mapping and phenomena-chasing projects grow significantly more ambitious around the year 2020, the researchers argue, parallel processing alone will not keep up.

How ambitious are these future projects? According to the paper, the Large Synoptic Survey Telescope (LSST) will generate 75 petabytes of raw plus catalogued data for its ten years of operation, or about 20 terabytes a night. That pales in comparison to the Square Kilometer Array, which is projected to archive in one year 250 times the amount of information that exists on the planet today.
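As a rough sanity check, the nightly rate can be scaled up to the ten-year total. The sketch below assumes observations on every night of the year and decimal units (1 PB = 1,000 TB); neither assumption comes from the paper, and the function name is purely illustrative.

```python
def total_petabytes(tb_per_night: float, years: float,
                    nights_per_year: float = 365.0) -> float:
    """Scale a nightly data rate in TB to a multi-year total in PB (decimal units)."""
    return tb_per_night * nights_per_year * years / 1000.0


# 20 TB a night over ten years of operation, as quoted for the LSST
lsst_total = total_petabytes(20, 10)
print(f"{lsst_total:.0f} PB")  # prints "73 PB"
```

Under those assumptions the total lands close to the 75 petabytes quoted, so the two figures are at least consistent with each other.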

“The total data volume after processing (the LSST) will be several hundred PB, processed using 150 TFlops of computing power. Square Kilometer Array (SKA), which will be the largest in the world radio telescope in 2020, is projected to generate 10-100PB raw data per hour and archive data up to 1EB every year.”

It may seem slightly absurd from a computing standpoint to plan for a project that does not start for another eight years. Eight years ago, the telecommunications world was still a couple of years away from the smartphone. Now the smartphones talk to us. The big data universe grows even faster, possibly as fast as the actual universe.

It is never a bad idea to identify possible paths to future success. Eight years from now, quantum computing may come around and knock all of these processing methods out of the big data arena. However, if that does not happen, cloud computing could potentially advance to the point where it can support these galactic ambitions.

“We implement virtual infrastructure service,” wrote Hahm et al in explaining their cloud’s test infrastructure, “on a commodity computing cluster using OpenNebula, a well-known open source virtualization tool offering basic functionalities to have IaaS cloud. We design and implement the virtual cluster service on top of OpenNebula to provide various virtual cluster instances for large data analysis applications.”

According to Hahm et al, the advantage essentially comes from using computing power from a cloud to act as one large computing entity, as opposed to carefully splitting up the task over parallel threads. It is akin to taking an integral of a function over the limits of integration as opposed to individually counting up all of the slices made up of height times little change in x.

“This massive data analysis application requires many computing time to process about 16 million data files. Because the application is a typical high throughput computing job, in which one program code processes all the files independently, it can gain great scalability from distributed computing environment. This is the great advantage when it comes with cloud computing, which can provide large number of independent computing servers to the application.”
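The high throughput pattern described in that passage, one program applied to every file independently, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the `analyze` function is a hypothetical stand-in for the real per-file analysis, the light curves are made-up in-memory lists, and local worker threads stand in for the independent cloud servers.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def analyze(curve):
    """Hypothetical per-file analysis: mean brightness of one light curve."""
    return mean(curve)


def run_high_throughput(curves, workers=4):
    """Apply the same analysis to every input independently.

    No task depends on any other, so adding workers needs no coordination
    beyond handing out inputs and collecting results in order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, curves))


# Made-up brightness samples standing in for three light-curve files
curves = [[1.0, 2.0, 3.0], [2.0, 2.0], [0.5, 1.5]]
results = run_high_throughput(curves)
print(results)  # prints [2.0, 2.0, 1.0]
```

Because the tasks never communicate, the same structure would apply whether the workers are threads on one machine or virtual machines in a cluster.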

To test this, the group analyzed data from SuperWASP, an England-based astronomical project with observatories in Spain and South Africa. Specifically, they examined 16 million light curves, which are used to locate extrasolar planets based on changes in the light coming from the potential planet’s host star. According to Hahm et al, “In this experiment we can learn that the larger and less input data files are more efficient than many small files when we design the analysis on large volume of data.”
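The article does not say how the small files were merged. One common approach, shown here only as an assumed sketch, is to concatenate them into a single large file and keep an index of byte offsets so each original record can still be read back. The file names and helper functions are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_files(paths, merged_path):
    """Concatenate small files into one file; return {name: (offset, length)}."""
    index = {}
    offset = 0
    with open(merged_path, "wb") as out:
        for p in paths:
            data = Path(p).read_bytes()
            out.write(data)
            index[Path(p).name] = (offset, len(data))
            offset += len(data)
    return index


def read_record(merged_path, index, name):
    """Read one original file's bytes back out of the merged file."""
    offset, length = index[name]
    with open(merged_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    small = []
    for i in range(3):
        p = tmp / f"curve_{i}.dat"  # hypothetical light-curve file
        p.write_bytes(f"light curve {i}\n".encode())
        small.append(p)
    merged = tmp / "merged.dat"
    index = merge_files(small, merged)
    recovered = read_record(merged, index, "curve_1.dat")

print(recovered)  # prints b'light curve 1\n'
```

Reading one large file replaces many separate open and close operations with a single open and a series of seeks, which is one plausible reason fewer, larger inputs would be more efficient.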

Looking at the graphs, the only significant difference between the merged big files and the many small files arises in system CPU time. Cloud computing still shows an advantage in user CPU time and ‘wall-clock time,’ but those differences seem small enough that cloud computing may not be the significant improvement Hahm et al hope it is.

“With the successful result of whole SuperWASP data analysis on cloud computing,” Hahm et al write, “we conclude that data-intensive sciences having trouble with large data problem can take great advantages from cloud computing.” Perhaps cloud computing has the advantage at the petabyte scale. But it seems likely that something completely different will have to be developed between now and 2020 before an exabyte can be processed in a year.
