January 7, 2013

Can Extragalactic Data Be Standardized? Part 2

Ian Armas Foster

The Taiwan Extragalactic Astronomical Data Center (TWEA-DC) launched an initiative to standardize how the wealth of astronomical data is processed. In the second of two parts, we discuss how the data center itself is designed to optimize software development. The first part, which examines the design of a Domain Specific Language for astronomy, can be found here.

Last week, we profiled an effort headed by the Taiwan Extragalactic Astronomical Data Center (TWEA-DC) to standardize astrophysical computer science.

Specifically, the objective laid out by the TWEA-DC team was to create a language designed specifically for far-reaching astronomy: a Domain Specific Language (DSL). This would create a standard environment in which software could be developed.

For the researchers at the TWEA-DC, one of the bigger issues lies in the software currently being developed for big data management. Sebastien Foucaud and Nicolas Kamennoff, along with their Taiwan-based colleagues Yasuhiro Hashimoto and Meng-Feng Tsai, co-authored the paper laying out the TWEA-DC. They argue that since parallel processing is a relatively recent phenomenon, many programmers have never been taught how to properly optimize their software; developers, they note, have been brought up in a world where computing power steadily increases, which leaves little incentive to write efficient code.

Indeed, preparing a new generation of computer scientists and astronomers is a main focus of the data center that opened in 2010. “One of the major goals of the TWEA-DC,” the researchers say, “is to prepare the next generation of astronomers, who will have to keep up pace with the changing face of modern Astronomy.”

Astronomy in the 21st century is less about pointing powerful telescopes at the sky and more about building a comprehensive model of the universe around us in a Virtual Observatory (VO). Advances in astronomy, and in how its data are processed, are geared toward enhancing the VO.

As such, extragalactic images (images from beyond the Milky Way) frequently have to be compared against a large database to resolve various elements, such as what sort of objects are in the picture and the precise location of the selected area. “Astronomy is now based on large datasets, covering a broad wavelength range,” the researchers explain. “The challenge is to aggregate the information and generate a final product that will bridge different expertise and generate an enhanced scientific output.”

Currently, the data center holds 24 terabytes of storage, a respectable figure today but one that will likely need to grow in coming years. Meanwhile, data is transmitted between the center’s main server and its data units at 500 MB/s, while a backup server operates at 125 MB/s.

Further, per the specifications laid out by the researchers, the TWEA-DC’s data warehouse draws on MySQL relational databases along with astronomy-specific International Virtual Observatory Alliance (IVOA) tools, including the Data Access Layer (DAL) protocols and the VOTable format. Users then interact with the data via a web application.
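
To make that concrete, here is a rough sketch of how a client might query a cone-search service (one of the IVOA DAL protocols) and read the VOTable it returns, using Python’s astropy library. The endpoint URL and coordinates are hypothetical stand-ins, not a published TWEA-DC interface:

    # Minimal sketch: query an IVOA Simple Cone Search service and parse
    # the VOTable it returns. The endpoint URL is a hypothetical
    # placeholder; a real service would publish its own. Requires astropy.
    from urllib.parse import urlencode
    from urllib.request import urlopen
    from astropy.io.votable import parse_single_table

    # RA/DEC/SR are the standard Simple Cone Search parameters (degrees).
    BASE_URL = "http://example.org/twea-dc/scs?"
    params = urlencode({"RA": 150.1, "DEC": 2.2, "SR": 0.05})

    with urlopen(BASE_URL + params) as response:
        table = parse_single_table(response).to_table()

    # The VOTable arrives as an astropy Table; columns depend on the service.
    print(table.colnames)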

At first glance, the specifications of the data center, which is not yet fully operational, do not seem all that impressive. The focus, however, is on creating a software development-friendly environment, and as a result the middleware is reportedly top notch. “The strength of our DC is actually its middleware, standing between user interface and data: its frontend is managed as a web service and it is designed to run on parallel IT systems.” Top-end middleware is a step toward ensuring a smooth relationship between software developers and data. The team expects to make some tools available to the astronomy and computer science communities in the spring of this year, while the entire archive is scheduled for release in the fall.

The result of all this? For the most part, it is simply the infrastructure under which the real research will start this coming year. However, the TWEA-DC team has already developed some indexing software that they call “Billion Line INdexing in a ClicK,” or BLINK. This core software, based on the Hierarchical Triangular Mesh (HTM, detailed in a 2007 paper) and built to support the two main astronomical tasks of locating and identifying objects, will serve as the scaffolding upon which other software is built.
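
The paper does not publish BLINK’s internals, but the HTM idea itself can be sketched: the sphere is carved into eight octahedral triangles, each recursively split into four, so every sky position maps to a nested triangle ID and nearby objects share ID prefixes. The simplified Python sketch below illustrates the scheme; it does not reproduce the exact ID bit layout of the reference HTM implementation:

    # Minimal sketch of Hierarchical Triangular Mesh (HTM) point location:
    # eight octahedral root triangles, each recursively divided into four
    # children; a sky position maps to a nested triangle ID.
    from math import cos, sin, radians, sqrt

    def radec_to_vec(ra_deg, dec_deg):
        """Unit vector for a right ascension / declination in degrees."""
        ra, dec = radians(ra_deg), radians(dec_deg)
        return (cos(dec) * cos(ra), cos(dec) * sin(ra), sin(dec))

    def norm(v):
        length = sqrt(sum(c * c for c in v))
        return tuple(c / length for c in v)

    def midpoint(a, b):
        return norm(tuple(x + y for x, y in zip(a, b)))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def inside(p, v0, v1, v2):
        """True if p lies within the spherical triangle (v0, v1, v2)."""
        eps = -1e-12  # tolerate points that sit exactly on an edge
        return (dot(cross(v0, v1), p) >= eps and
                dot(cross(v1, v2), p) >= eps and
                dot(cross(v2, v0), p) >= eps)

    # The octahedron's six vertices and its eight root triangles.
    V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
    ROOTS = [(V[1], V[2], V[0]), (V[2], V[3], V[0]),
             (V[3], V[4], V[0]), (V[4], V[1], V[0]),
             (V[2], V[1], V[5]), (V[3], V[2], V[5]),
             (V[4], V[3], V[5]), (V[1], V[4], V[5])]

    def htm_id(ra_deg, dec_deg, depth=8):
        """Nested triangle ID: root index, then one base-4 digit per level."""
        p = radec_to_vec(ra_deg, dec_deg)
        tri_id = next(i for i, (a, b, c) in enumerate(ROOTS) if inside(p, a, b, c))
        v0, v1, v2 = ROOTS[tri_id]
        for _ in range(depth):
            w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
            children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
            for child, (a, b, c) in enumerate(children):
                if inside(p, a, b, c):
                    tri_id = tri_id * 4 + child
                    v0, v1, v2 = a, b, c
                    break
        return tri_id

    print(htm_id(150.1, 2.2))  # nearby positions share long ID prefixes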

In order to process and index a billion lines in an instant, they designed BLINK to run on heterogeneous parallel systems: a peer-to-peer distributed system connects the computers in the TWEA-DC. In the future they hope to index images on more variables than identity and location, such as density, shape, and flux. The researchers say the first implementation of BLINK will be available in early 2013.
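
The paper’s description leaves the distribution details open, but the flavor of splitting such an indexing job across machines can be sketched as follows. This hypothetical example reuses the htm_id function from the sketch above, with Python’s multiprocessing standing in for the real peer-to-peer layer; rows are partitioned by their HTM root triangle, so spatially neighboring objects land on the same worker:

    # Hypothetical sketch of spreading catalog indexing across workers.
    # The real BLINK is peer-to-peer; multiprocessing stands in for that
    # layer here. Assumes the htm_id function defined above.
    from multiprocessing import Pool
    from collections import defaultdict

    N_WORKERS = 4

    def index_partition(rows):
        """Build a trixel-ID -> objects index for one catalog partition."""
        index = defaultdict(list)
        for ra, dec, obj in rows:
            index[htm_id(ra, dec, depth=6)].append(obj)
        return dict(index)

    def parallel_index(catalog):
        # Partition rows by HTM root triangle (depth 0) so each worker
        # owns whole regions of sky; trixel keys never collide across
        # partitions, so the partial indexes merge trivially.
        parts = defaultdict(list)
        for ra, dec, obj in catalog:
            parts[htm_id(ra, dec, depth=0) % N_WORKERS].append((ra, dec, obj))
        with Pool(N_WORKERS) as pool:
            partial = pool.map(index_partition, list(parts.values()))
        merged = {}
        for p in partial:
            merged.update(p)
        return merged

    if __name__ == "__main__":
        catalog = [(150.1, 2.2, "obj-a"), (150.2, 2.3, "obj-b"),
                   (10.0, -45.0, "obj-c")]
        print(parallel_index(catalog))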

The paper also mentions a couple of advanced techniques they plan to develop to determine redshift (a shift in the light from an object that reveals how fast it is moving relative to the Earth) and density, but the main point here is in the software. BLINK, if the researchers incorporate it as they spelled out in the paper, would be a good start in handling big data efficiently across parallel systems. However, according to Kamennoff and Foucaud in the Domain Specific Language paper, a significant paradigm shift has to happen in the development community before the potential of big data in astronomy can be fully realized.
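
For reference, redshift in its simplest form is just the fractional shift of a known spectral line. The arithmetic below uses the rest wavelength of the hydrogen-alpha line together with a hypothetical measured value:

    # Redshift as the fractional shift of a spectral line; positive z
    # means the source is receding from the Earth.
    lambda_emitted = 656.3   # H-alpha rest wavelength, nanometers
    lambda_observed = 660.0  # hypothetical measured wavelength
    z = (lambda_observed - lambda_emitted) / lambda_emitted
    print(f"z = {z:.4f}")    # ~0.0056, a mild recession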

The data specifications of the TWEA-DC may not yet be impressive, but the team hopes to prod that shift by making quick and efficient use of the data it has and ultimately fitting it into comprehensive astronomical datasets.

Related Articles

Can Extragalactic Data Be Standardized?

A Big Data Revolution in Astrophysics

Sky Survey Data Lacks Standardization
