January 2, 2013

Can Extragalactic Data Be Standardized?

Ian Armas Foster

The Taiwan Extragalactic Astronomical Data Center (TWEA-DC) launched an initiative to standardize how the wealth of astronomical data is processed. In the first of two parts on the TWEA-DC, we discuss the design of a Domain Specific Language for astronomy. The second part, which examines how the TWEA-DC itself is designed to optimize software development, can be found here.

While it lacks the direct practical applications that the study of genomics offers, astronomy is one of the more compelling big data use cases in academic research.

The wealth of stars and other astronomical phenomena that one can identify and classify provides an intriguing challenge. The long-term goal is to use the information from astronomical surveys to model the universe.

However, according to recent research by French computer scientists Nicolas Kamennoff, Sebastien Foucaud, and Sebastien Reybier, the gradual decline of Moore’s Law and the resulting lack of computing power, combined with the ever-expanding ability to see outside the Milky Way, are creating a significant bottleneck in astronomical research. In particular, software has yet to catch up to strides made in parallel processing.

The researchers identified three problems they hope to solve through the TWEA-DC: misuse of resources, the existence of a heterogeneous software ecosystem, and data transfer.

According to the authors heading an initiative at TWEA-DC, “As part of the setup of a new Data Center in Taiwan we aim at designing an open-source, distributed solution to enhance data analysis capabilities.” This solution, if it is realized, would create a standardized astrophysical programming language designed specifically for handling vast amounts of other-worldly data.

The researchers point to two attempts that have been made of late: SAMP (Simple Application Messaging Protocol) and FASE (Future Astronomical Software Environment). However, both projects are still in their infancy. Per the researchers, SAMP only partially handles data transfer and environment issues, while FASE is still in the proof-of-concept phase.

With that being said, the team grants the usefulness of those two projects, especially with regard to finding the bottlenecks and inefficiencies. The project based in Taiwan is also in its formative stages, as much of the paper outlined a roadmap for future study.

The researchers note a complex problem regarding software design: many developers were trained on single-core systems, which by definition do not process in parallel. Since the most powerful systems these days employ parallel processing, this has to an extent led to a dearth of software that makes efficient use of high-powered astronomical systems.

That issue is a root of the resource misuse problem as a whole. They suggest software developers were brought up under the impression that growth in computing power would not slow, and so optimization concerns were for the most part eschewed. “Because of limitations due to electromigration and sub-threshold conduction, the increase of processor speed has stopped in the past decade.”

As a result, parallel processing CPUs and GPUs were introduced, and software has not quite adapted to them. As the paper noted, “Software developers usually do not exploit efficiently the computing resources, because of unfortunate habits resulting from single core development.”

Developers are relatively slow in adapting to the new parallel processing environment, hampering astronomical research. However, there exist other software issues as well.

For example, the lack of a unified environment in which the software may operate is hampering astronomical software development. This is not limited to astrophysics, as many sectors of the big data industry have had to deal, or are currently dealing, with a heterogeneous ecosystem. Frequently, the solution has been to introduce another language infrastructure entirely. Indeed, the name of the paper is “Development of an Astrophysical Specific Language for Big Data Computation.”

The paper mentioned frequently that creating a ‘monolithic software system’ was both overly ambitious and counter-productive. “Obviously a monolithic software approach does not make sense, but we advocate here that a modular distributed middleware is a valuable solution.”

The best thing one can do then is to create an atmosphere where various programs talk to each other easily and let that exist in an open source environment. But to what extent can open source help solve a problem when the programmers developing those open source pieces are behind the parallel processing curve?

While many astronomers are at least versed in various programming languages and advanced software tools such that they can run complex computations and could hypothetically participate in that open source development, the researchers here are focusing on IT professionals to design and implement that infrastructure.

“Considering the limited resources available to develop sustainable features and software in astronomy, sharing common parts of algorithms and data structures used by the different software is essential.” Again, the researchers point out that it is the domain of the IT professional, not the astrophysicist, to develop the new system. “Furthermore Astronomers do not always have sufficient training to deal with low level layers programming, which should be developed by IT specialists. We therefore propose to explore a Domain Specific Language for astronomical data analysis.”

No language, however, can make up for complicated scheduling dynamics. Highly efficient scheduling will play a key role in the infrastructure the team implements at the Taiwan Data Center. While the team has not yet revealed specifics, the paper mentioned the goals of the scheduling system: “Such a system will be able to identify available resources dynamically and select them for a specific request based on various characteristics: distance from the data, computation and memory capabilities, availability, reliability, etc.”
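Since the team has not published specifics, the following Python sketch is purely illustrative of what selecting a resource on those characteristics could look like. The `Node` fields, the scoring formula, and the example nodes are all assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one compute resource."""
    name: str
    distance_to_data: float  # assumed unit: network hops; lower is better
    cpu_cores: int
    memory_gb: float
    available: bool
    reliability: float       # assumed scale: 0.0 (never up) to 1.0 (always up)


def score(node: Node, min_memory_gb: float) -> Optional[float]:
    """Return a score for the node, or None if it cannot take the request."""
    if not node.available or node.memory_gb < min_memory_gb:
        return None
    # Assumed weighting: favor reliable, well-provisioned nodes near the data.
    return node.reliability * node.cpu_cores / (1.0 + node.distance_to_data)


def select_node(nodes: List[Node], min_memory_gb: float) -> Optional[Node]:
    """Pick the eligible node with the highest score, if any."""
    scored = [(score(n, min_memory_gb), n) for n in nodes]
    eligible = [(s, n) for s, n in scored if s is not None]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[0])[1]


# Made-up nodes, for illustration only.
EXAMPLE_NODES = [
    Node("near-archive", 1.0, 8, 32.0, True, 0.9),
    Node("remote-cluster", 10.0, 32, 128.0, True, 0.99),
    Node("busy-node", 0.0, 64, 256.0, False, 0.99),
]
```

In this toy weighting, a modest machine close to the data can outrank a larger one far away, which is the trade-off the quoted passage implies. A real scheduler would also have to refresh these characteristics dynamically.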

The characteristic ‘distance from the data’ is key on various levels: many observatories are working from data sent to them from satellites; data has to be transferred to the Taiwan Data Center from observatories for processing and analysis; and the data may be stored some distance away from the processing units. The first two are more of a data transfer issue (to which their response is a less than satisfying “One obvious solution to this problem is to move towards the data location”). The researchers hope to alleviate the third problem by doing computations in-memory and avoiding the inefficiencies of I/O.

“By building a uniform and strongly scheduled pipeline, we can ensure that the requested computations are mainly performed in-memory, as Random Access Memory (RAM) is far more efficient than I/O.”
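As a rough illustration of that idea, the Python sketch below chains processing stages so that each one hands its result directly to the next in memory, with no intermediate files written to disk. The stage names, parameters, and sample values are hypothetical and are not taken from the paper.

```python
from functools import reduce
from typing import Callable, List, Sequence


def calibrate(gain: float) -> Callable[[List[float]], List[float]]:
    """Hypothetical stage: scale raw flux values by a detector gain."""
    return lambda fluxes: [f * gain for f in fluxes]


def subtract_background(level: float) -> Callable[[List[float]], List[float]]:
    """Hypothetical stage: remove a constant background level."""
    return lambda fluxes: [f - level for f in fluxes]


def detect(threshold: float) -> Callable[[List[float]], List[int]]:
    """Hypothetical stage: return indices of values above a threshold."""
    return lambda fluxes: [i for i, f in enumerate(fluxes) if f > threshold]


def run_pipeline(data, stages: Sequence[Callable]):
    """Apply each stage to the previous stage's in-memory result."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


# Made-up sample values, for illustration only.
RAW_FLUXES = [1.0, 5.0, 2.0, 9.0]
STAGES = [calibrate(2.0), subtract_background(3.0), detect(5.0)]
DETECTIONS = run_pipeline(RAW_FLUXES, STAGES)
```

The alternative this avoids is each tool writing its output to a file for the next tool to read back, which is where the I/O cost the researchers describe comes in.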

Creating a standardized environment in which astronomical data can be stored and analyzed is currently one of the main focuses of the team at the Taiwan Extragalactic Astronomical Data Center.
