November 26, 2012

Sky Survey Data Lacks Standardization

Ian Armas Foster

The Sloan Digital Sky Survey (SDSS) is at the forefront of astronomical research, working to pinpoint where we lie on the universal map. To do that, it must aggregate data from observatories around the world, a data-intensive operation.

According to a report written by researchers at UCLA, even though the SDSS is a data-intensive astronomical mapping survey, it has yet to lay down a standardized foundation for retrieving and storing scientific data.

The first two projects were reportedly responsible for observing “a quarter of the sky” and picking out nearly a million galaxies and over 100,000 quasars. The project started at the Apache Point Observatory in New Mexico and has since grown to include 25 observatories across the globe. The SDSS gained recognition in 2009, when the Nobel Prize in physics was awarded for the advancement of optical fibers and digital imaging detectors (CCDs), which allowed the project to grow in scale.

The main question that the Knowledge Infrastructures team at UCLA is asking is, “what data collection methods are astronomical scientists using, given that these astronomers must all constantly be using some form of big data?” To answer the question, the team recorded 14 interviews with 13 researchers who worked with SDSS at some point between May 2011 and February 2012, with each interview lasting from 50 minutes to two hours.

A sample of 13 individuals seems rather small considering that, by the paper’s own admission, over 400 scientists worked on SDSS during the first two iterations from 2000 to 2008 alone. What the report lacks in sample size it makes up for in diversity, with interviewees ranging from undergraduates to post-docs to faculty.

The report reads as a preliminary study, a preview of a much larger study to be released later: its lone results section is labeled “preliminary results.”

If the UCLA team were looking to find some sort of standardized practice, or just get a sense of the practice in general, 13 people spanning multiple levels of influence may be enough. One could certainly write a well-informed book on information culled from what the team mentioned were highly effective and insightful interviews.

The point is that the datasets the scientists used seemed to be scattered. Some researchers obtained them through informal social contacts such as email, while others simply searched for the necessary datasets on Google. Further, once these datasets were found, there was inconsistency in how they were stored before they could be used. However, this may have had to do with the varying sizes of the sets and how quickly the researchers wished to use the data. The entire SDSS dataset consists of over 130 TB, according to the report, and that volume can be unwieldy.

“Large sky surveys, including the SDSS, have significantly shaped research practices in the field of astronomy,” the report concluded. “However, these large data sources have not served to homogenize information retrieval in the field. There is no single, standardized method for discovering, locating, retrieving, and storing astronomy data.”

This should not be altogether too surprising. While the SDSS has been around since 2000, only recently has it been able to employ big data in its efforts to map the cosmos. However, it does remain somewhat remarkable how all over the place the report makes the research seem. This summer, we detailed how the Palomar Transient Factory, an astronomical search collaboration between Caltech and Berkeley, relied on a fairly standardized data infrastructure.

With that being said, one can have all of the infrastructure in the world, but without the necessary datasets, that infrastructure lies useless. After all, it should be noted that the report seems to deal with the collection and storing of these datasets, not their use. The piece referred to above deals almost entirely with their use.
