Astrophysics: The Icing on the Big Data Cake
According to Dr. Kirk Borne, Professor of Astrophysics and Computational Science at George Mason University, astrophysics data challenges exemplify some of the wider problems in big data. Dr. Borne embraces data science with boundless enthusiasm. In a recent podcast, he discussed his astrophysics work and offered a unique perspective on the current state of big data.
Historically, astrophysics has been considered the most data-heavy of disciplines, but over the past decade or so, as we’ve entered the era of information, astrophysics can now be thought of as the icing on the big data cake, according to Dr. Borne. “There was a time in the not too distant past, maybe 15 years ago where astrophysics was ahead of the game in terms of recognizing the growth in data volumes and the need for high-performance computing to deal with those data-analysis challenges. But I think everyone in the world now – in sciences, health and medicine, government and social media, and so on – are recognizing the power and growth of data volumes and opportunities to make discoveries from big data, but the astrophysics discoveries in my mind are really where it’s at – explaining what’s happening in the universe. So that’s the icing on the big data cake,” he says in the podcast, which can be accessed at http://www.hpcwire.com/soundbite/data-avalanche-astrophysics.
Confusion over the term big data stems partly from the fact that the enterprise and commercial realms often employ different definitions. Big data really is different things to different people. Dr. Borne emphasizes that big data is a concept.
“The word ‘big’ implies a lot of things,” he says. “Big volume, big variety, high rates at which the data are coming at you, but big data overall is a concept. For hardware vendors, frequently big data refers to that big data computing computation environment. To the retail people, big data might refer to the social media channel from which they learn how customers are reacting to their products and ads. The conversation becomes: is big data all about the computer? Or is big data all about Twitter and FaceBook?”
For the professor, the focus is on the academic research dimension. “Big data refers to what we are doing nowadays, which is basically measuring and tracking everything, whether it’s things in space, astrophysically, or things in medicine, or whatever dimension you are looking at,” he says. “We are now putting essentially a sensor on everything, measuring everything. The ability to make discoveries from this data stream is what I most appreciate about it.”
In astrophysics, the biggest data-related challenge involves complexity and variety, and more specifically high dimensionality. The professor and his team undertake surveys with many thousands of parameters. When it comes to sheer volume, astronomers are typically working with petabytes of information. Compared to social media and intelligence community data sets, this is not that big. But the complexity challenges are where the difficulty lies.
The astronomical survey work that Dr. Borne is spearheading is both data- and compute-intensive. He gives the example of a thousand-dimensional correlation matrix. In this thousand-by-thousand matrix, there are a million cells that need computing. The level of complexity is a moving target, with the next generation of studies of astronomical objects expanding to tens of thousands or even hundreds of thousands of measurements. Surveys that include time series measurements, that is, repeated measurements, add even more complexity. “If we start looking at a cross correlation matrix of a hundred-thousand by a hundred-thousand, [there are] billions of individual elements that need to be computed,” the professor says. “Discovery from that, which requires looking at all pairwise, and three-wise and four-wise correlations, now we’re talking 10 to the very large power of computations needed.”
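The scale of the counts described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `correlation_workload` is hypothetical, and it counts matrix cells and distinct parameter combinations rather than estimating the actual cost of any survey analysis.

```python
from math import comb

def correlation_workload(n_params, max_order=4):
    """Count matrix cells and distinct k-wise parameter combinations.

    Illustrative helper (not from the interview): `cells` is the size of
    an n-by-n correlation matrix, and `combos[k]` is the number of ways
    to choose k parameters out of n.
    """
    cells = n_params * n_params
    combos = {k: comb(n_params, k) for k in range(2, max_order + 1)}
    return cells, combos

for n in (1_000, 100_000):
    cells, combos = correlation_workload(n)
    print(f"{n:,} parameters: {cells:,} matrix cells")
    for k, count in combos.items():
        print(f"  {k}-wise combinations: {count:.3e}")
```

The number of combinations grows roughly as the k-th power of the parameter count, which is why moving from pairwise to three-wise and four-wise correlations quickly becomes a very large computational problem.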
Dr. Borne was part of the team that did data mining for the Galaxy Zoo project, which he describes as a citizen science project. About a million galaxies were presented to a community of between 700,000 and 800,000 volunteers, who were each shown an image of a galaxy and asked to classify it. Their job was to characterize the galaxies as spiral shaped or elliptical shaped, round or flat, and so on.
After each galaxy has been scrutinized by about 200 people, an interesting pattern develops. There are some cases where there is good consensus on what the galaxy is, and then there are some that have a split opinion, with half the volunteers saying one thing and the other half saying the exact opposite. This discrepancy gave birth to another project. Dr. Borne and his team have been applying machine learning algorithms to the galaxies, with the goal of identifying the difference between a galaxy that everyone can agree on and a galaxy where there is great diversity of opinion.
To do this they’re relying on a technique called latent variable discovery. It works in a similar manner to sentiment analysis on social media, where companies analyze social media feeds to see how people are responding to their products. Typically, the words people use don’t explicitly express their feelings for the product. Comments tend to be more subtle, like “Wow look at this.” Embedded in the words people use, however, is a hidden variable, which is how they actually feel about it. Borne’s team is applying that same concept to galaxy classification. They are trying to discover what variable creates the diverse response. So far, predictive models have only achieved about 5% accuracy, worse than random, according to the professor. What’s noteworthy is how complex a problem this becomes. “We’re trying to do this latent or hidden variable statistical analysis, which basically means looking at all linear and non-linear combinations of existing variables to see if any of those correlate and that again turns into a very large computational problem,” says Dr. Borne.
Among the tools of note, Dr. Borne is a proponent of MapReduce and Hadoop, which distribute work across a large number of processors, each performing very simple computations. “And that’s sort of what we’re doing,” he says. “We’re doing these very simple pair-wise, or three-wise, or four-wise correlations. It’s not rocket science, it’s pretty basic statistics from text books probably decades ago, but doing it on tens of thousands if not tens of millions of nodes in a distributed computing environment is where the real challenge is. But it’s also the place where that discovery is going to be made. So people who are doing traditional high-performance computing, probably is not exactly the right computing paradigm. The real interesting compute paradigm is graph computing, graph processing, where you look at it as a network, you look at it as data nodes linked to other data nodes. This is becoming really popular in social network analysis.”
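The map-then-reduce pattern for pairwise correlations can be sketched in plain Python. This is a single-machine illustration under stated assumptions, not Dr. Borne's pipeline and not Hadoop code: the column names and values in `catalog` are made up, and the built-in `map` stands in for the distributed map step that a real cluster would spread across many nodes.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Textbook Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def map_pair(task):
    """Map step: one small, independent computation per pair of columns."""
    (name_a, col_a), (name_b, col_b) = task
    return (name_a, name_b), pearson(col_a, col_b)

def reduce_strong(results, threshold=0.9):
    """Reduce step: keep only the pairs whose |r| meets the threshold."""
    return {pair: r for pair, r in results if abs(r) >= threshold}

# Hypothetical measurements; names and values are illustrative only.
catalog = {
    "brightness": [1.0, 2.0, 3.0, 4.0, 5.0],
    "size": [2.0, 4.0, 6.0, 8.0, 10.0],
    "redshift": [5.0, 1.0, 4.0, 2.0, 3.0],
}

tasks = combinations(catalog.items(), 2)   # every pair of columns
strong = reduce_strong(map(map_pair, tasks))
print(strong)
```

Because each pair is computed independently of every other pair, the map step is the part that lends itself to being distributed across a large number of processors.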
“I like to think of my galaxies as a social network, that is they share properties in common with other galaxies that share properties in common with other galaxies,” he adds. “We build this network of knowledge and then we try to find the strong links in that network.”
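The idea of galaxies as a network linked by shared properties can be illustrated with a toy example. The galaxy labels, property names, and the `min_shared` threshold below are all hypothetical, and counting shared properties is only one simple way to define link strength; the interview does not say how the team measures it.

```python
from itertools import combinations

# Hypothetical galaxies and properties, for illustration only.
galaxies = {
    "G1": {"spiral", "blue", "star-forming"},
    "G2": {"spiral", "blue", "barred"},
    "G3": {"elliptical", "red"},
    "G4": {"elliptical", "red", "barred"},
}

def build_links(nodes):
    """Link every pair of nodes that share at least one property.

    The link weight is the number of shared properties.
    """
    links = {}
    for (a, props_a), (b, props_b) in combinations(nodes.items(), 2):
        shared = props_a & props_b
        if shared:
            links[(a, b)] = len(shared)
    return links

def strong_links(links, min_shared=2):
    """Return the pairs whose link weight meets the threshold, sorted."""
    return sorted(pair for pair, weight in links.items() if weight >= min_shared)

links = build_links(galaxies)
print(strong_links(links))
```

In this toy network, the two spirals and the two ellipticals end up strongly linked, while the single shared property between G2 and G4 forms only a weak link.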
The professor is also excited about the “happy war” that is developing between R and Python. He believes that this competition will result in greater development and more powerful tools coming out of both of those communities. He says that he leans towards the Python space, but he thinks it’s important that students are learning SAS, and R, and Python, so they can use the right tool for the job.
Aside from being passionate about data science, Dr. Borne has some strong thoughts on the topic of data science education, which he separates into two areas: “data science in education” and “education in data science.” He believes schools should be training people in the discipline of data science to become data scientists, but he also believes that there needs to be a stronger emphasis on “data science in education.”
“Every discipline in the world is now becoming information-driven and information-rich – whether it’s journalism or medicine and even art and music,” he notes. “People are talking about informatics – that is data science applied to music or dance or athletic competition. So every student in every discipline should have some exposure to these tools almost in the same way we require basic math or science for students to graduate no matter what their field is. I firmly believe we should require some kind of data science or informatics course for every student no matter what their field is.”
While Dr. Borne is foremost an astrophysicist, his interests are wide and far-ranging. He’s currently working on a project with the National Institutes of Health National Library of Medicine, which has loaded its entire archive of medical publications into a graph database. Now it is possible to query the metadata – titles of the papers, the authors, the citations, the keywords, etc. – to identify connections between and among papers.
Having a linked database of medical knowledge makes it possible to do all kinds of research projects. One student of Dr. Borne’s is looking for anomalies and finding interesting connections between papers that you wouldn’t otherwise expect. There are also examples of other discoveries, even cancer treatments, achieved through database mining.
“This whole idea of linked data and graph processing and the graph computing model is quite fascinating for me now, and taking me a bit outside of astronomy, but eventually I want to bring it back to astronomy,” says the professor.