The Week in Big Data Research
This week’s big data research and development stories cover a wide area, as researchers from Europe, China, and the United States figure out how to use big data to solve biomedical problems on this planet along with finding and storing data on other planets. Visualization and archiving big data in large databases also got their share of attention this week.
Without further ado, here begins 2013’s first Week in Big Data Research.
Big Planetary Census Data
Fifteen years ago, the salient features of the known extrasolar planets could be written down on an index card. At present the catalog of extrasolar planets numbers in the thousands, and the rate of detection is increasing rapidly.
Highly diverse planets are being identified through a diverse set of observational techniques; photometric transit detection, Doppler radial velocimetry, gravitational microlensing, and direct detection via adaptive optics imaging are all producing discoveries at an increasing rate.
In a recent talk at the Intelligent Data Understanding Conference, Greg Laughlin presented an overview of the census as currently understood, then showed how the different detection methods are producing complementary detections.
Like many areas in astronomy, exoplanetary detection is facing issues related to “big data.” Large online repositories (such as the one produced by the Kepler Mission) serve many terabytes of data, much of which has gone unanalyzed because of the time-consuming algorithms required. Laughlin’s talk sought to highlight the current issues, and showed how ad-hoc collaborations across the community are being formed to deal with the challenges (and the excitement) of this fast-moving area.
A Language for Big Data Visualization
Researchers from the University of California at Berkeley believe that increases in data availability are among the forces behind some recent innovations. However, according to the Berkeley team, visualization technology for exploring data is not keeping up.
They argue that designers may be forced to choose between scale and interactivity. The designers would prefer big displays because of the ability to show an entire data set. However, the Berkeley researchers note that viewers would prefer interactivity.
Performance constraints limit interactions to operating on a small data slice. The Berkeley team presented SUPERCONDUCTOR: a high-level visualization language for interacting with large data sets. It has three design goals: scale, interactivity, and productivity.
Through their presentation, the team showed how high-level programming abstractions support automatic parallelization. They examined three cases: selectors, layout, and rendering. In the case of layout, declarative constructs can further guide parallelization. Together, these ideas enabled their goal of high-level programming of big, interactive visualizations.
Visualizing Semantic Web Data Landscapes
A European team of researchers from Ireland, Cambridge, Maastricht University in the Netherlands, and the University of Bonn in Germany argues that the key to successfully applying Semantic Web technologies (SWT) to Life Sciences research is the availability of tools that lower the barrier to adoption for biomedical researchers.
Researchers need to easily and intuitively exploit and query the wealth of data that is available behind SPARQL endpoints. Thus, the researchers present SemScape, a semantic-web enabled plugin for the popular network biology software Cytoscape.
SemScape can be used to query any knowledge base with a SPARQL endpoint. It leverages users’ familiarity with existing software and makes big data exploitation more intuitive through a mechanism that encapsulates the complexity of the data in parametric, context-dependent queries. The team believes SemScape can provide a valuable resource for both data consumers and data publishers.
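To make the idea of a parametric, context-dependent query concrete, here is a minimal sketch in Python. It is not SemScape's implementation: the template, the `build_query` helper, and the example.org URI are all hypothetical, chosen only to show how a fixed SPARQL pattern can be filled in from whatever node a user has selected.

```python
from string import Template

# Hypothetical query template: $node is filled in from the user's current
# selection, so the user never has to write SPARQL by hand.
NEIGHBOR_QUERY = Template("""\
SELECT ?predicate ?object
WHERE {
  <$node> ?predicate ?object .
}
LIMIT $limit
""")


def build_query(node_uri: str, limit: int = 25) -> str:
    """Fill the template for one selected node (illustrative only)."""
    # Reject characters that may not appear inside a SPARQL IRI reference.
    if any(ch in node_uri for ch in '<>" {}|\\^`'):
        raise ValueError("node_uri contains characters not allowed in an IRI reference")
    return NEIGHBOR_QUERY.substitute(node=node_uri, limit=int(limit))


query = build_query("http://example.org/gene/BRCA1")
print(query)
```

The resulting string could then be sent to any SPARQL endpoint; the point is that the complexity lives in the template, while the context (the selected node) supplies the parameters.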
Archived Stream System for Big Data
A team of researchers from the National University of Defense Technology in China found that a growing number of large-data applications, such as Web search engines, need highly available, full-time tracking, storage, and analysis of large numbers of real-time user access logs.
The team argues that traditional solutions for common trading applications are not always efficient enough to archive streams that arrive at such a high rate. They presented an integrated approach that saves these archived data streams in a database cluster and supports rapid recovery.
This method is based on a simple replication protocol along with a high performance data loading and query strategy. Experimental results show that their approach loads data and answers queries efficiently, and achieves shorter recovery times than traditional database cluster recovery methods.
Improving Big Data Availability in Massive Databases
The same team from the National University of Defense Technology in China also published research claiming that, because of the huge scale and the number of components involved, big data is difficult to work with using relational databases, desktop statistics, and visualization packages.
Database replication technology is widely used to increase mean time to failure (MTTF), but few of these techniques target large database systems. The team argues that traditional backup methods are not feasible at this scale, and that reducing mean time to recovery (MTTR) carries high manpower costs.
On the basis of analyzing the characteristics of data in large databases, they propose a new method called the Detaching Read-Only (DRO) mechanism. It reduces MTTR by reducing the amount of data that physically changes in each database, separating data at node-size granularity.
According to the research, analysis and experimental results show that their method can reduce MTTR by an order of magnitude. Further, there are no additional hardware costs, and the method reportedly reduces the high manpower costs as well.
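For a sense of why MTTR matters, the standard steady-state availability formula is MTTF / (MTTF + MTTR). The short Python sketch below applies it to made-up numbers; the figures are illustrative assumptions and are not taken from the paper.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative numbers only; they are not taken from the paper.
mttf = 1000.0
before = availability(mttf, mttr_hours=10.0)  # baseline recovery time
after = availability(mttf, mttr_hours=1.0)    # MTTR cut by an order of magnitude

print(f"before: {before:.4%}, after: {after:.4%}")
print(f"downtime shrinks by a factor of {(1 - before) / (1 - after):.1f}")
```

With MTTF held fixed, cutting MTTR tenfold cuts expected downtime by roughly the same factor, which is why a recovery-side improvement can pay off without any change to how often failures occur.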