The Week in Big Data Research
Welcome to The Week in Research, covering what’s been on the academic and scientific big data radar for the early part of November.
Unlike last week’s research news brief, the news from the MapReduce and Hadoop side is slim. Instead, the most interesting items we were able to discover deal with approaches to meshing, integrating and drilling down through big data to achieve specific organizational, functional and visual goals to help make sense of it.
Without further delay, let’s dive in with another NoSQL approach, albeit one that is targeted at a particular type of data and use case.
Punting a New NoSQL Approach
A group of researchers at Tsinghua National Laboratory for Information Science and Technology in Beijing has been tackling the problems created by the move to digitize large volumes of library data. While this might sound like a “simple” problem when one thinks of libraries as mere text and images, the group contends that the challenges are significant and require a novel approach.
To counter the issues, the Chinese researchers spent considerable effort creating a NoSQL database, which they call PuntStore, to improve and optimize the way traditional digital resource management systems handle metadata management, digital content storage and label management.
The team argues that as we move headlong into the era of big data, the standard architecture of digital resources needs to keep pace with massive, complex, heterogeneous and continuously changing data. More specifically, many traditional materials have been moved into digital library forms, which has created problems around everything from long-term storage and preservation to, of course, data management.
The researchers describe PuntStore in detail and put it in real-world context as they describe how their system for working through these problems has been deployed successfully at the Chinese Science and Technology History library, where it solved the issue of managing heterogeneous and complex metadata. The group’s test results show that PuntStore could be an effective solution for similar application scenarios.
New Platform to Share Sensor Access
According to researchers from the Computer Science and Engineering department at Aalto University in Finland, complex event processing has a large number of uses, but it is dominated by proprietary systems and vertical products rather than open technologies.
They claim that as data grows through the Internet of Things and nearly everything we carry gains a “smart” component, there will be an even larger number of events that need to be processed in a multi-actor, multi-platform environment.
The group says that end-user applications could benefit from open access to all relevant sensors and data sources. They note that current work on semantic sensor networks already relies on open technologies for harvesting and integrating this data, but they are looking for ways applications can more effectively access a shared set of sensors while avoiding redundant data acquisition, which would hurt energy efficiency.
To solve these problems they propose a novel event processing platform based on the Rete algorithm, which they say offers continuous execution of interconnected SPARQL queries and update rules. The platform, called INSTANS, along with Rete, enables sharing of sensor access and caching of intermediate results in a natural and high-performance manner. The group says that with incremental query evaluation, standards-based SPARQL and RDF can handle complex event processing tasks that work with the shared sensor access goals they seek to achieve.
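For readers unfamiliar with incremental query evaluation, the idea can be illustrated with a deliberately tiny sketch. This is not INSTANS code and it does not use SPARQL; the class name, the predicates and the sensor identifiers below are all hypothetical. It only shows the general Rete-style principle of caching partial matches so that each new fact triggers a small amount of work instead of a full re-query.

```python
from collections import defaultdict


class IncrementalJoin:
    """Toy incremental join of two triple patterns on a shared subject.

    Pattern A: (?s, "type", "TemperatureSensor")
    Pattern B: (?s, "reading", ?value)
    The names and patterns are illustrative assumptions, not INSTANS internals.
    """

    def __init__(self):
        self.sensors = set()               # cached matches for pattern A
        self.readings = defaultdict(list)  # cached matches for pattern B
        self.results = []                  # every (sensor, value) pair derived so far

    def add_triple(self, s, p, o):
        """Process one new triple and return only the newly derived results."""
        new = []
        if p == "type" and o == "TemperatureSensor":
            if s not in self.sensors:
                self.sensors.add(s)
                # Join the new sensor against readings that were cached earlier.
                new = [(s, value) for value in self.readings[s]]
        elif p == "reading":
            self.readings[s].append(o)
            if s in self.sensors:
                new = [(s, o)]
        self.results.extend(new)
        return new


engine = IncrementalJoin()
first = engine.add_triple("s1", "reading", 21.5)             # s1 is not yet a known sensor
second = engine.add_triple("s1", "type", "TemperatureSensor")  # cached reading now joins
third = engine.add_triple("s1", "reading", 22.0)             # joins immediately
```

Because the cached matches live in one place, several queries interested in the same sensor could in principle read from the same memory rather than each polling the sensor, which is the kind of sharing the researchers describe.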
When Big Data Means Lost Data
David Maier and V.M. Megler from Portland State’s Department of Computer Science tackle another side of the big data challenge—this time looking at the issue from a meta-management perspective.
They note that in the past, scientists’ biggest concern was that they lacked enough data to carry out their work, but now the tables have turned. It’s not just that there’s too much to store or handle. Narrowing the data down to what’s important can be so cumbersome that, if it can’t be accessed, it might as well not have been collected at all.
The team used an existing scientific archive to test a possible solution to this problem by adapting information retrieval techniques that were developed for combing through scientific data in text format. Their approach uses a blend of automated and “semi-curated” methods to extract metadata from large archives of scientific data. They then search across these archives’ extracted metadata and have results returned that are ranked by similarity to the query terms.
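To make the ranked-retrieval idea concrete, here is a minimal sketch of searching extracted metadata and ordering the results by similarity to the query. The catalog entries, dataset names and the choice of cosine similarity over term counts are all assumptions made for illustration; the article does not say which similarity measure the Portland State team actually uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a metadata string and split it into terms."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, catalog):
    """Return dataset names with a non-zero score, best match first."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(meta))), name) for name, meta in catalog.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


# Hypothetical metadata extracted from three datasets in an archive.
catalog = {
    "cruise_2011_a": "salinity temperature depth estuary",
    "station_07": "temperature wind speed",
    "glider_03": "chlorophyll oxygen depth",
}

ranked = rank("salinity temperature", catalog)
```

Here the dataset that mentions both query terms ranks ahead of the one that mentions only temperature, and the dataset with no overlap is dropped.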
The team puts this in the context of their work at an ocean observatory where they examined the effectiveness of the approach as well as the performance and scalability angles to see how continuous growth of those archives would affect their goals, with positive results.
Visualizing Networks for Social Science Research
A team of researchers from the University of Southampton in the U.K. have presented an approach to harvesting and visualization of massive volumes of data to render it usable for social science research. To put their research in context, the group focused on data from Twitter to demonstrate their methodological approach.
The goal was to provide a new software tool that visualizes the social networks that emerge within a larger network, organized by category. The group says that for what they’ve attempted to demonstrate, Twitter is an ideal test as it is a dynamic social network that offers immediately visible traces of apparently spontaneous social interactions and relationships. The “hashtags” of Twitter allow them to better understand the networks and social interactions that play out.
Using data from Twitter, they have enabled a timeline of communications around a specific hashtag to be visualized based on retweets, thus allowing them to identify influential tweets and tweeters within this micro-network over time to see how networks form. They can then interact with this data, pausing at certain moments in time and zooming in to view more information and understand individual and group roles.
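As a rough illustration of what sits behind such a timeline, the sketch below groups retweets into fixed time windows and counts how often each author is retweeted in each window. The records, user names and the five-minute window are invented for the example, and the Southampton tool itself is not described in enough detail here to say it works this way.

```python
from collections import Counter

# Hypothetical records for one hashtag: (minute offset, retweeting user, original author)
retweets = [
    (1, "bob", "alice"),
    (2, "carol", "alice"),
    (4, "dave", "bob"),
    (7, "erin", "alice"),
    (8, "frank", "bob"),
    (9, "grace", "bob"),
]


def influence_timeline(records, window):
    """Return {window_start: Counter(author -> retweets received in that window)}."""
    timeline = {}
    for minute, _retweeter, author in records:
        start = (minute // window) * window
        timeline.setdefault(start, Counter())[author] += 1
    return timeline


timeline = influence_timeline(retweets, window=5)

# The most retweeted author in each window, in chronological order.
top_per_window = {
    start: counts.most_common(1)[0][0] for start, counts in sorted(timeline.items())
}
```

In this made-up data the most retweeted author shifts from one user to another between the first and second window, which is the sort of change a paused or zoomed timeline view would let a researcher inspect.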
“This is not only providing a detailed understanding of the communications between actors, but exploring the pathways and flow of information pushes the methodological approach to understanding big data in terms of its dynamic nature, something that helps explain the translation and process of a network.”
The Emergence of Discovery Informatics
Yolanda Gil from the University of Southern California and Haym Hirsh from Rutgers University are seeking novel ways to integrate further artificial intelligence technologies into the next generation of scientific research.
The duo recently described the concept of “Discovery Informatics” which they say is an emerging area of research focused on computing advances that target scientific discovery processes requiring knowledge assimilation and reasoning, and applying principles of intelligent computing and information systems to understand, automate, improve and innovate any aspects of those processes.
The authors discuss the potential of Discovery Informatics as it relates to science, dealing with both the big data angle and the “long tail” of science. To highlight their points, the team focuses on two areas of research for information and intelligence systems: workflows of scientific processes and citizen science, which they say are two of the best application areas for intelligent systems to support scientific discovery processes.
While this is more of a theory-based article, it’s nonetheless interesting from a research perspective as it represents the new ways of thinking that are emerging in the wake of the ever-growing influx of big, complex data.