The Week in Research
This week’s selection of research items of interest in the data-intensive computing ecosystem includes new ways of visualizing textual data, lifelong machine learning, an interesting approach to the creation of social graphs and a look at the future for RDF and SPARQL in big data environments.
In case you missed it, here is last week’s edition of research briefs. Let’s dive in with our first item:
Visualizing Streaming Text Data
A team from AT&T Labs has focused on the endless text-based streams that make it increasingly challenging to analyze data and discover relevant information. To address these challenges, they put forth an approach for visualizing text streams in real time, presented as a dynamic graph with an associated map.
The group says that this approach automatically groups similar messages into “countries” with keyword summaries, using semantic analysis, graph clustering and map generation techniques. They say it handles the need for visual stability across time by dynamic graph layout and Procrustes projection techniques, enhanced with a “novel stable component packing algorithm.”
The result, they say, offers an ongoing, accurate view of evolving topics of interest. They put this in context using an online service called TwitterScope.
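To give a feel for the Procrustes step mentioned above, here is a minimal sketch in Python, assuming 2D layouts in which the same nodes appear in the same order in both frames. It finds the rotation and translation that best align a freshly computed layout with the previous one, so the map does not appear to spin between updates. The function names and sample points are hypothetical, and the team's actual method (including its component packing algorithm) is not reproduced here.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def procrustes_align(previous, current):
    """Rotate and translate `current` to best match `previous` (least squares).

    Both arguments are lists of (x, y) tuples with nodes in the same order.
    Scaling and reflection are deliberately left out of this sketch.
    """
    pcx, pcy = centroid(previous)
    ccx, ccy = centroid(current)
    prev_c = [(x - pcx, y - pcy) for x, y in previous]
    curr_c = [(x - ccx, y - ccy) for x, y in current]

    # Closed-form optimal rotation angle for the 2D case.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(prev_c, curr_c))
    cross = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(prev_c, curr_c))
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # Rotate the centered layout, then move it onto the previous centroid.
    return [
        (bx * cos_t - by * sin_t + pcx, bx * sin_t + by * cos_t + pcy)
        for bx, by in curr_c
    ]

# Hypothetical example: the new layout is the old one rotated 90 degrees and shifted.
previous_layout = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
new_layout = [(5.0, 5.0), (5.0, 6.0), (3.0, 5.0)]
aligned = procrustes_align(previous_layout, new_layout)
```

In this example the aligned layout lands back on the previous positions, which is the kind of frame-to-frame stability the researchers describe wanting.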
Lifelong Machine Learning
According to Qiang Yang from Huawei Technologies, the flood of new data types requires a more robust data-mining system that can keep pace with changing data in a continual manner.
Qiang discusses how this creates a need for lifelong machine learning, which, in contrast to traditional one-shot learning, should be able to identify the learning tasks at hand and adapt to the learning problems in a sustainable manner.
More specifically, a foundation for lifelong machine learning is transfer learning, whereby knowledge gained in a related but different domain may be transferred to benefit learning for a current task. To make transfer learning effective, he argues that it is important to maintain a continual and sustainable channel, over the lifetime of a user, through which the data are annotated.
Qiang outlines lifelong machine learning situations, gives several examples of transfer learning and applications for lifelong machine learning, and discusses cases of successful extraction of data annotations to meet the big data challenge.
A Scalable Social Graph Generator
According to a research team from European organizations CWI and OpenLink Software, benchmarking graph-oriented database workloads and graph-oriented database systems is becoming increasingly relevant to analytical big data tasks, such as social network analysis.
They argue that with graph data, structure is not mainly found inside the nodes, but especially in the way nodes happen to be connected, i.e. structural correlations. Because such structural correlations determine join fan-outs experienced by graph analysis algorithms and graph query executors, they are an essential, yet typically neglected, ingredient of synthetic graph generators.
To address this, the team presents S3G2: a Scalable Structure-correlated Social Graph Generator. This graph generator creates a synthetic social graph, containing non-uniform value distributions and structural correlations, which is intended as test data for scalable graph analysis algorithms and graph database systems. They generalize the problem by decomposing correlated graph generation into multiple passes that each focus on one so-called correlation dimension; each pass can be mapped to a MapReduce task.
The team demonstrates that S3G2 can generate social graphs that (i) share well-known graph connectivity characteristics typically found in real social graphs, (ii) contain certain plausible structural correlations that influence the performance of graph analysis algorithms and queries, and (iii) can be quickly generated at huge sizes on common cluster hardware.
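The idea of one pass per correlation dimension can be illustrated with a toy Python sketch. This is not the S3G2 implementation: the attributes, the sliding window, and the decaying edge probability are all assumptions chosen for illustration. Each pass sorts people by one attribute so that similar people end up adjacent, then links neighbors within a small window; sorting by a key followed by a local scan is the kind of work that fits a MapReduce-style job.

```python
import random

# Hypothetical toy population; attribute names and values are illustrative only.
PEOPLE = [
    {"id": i, "university": f"U{i % 3}", "interest": f"topic{i % 4}"}
    for i in range(12)
]

def correlation_pass(people, key, window=3, base_prob=0.8, decay=0.5, seed=0):
    """One pass: sort by a correlation dimension, then link nearby nodes.

    The window size and the decaying probability are assumptions of this sketch.
    """
    rng = random.Random(seed)
    ordered = sorted(people, key=key)
    edges = set()
    for i, person in enumerate(ordered):
        for offset in range(1, window + 1):
            if i + offset >= len(ordered):
                break
            # Closer neighbors in the sorted order are more likely to connect.
            if rng.random() < base_prob * decay ** (offset - 1):
                a, b = person["id"], ordered[i + offset]["id"]
                edges.add((min(a, b), max(a, b)))
    return edges

def generate_graph(people):
    """Union of the edges produced by two independent correlation passes."""
    by_university = correlation_pass(people, key=lambda p: p["university"], seed=1)
    by_interest = correlation_pass(people, key=lambda p: p["interest"], seed=2)
    return by_university | by_interest

graph = generate_graph(PEOPLE)
```

Because each pass only looks at neighbors in its own sort order, people who share a university or an interest are linked far more often than random pairs, which is the sort of structural correlation the article describes.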
What RDF and SPARQL Bring to Big Data
According to Bob DuCharme, a solution architect from Virginia-based TopQuadrant, there is still a solid future ahead for RDF.
He argues that the Resource Description Framework (RDF), a W3C standard since 1999 that describes a data model capable of representing most known structured and semi-structured data formats, has innate simplicity and flexibility.
He notes that the accompanying standards, such as the SPARQL query language and an optional schema language, also provide a great infrastructure for addressing many of the issues that make big data different from traditional relational database management.
DuCharme says that because of these features, both open-source efforts and offerings from commercial vendors such as IBM, Oracle, and Cray have found that RDF technology offers an excellent platform for taking an agile approach with large, dynamic aggregations of data that won’t fit neatly into predefined tables.
Because RDF technology is all built from public standards, offerings from more specialized vendors, such as triplestores from AllegroGraph and Stardog and the TopBraid application platform from TopQuadrant, can mix and match with Cray, IBM, and Oracle’s offerings as well as with open-source tools. The resulting applications can start small and provide a basis for incremental growth up to trillions of triples.
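For readers new to the data model, the flexibility DuCharme describes can be sketched in a few lines of plain Python. This is an illustration only, not a real triplestore or SPARQL engine: every fact is a (subject, predicate, object) triple, and a query is a set of triple patterns joined on shared variables. All identifiers below (the `ex:` names, `match`, `people_in`) are hypothetical.

```python
# Every fact is one (subject, predicate, object) triple; no table schema is declared.
triples = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:age", 42),
    ("ex:acme", "ex:locatedIn", "ex:virginia"),
}

def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def people_in(store, place):
    """Join two triple patterns, roughly like the SPARQL query:

    SELECT ?person WHERE {
      ?person ex:worksFor ?org .
      ?org ex:locatedIn ex:virginia .
    }
    """
    orgs = {s for s, _, _ in match(store, p="ex:locatedIn", o=place)}
    return sorted(s for s, _, o in match(store, p="ex:worksFor") if o in orgs)

# A new kind of fact is just another triple; nothing has to be altered first.
triples.add(("ex:bob", "ex:twitter", "@bob"))

virginia_staff = people_in(triples, "ex:virginia")
```

Here `virginia_staff` comes out as `["ex:alice", "ex:bob"]`, and the extra fact about Bob was added without changing any predefined structure, which is the property that makes the model attractive for data that will not fit neatly into tables.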