December 1, 2012

The Week in Big Data Research

Datanami Staff

This week in the world of data-intensive research and development we deliver two stories from IBM's Software Lab in Canada that pull out all the stops on its own Hadoop flavor and tackle the issues of data diversity in real-world settings. We also show off the teeth on a Shark that could provide some chum for Hadoop and MapReduce buffs, among other items.

Also, in case you missed last week’s overview, we covered Spatio-Temporal Data, Hadoop, Graph Partitioning, Visual Exploration, Semantics and more…

Let’s kick off the week in Canada with…

Big Data Crunching on IBM’s Flavor of Hadoop

A team from the IBM Canada Software Lab put its own Hadoop distro, BigInsights, through its paces to prove its viability on complex datasets that require massive-scale computation at speed.

The IBM distro integrates an IBM-created open source query language called JAQL (pronounced “jackal”) with the usual components such as Hive, HBase, and Pig. JAQL lets the user query large sets of data in JSON (JavaScript Object Notation) form, which is JAQL’s native data model.
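To give a feel for the kind of filter-and-transform pipeline a JSON query language enables, here is a rough sketch in plain Python (not JAQL syntax; the sample records are invented for illustration):

```python
import json

# Hypothetical newline-delimited JSON records of the sort a JAQL
# pipeline might process.
records = [
    json.loads(line)
    for line in [
        '{"user": "alice", "clicks": 42}',
        '{"user": "bob", "clicks": 7}',
        '{"user": "carol", "clicks": 19}',
    ]
]

# Keep users with more than 10 clicks and project two fields --
# analogous to the filter/transform stages of a query pipeline.
active = [
    {"user": r["user"], "clicks": r["clicks"]}
    for r in records
    if r["clicks"] > 10
]

print([r["user"] for r in active])
```

In JAQL proper, an equivalent pipeline would be expressed declaratively and compiled down to MapReduce jobs over HDFS rather than run in a single process.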

The team describes the Java libraries at the core of Hadoop, which are enhanced by the Pig high-level language, the HBase database, the Hive data warehouse system, and the Flume log aggregation service. They show how each of these makes Hadoop more capable of handling larger volumes of data, greater varieties of data, and higher velocities of data.

While more of a vendor-based research piece, this article provides a useful overview of Hadoop in general and the differences that are inherent to the BigInsights offering. Worth a read for those looking beyond the “big three” of Hadoop vendors in enterprise settings (Cloudera, Hortonworks and MapR).

NEXT — Shark Attack on SQL and Analytics >


Shark Attack on SQL and Analytics

A research team from the AMPLab in the EECS department at UC Berkeley has described a new data analysis system that they say marries query processing with complex analytics on large clusters.

The approach, which they call Shark, uses a “novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (including iterative machine learning) at scale, and efficiently recovers from failures mid-query.”

The goal of Shark, among other things, is to run SQL queries that the team claims can be up to 100x faster than Apache Hive, and machine learning programs up to 100x faster than Hadoop. They say that unlike previous systems aimed at the same end goal, “Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine and the fine-grained fault tolerance properties that such engines provide.”

In the end, they claim the result is a system that can match the speedups MPP analytical databases report over MapReduce, while retaining a fine-grained fault tolerance approach that other systems may lack.

NEXT — Enterprises, Meet DIANE >


Enterprises, Meet DIANE

It’s been a busy week at the IBM Canada Software Lab. A second item this week rolls out of those same quarters, this time with an emphasis on the problems large enterprises face as they contend with data of different types coming from diverse, siloed data sources that are critical to the business.

The duo behind the research (Joanna Ng and Diana Lau) note that although applying informatics technologies to these data can bring dramatic benefits to the business, technology adoption remains low. They focused on two areas most in need of solutions to these problems, namely personalized medicine and IBM’s own customer support division. They claim these made ideal choices because both domains have large datasets of different sources and types kept in silos, which could be made available for the studies.

To solve some of the complex data-related challenges, the pair proposed what they call Domain Informatics Analytics for Enterprise (DIANE). DIANE is an informatics platform whose informatics services are accessible through unified, simple-to-use, always-on, and device-adaptive user interfaces. It is agnostic to the diversity of data types and data source locations, with queries expressible in natural language through pre-defined, reserved vocabularies. Queries are likewise agnostic to informatics types (such as analytics, information retrieval, semantic reasoning, or data mining).

They state that “By leveraging a data acquisition framework, subject matter experts (SME) can systematically and prescriptively plug in trustworthy and relevant data of all types and sources for the domain of concern. Such platform is enabled with visualizations and interactions of query results with high precision of cognitive performance. The two studies also evaluated the consumability of building such domain oriented informatics platform by IT professionals and consuming informatics services from such informatics platforms by business users for regular enterprise operations.”

NEXT — A New Role for Social Science Researchers >


A New Role for Social Science Researchers

Ulf-Dietrich Reips argues in the International Journal of Internet Science that one of the more interesting streams of research on big data is aimed at analyzing online networks.

Reips says that many online networks are known to have some typical macro-characteristics, such as ‘small world’ properties, although he claims that much less is known about underlying micro-processes leading to these properties.

In his piece, he describes how the models used by big data researchers are usually inspired by mathematical ease of exposition. He and his co-author propose to follow a different strategy, one that leads to knowledge about micro-processes that match actual online behavior. This knowledge can then be used to select mathematically tractable models of online network formation and evolution. Insight from social and behavioral research is needed to pursue this strategy of knowledge generation about micro-processes, providing a way for us to think about new roles social scientists could play in big data research.

NEXT — Dealing with Big Informetrics Data >


Dealing with Big Informetrics Data

Information scientist Ronald Rousseau believes that big data offer a huge challenge for the field of informetrics. He recently argued that the very existence of big data leads to a contradiction: the more data we have, the less accessible they become, since the particular piece of information one is searching for may be buried among terabytes of other data.

In his article, he and his team discuss the origin of big data and point to three challenges that arise with it: data storage, data processing, and generating insights. Computer-related challenges can be framed by the CAP theorem, which states that a distributed application can simultaneously provide only two of the following three properties: consistency (C), availability (A), and partition tolerance (P).
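The CAP trade-off can be sketched with a toy, single-process model (hypothetical code, not from the article): when a replica is cut off from its peers, it must either refuse requests (staying consistent but unavailable) or answer from possibly stale local state (staying available but possibly inconsistent):

```python
class Replica:
    """Toy replica illustrating the CAP trade-off during a partition.

    mode="CP": refuse reads when cut off from peers (consistent but
    not available); mode="AP": serve possibly stale local data
    (available but not necessarily consistent).
    """

    def __init__(self, mode):
        self.mode = mode
        self.data = {}
        self.partitioned = False

    def write(self, key, value):
        # Replication to peers is omitted in this toy model.
        self.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.data.get(key)  # may be stale under "AP"


cp, ap = Replica("CP"), Replica("AP")
for r in (cp, ap):
    r.write("x", 1)
    r.partitioned = True  # network partition occurs

print(ap.read("x"))       # AP replica answers, possibly with stale data
try:
    cp.read("x")
except RuntimeError as e:
    print(e)              # CP replica refuses while partitioned
```

Real systems sit along this spectrum rather than at its poles, but the toy captures why, under partition, a designer must sacrifice either consistency or availability.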

As an aside, the team mentions Amdahl’s law and its application to scientific collaboration, and they delve further into data mining in large databases and knowledge representation for handling the results of data mining exercises. This is all put into context via a short informetric study of the field of big data, where they find that serious problems remain to be overcome before the field can deliver on its promises.
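Amdahl’s law bounds the speedup of a workload when only a fraction p of it parallelizes: speedup(n) = 1 / ((1 − p) + p/n) on n workers, so the ceiling is 1/(1 − p) no matter how many machines are added. A minimal calculation (illustrative numbers, not drawn from the article):

```python
def amdahl_speedup(p, n):
    """Speedup on n workers when fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallelizable work, speedup is capped at 1/(1-p) = 20x.
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

This ceiling is one reason the distributed-processing challenges the article raises do not dissolve simply by throwing more nodes at a problem.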
