The Week in Big Data Research
This week’s research brief brings news of cutting-edge work at global centers as teams try to keep refining Hadoop’s security, reliability and functionality—both on premise and in cloud environments. We also take a look at some interesting uses of MapReduce with projects that address image processing requirements—not to mention data integrity.
In case you missed it, here is last week’s edition of the Week in Research… Without further delay, let’s launch in with our top item this week:
A Leap Toward Hadoop Fault Tolerance
A team of researchers from Osaka University has proposed an approach to a more fault-tolerant Hadoop built around an auto-recovering JobTracker.
The team notes that while Hadoop provides a decent level of reliability, the job scheduler, called the JobTracker, remains a single point of failure for many systems. Specifically, if the JobTracker fail-stops during a job execution, the job is cancelled immediately and all of the intermediate results are lost.
To counter this, the team points to its auto-recovery system, which handles a fail-stop without requiring any additional hardware. The recovery mechanism uses checkpointing: a snapshot of the JobTracker is stored on a distributed file system at regular intervals. When the system detects a fail-stop by means of a timeout, it automatically recovers the JobTracker from the latest snapshot.
According to the research team, the key feature here is this “transparent recovery such that a job execution continues during a temporary fail-stop of the JobTracker and completes itself with a little rollback. The system achieves fault tolerance for the JobTracker with overheads less than 43% of the total execution time.”
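To make the checkpoint-and-timeout idea concrete, here is a minimal Python sketch. It is not the Osaka team's implementation and does not use any Hadoop API: `ToyJobTracker`, `is_fail_stopped` and `run_with_recovery` are hypothetical names, a local file stands in for the distributed file system, and a logical clock stands in for real heartbeats.

```python
import json
import os
import tempfile


class ToyJobTracker:
    """Hypothetical stand-in for a job scheduler; not Hadoop's JobTracker."""

    def __init__(self, tasks):
        self.pending = list(tasks)
        self.done = []

    def run_one(self):
        # "Execute" the next task by moving it from pending to done.
        self.done.append(self.pending.pop(0))

    def snapshot(self, path):
        # Write to a temp file and rename, so a crash mid-write
        # never leaves a half-written snapshot behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"pending": self.pending, "done": self.done}, f)
        os.replace(tmp, path)

    @classmethod
    def recover(cls, path):
        with open(path) as f:
            state = json.load(f)
        tracker = cls(state["pending"])
        tracker.done = state["done"]
        return tracker


def is_fail_stopped(last_heartbeat, now, timeout):
    """Timeout-based detection: no heartbeat for more than `timeout` ticks."""
    return now - last_heartbeat > timeout


def run_with_recovery(tasks, checkpoint_every, crash_after, timeout, path):
    """Run all tasks, simulating one fail-stop after `crash_after` executions."""
    tracker = ToyJobTracker(tasks)
    tracker.snapshot(path)
    executions = 0
    crashed_once = False
    now = 0
    last_heartbeat = 0
    while tracker is None or tracker.pending:
        now += 1  # logical clock tick
        if tracker is None:
            # Monitor side: wait for the timeout, then restore the snapshot.
            if is_fail_stopped(last_heartbeat, now, timeout):
                tracker = ToyJobTracker.recover(path)
                last_heartbeat = now
            continue
        if not crashed_once and executions == crash_after:
            crashed_once = True
            tracker = None  # simulated fail-stop: in-memory state is lost
            continue
        tracker.run_one()
        executions += 1
        last_heartbeat = now
        if executions % checkpoint_every == 0:
            tracker.snapshot(path)
    return tracker.done, executions
```

In this sketch the job still completes after the simulated failure, and the only cost is re-running whatever tasks finished after the last snapshot, which is the "little rollback" trade-off: more frequent checkpoints mean less redone work but more snapshot writes.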
A Reliable BigTable for the Public Cloud
A team from North Carolina State University is taking aim at data integrity for BigTable, the distributed storage system for big data, in the context of public cloud environments.
The researchers note that rolling out BigTable on public clouds is an attractive option from a cost savings point of view, especially for small businesses or smaller research groups with big data problems. However, these users face several issues when considering public clouds, whether security concerns or worries over the integrity of data held by a storage system running in the cloud.
What they’ve proposed to counter these concerns is called “iBigTable” which is an “enhancement of BigTable that provides scalable data integrity assurance.” They have considered the practicality of different authenticated data structures around BigTable and have designed a set of security protocols to “efficiently and flexibly verify the integrity of data returned by BigTable so that existing applications over BigTable can interact with iBigTable seamlessly with minimum or no change of code.”
To prove their model, the team implemented a prototype of iBigTable based on HBase, which is itself an open source implementation of BigTable. They were able to demonstrate that iBigTable can offer “reasonable performance overhead while providing integrity assurance.”
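For readers unfamiliar with authenticated data structures, the following Python sketch shows one common example, a Merkle hash tree. The brief does not say which structure iBigTable uses, so this choice is an assumption made purely for illustration, and the function names (`build_levels`, `prove`, `verify`) are hypothetical. The client keeps only a trusted root hash and checks each returned row against it.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(rows):
    """Build the tree bottom-up; odd-sized levels repeat their last node."""
    if not rows:
        raise ValueError("need at least one row")
    level = [_h(b"\x00" + r) for r in rows]  # 0x00 prefix marks leaf hashes
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]
        levels.append(level)
        if len(level) == 1:
            break
        # 0x01 prefix marks interior hashes
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return levels


def root(levels):
    return levels[-1][0]


def prove(levels, index):
    """Server side: collect the sibling hashes on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(row, proof, trusted_root):
    """Client side: recompute the root from the returned row and its proof."""
    node = _h(b"\x00" + row)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = _h(b"\x01" + pair)
    return node == trusted_root
```

The proof carries one hash per tree level, so its size grows logarithmically with the number of rows; that property is what makes this family of structures a candidate for scalable integrity checks.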
On a More Robust, Secure Cloud-Based Hadoop
A research team from National Sun Yat-sen University in Taiwan is also tackling security issues for a cloud-based Hadoop framework, while keeping an eye on the important matter of overall application performance.
The researchers note that cloud computing platforms offer a convenient solution for addressing challenges of processing large-scale data in both academia and industry, beyond what could be achieved with traditional on-site clusters. They say, however, that while there are a great number of on-line cloud services that are attractive environments, the security issue is getting more and more significant for cloud users.
According to the group, Hadoop-based cloud platforms are now a well-known service framework, so they have focused their investigation on Hadoop's authentication and encryption mechanisms.
The team has constructed what they call a secure Hadoop platform with small deployment cost, robust attack prevention, and less performance degradation. To prove their model they have run a number of simulations to evaluate the performance under different parametric settings and cryptographic algorithms.
The researchers note that the simulation results show the security mechanisms to be feasible. They also find that the most important consideration in building cloud platforms with appropriate security mechanisms is the application's requirements, which lead to a better trade-off between security and user needs.
Pegging the Outliers in Big Data
According to a research team from the Centre for AI and Robotics in Bangalore, India, the rapid growth in the field of data mining has led to the development of various methods for outlier detection.
Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. To address this, the team has proposed a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers.
In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. They say the proposed algorithm can perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data.
The team says that unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm was demonstrated through experiments on various public domain categorical data sets.
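As a rough illustration of ranking by attribute value frequencies, the Python sketch below scores each record by how common its values are and treats the lowest-scoring records as the likeliest outliers. It covers only a simplified frequency-based ranking, not the clustering phase or the team's own definition of outliers, and the function names are hypothetical.

```python
from collections import Counter


def frequency_scores(records):
    """Average, per record, of how often each of its attribute values occurs."""
    n_attrs = len(records[0])
    counts = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [
        sum(counts[j][r[j]] for j in range(n_attrs)) / n_attrs
        for r in records
    ]


def rank_outliers(records, k):
    """Indices of the k records with the rarest attribute values."""
    scores = frequency_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:k]


records = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("green", "huge"),
]
most_likely_outlier = rank_outliers(records, 1)
```

In this sketch the scoring pass touches every record once regardless of `k`; the number of outliers requested only affects how much of the sorted ranking is returned.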
MapReduce Paves the Way for CBIR
Recently, content-based image retrieval (CBIR) has become an active research focus owing to its wide range of applications, such as crime prevention, medicine, historical research and digital libraries.
As a research team from the School of Science, Information Technology and Engineering at the University of Ballarat, Australia, has suggested, image collections held in databases at distributed locations over the Internet make it challenging to retrieve images relevant to user queries efficiently and accurately.
The researchers say that with this in mind, it has become increasingly important to develop new CBIR techniques that are effective and scalable for real-time processing of very large image collections. To address this, they offer up a novel MapReduce neural network framework for CBIR over large data collections in a cloud environment.
The team has adopted natural language queries that use a fuzzy approach to classify the color of images based on their content, and applies Map and Reduce functions that can operate in cloud clusters to arrive at accurate results in real time. According to the team, preliminary experimental results for classifying and retrieving images from large data sets were convincing enough to justify further experimental evaluation.
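The general shape of a fuzzy color classification expressed as Map and Reduce steps can be sketched in plain Python. This is not the Ballarat team's framework: there is no neural network and no Hadoop here, a color name stands in for a natural language query, and the hue centers, triangular membership function and all names are assumptions chosen for illustration.

```python
import colorsys
from collections import defaultdict

# Assumed hue centers in degrees for six named colors.
HUE_CENTERS = {
    "red": 0, "yellow": 60, "green": 120,
    "cyan": 180, "blue": 240, "magenta": 300,
}


def fuzzy_memberships(rgb):
    """Degree (0..1) to which one RGB pixel belongs to each named color."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    hue = h * 360.0
    result = {}
    for name, center in HUE_CENTERS.items():
        d = abs(hue - center)
        d = min(d, 360.0 - d)  # hue is circular
        # Triangular membership, scaled by saturation so grays score zero.
        result[name] = max(0.0, 1.0 - d / 60.0) * s
    return result


def map_image(image_id, pixels):
    """Map step: emit (color, (image_id, average membership)) pairs."""
    totals = defaultdict(float)
    for p in pixels:
        for name, m in fuzzy_memberships(p).items():
            totals[name] += m
    for name, total in totals.items():
        degree = total / len(pixels)
        if degree > 0:
            yield name, (image_id, degree)


def reduce_color(color, values):
    """Reduce step: rank the images for one color by membership degree."""
    return color, sorted(values, key=lambda v: v[1], reverse=True)


def run_job(images):
    """Single-process stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for image_id, pixels in images.items():
        for key, value in map_image(image_id, pixels):
            groups[key].append(value)
    return dict(reduce_color(k, v) for k, v in groups.items())


images = {
    "sunset": [(255, 0, 0), (255, 0, 0)],
    "sea": [(0, 0, 255), (0, 0, 255)],
    "mixed": [(255, 0, 0), (0, 0, 255)],
}
index = run_job(images)
```

Looking up `index["red"]` then returns the images ranked by how red they are. Because each image is mapped independently and each color is reduced independently, both steps could in principle be spread across cluster nodes.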