In this mid-November edition of the Week in Big Data Research and Development we touch on a number of interesting projects that take aim at very specific problems, ranging from MapReduce, to security and energy efficiency in big data cloud environments, to new SQL query approaches that could alter performance expectations.
The collection this week features research work from a number of universities, with a larger number this edition coming out of China, which is the source of a great deal of MapReduce-related research as of late. In case you missed it, last week's edition can be found here.
Without further delay let’s dive in with:
Self-Adjusting to Big Data
Umut Acar from Carnegie Mellon University and Yan Chen from the Max Planck Institute for Software Systems recently published research on the use of “self-adjusting computation” with large streaming data sets.
The duo argues that since many big data computations involve processing data that changes incrementally or dynamically over time, users are left with existing techniques that make such computations impractical. For example, computing the frequency of words in the first ten thousand paragraphs of a publicly available Wikipedia data set in a streaming fashion using MapReduce can take as much as a full day.
Acar and Chen propose an approach based on self-adjusting computation that they claim can dramatically improve the efficiency of such computations. To highlight this, they demonstrate how they were able to perform the aforementioned streaming computation in just a couple of minutes.
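To make the contrast concrete, here is a minimal Python sketch of the general idea behind incremental recomputation for the word-frequency example. It is not Acar and Chen's system, which the summary above does not describe in detail; the class and function names are hypothetical and chosen only for illustration. The point is that when one paragraph changes, only that paragraph's contribution to the totals is touched, instead of recounting the whole data set.

```python
from collections import Counter


class IncrementalWordCount:
    """Keeps word frequencies up to date as paragraphs are added, edited, or removed.

    Hypothetical illustration only; not the authors' implementation.
    """

    def __init__(self):
        self.totals = Counter()    # word -> count across all paragraphs
        self.per_paragraph = {}    # paragraph id -> Counter for that paragraph

    def _retract(self, old_counts):
        # Remove a paragraph's old contribution and drop words that fall to zero.
        self.totals.subtract(old_counts)
        for word in old_counts:
            if self.totals[word] == 0:
                del self.totals[word]

    def upsert(self, pid, text):
        """Add a new paragraph or replace an existing one, touching only its words."""
        new_counts = Counter(text.lower().split())
        self._retract(self.per_paragraph.get(pid, Counter()))
        self.totals.update(new_counts)
        self.per_paragraph[pid] = new_counts

    def remove(self, pid):
        """Delete a paragraph and retract its words from the totals."""
        self._retract(self.per_paragraph.pop(pid, Counter()))


def recount_from_scratch(paragraphs):
    """Baseline: recompute every count by rereading all paragraphs."""
    totals = Counter()
    for text in paragraphs.values():
        totals.update(text.lower().split())
    return totals
```

In this sketch an edit costs time proportional to the size of the changed paragraph, while the baseline costs time proportional to the whole collection on every change.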
Team Debuts High-Throughput SQL Query System
A group of researchers from the Technology Center of Software Engineering at the Institute of Software within the Chinese Academy of Sciences in Beijing has proposed a new performance-conscious SQL query system designed for high throughput on big data applications.
The researchers note that while relational queries continue to play an important role in data analysis, scaling out a traditional SQL query system remains a problem. To address this bottleneck, the team describes a fast, high-throughput, scalable system that handles read-only SQL queries while taking advantage of the distributed architecture of NoSQL stores.
In explaining the approach, the team details how they adopted HBase as the storage layer and designed a distributed query engine (DQE) that works with it to perform SQL queries. Their system also includes distinctive index and cache mechanisms to accelerate query processing.
To put the concept to a practical test, the team evaluated their system with real-world big data crawled from Sina Weibo, demonstrating that it achieved good performance across nineteen representative SQL queries.
Whittling Down MapReduce
A group of researchers from Renmin University of China in Beijing has taken aim at the top-k query within MapReduce, which they say is one of the most useful MapReduce queries for working with big data sets.
The team believes that MapReduce is useful but carries a significant risk of leaking users' personal information, especially when the data is sensitive, such as health records or salary information. To counter this, differential privacy has recently emerged as a new paradigm for protecting private data, one that makes it possible to provide strong theoretical guarantees on the privacy and utility of query results.
Motivated by this, the team proposes an efficient algorithm called DiffMR (Differentially private Top-k query over MapReduce) that processes top-k queries while satisfying differential privacy. To avoid leaking private information in intermediate steps, the algorithm uses an exponential mechanism with a score function to select the top-k records from big data sets. When the data set is too large to yield a reasonably accurate result, they can reduce the reject rate and run several more MapReduce rounds to get a more accurate top-k query result.
They show how, after obtaining a final top-k candidate result, they can add Laplace noise to each record and apply a post-processing technique to improve the accuracy of query answers. Their experimental study indicates that the DiffMR algorithm can answer top-k queries accurately in the MapReduce framework.
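The two building blocks named above, the exponential mechanism and Laplace noise, can be sketched on a single machine in a few lines of Python. This is a textbook-style illustration and not the DiffMR algorithm itself: the MapReduce distribution, the reject-rate tuning, and the team's post-processing step are not reproduced, and the function names, the even split of the privacy budget, and the default sensitivity of 1.0 are all assumptions made for the example.

```python
import math
import random


def exponential_mechanism_pick(scores, epsilon, sensitivity=1.0, rng=random):
    """Pick one key with probability proportional to exp(epsilon * score / (2 * sensitivity))."""
    keys = list(scores)
    top = max(scores[k] for k in keys)
    # Subtracting the max score avoids overflow and leaves the distribution unchanged.
    weights = [math.exp(epsilon * (scores[k] - top) / (2.0 * sensitivity)) for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_top_k(scores, k, epsilon, sensitivity=1.0, rng=random):
    """Select k keys without replacement, then release their scores with Laplace noise.

    Assumed budget split: half of epsilon for selection, half for noise,
    each divided evenly over the k rounds (sequential composition).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    eps_select = epsilon / 2.0 / k
    eps_noise = epsilon / 2.0 / k
    remaining = dict(scores)
    chosen = []
    for _ in range(min(k, len(remaining))):
        key = exponential_mechanism_pick(remaining, eps_select, sensitivity, rng)
        chosen.append(key)
        del remaining[key]
    noisy = {key: scores[key] + laplace_noise(sensitivity / eps_noise, rng) for key in chosen}
    # Sorting by noisy score is post-processing and uses no additional privacy budget.
    return sorted(noisy.items(), key=lambda kv: kv[1], reverse=True)
```

A call such as `private_top_k({"a": 100.0, "b": 90.0, "c": 5.0}, k=2, epsilon=1.0)` returns two (key, noisy score) pairs. Smaller values of `epsilon` give stronger privacy but make it more likely that a lower-scoring record is selected and that the released scores are far from the true ones.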
Cooling Off on Big Data Clouds
Rini Kaushik and Klara Nahrstedt from the University of Illinois, Urbana-Champaign prepared research for the annual supercomputing show (SC12) this year that addresses the power and cooling costs of data-intensive computing.
The team says that the explosion in data has led to a surge in extremely large-scale analytics platforms, resulting in burgeoning energy costs. This wider adoption of advanced big data analytics platforms also involves the need for strong data-locality for computational performance, which usually means moving computations to data.
The researchers argue that state-of-the-art cooling energy management techniques rely on thermal-aware computational job placement and migration and are inherently data-placement-agnostic. To counter these problems, they propose a new approach called T*, which takes a novel, data-centric view of reducing cooling energy costs and ensuring the thermal reliability of the servers.
The team states that T* is cognizant of the uneven thermal profile and differences in thermal-reliability-driven load thresholds of the servers, and of the differences in the computational job arrival rate, size, and evolution life spans of the big data placed in the cluster. Based on this knowledge, and coupled with its predictive file models and insights, T* does proactive, thermal-aware file placement, which implicitly results in thermal-aware job placement in the big data analytics compute model.
To put their approach in context, they present evaluation results with month-long real-world big data analytics production traces from Yahoo!, which show up to a 42% reduction in cooling energy costs with T*, courtesy of its lower and more uniform thermal profile, and 9x better performance than state-of-the-art data-agnostic cooling techniques.
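As a rough illustration of what thermal-aware file placement can mean, the following Python sketch assigns files to servers greedily by remaining thermal headroom. It is a toy heuristic and not T*: the summary above does not give the researchers' predictive file models or placement policy, so the function name, the load units, and the idea of a single predicted load per file are all assumptions.

```python
def place_files(files, servers):
    """Greedy sketch: place the most demanding files first, each on the server
    with the most remaining thermal headroom.

    files:   dict of file name -> predicted load (hypothetical units)
    servers: dict of server name -> load threshold derived from thermal
             reliability limits (same hypothetical units)
    """
    load = {name: 0.0 for name in servers}
    placement = {}
    for fname, predicted in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        # Headroom is the load a server can still absorb before reaching its threshold.
        best = max(servers, key=lambda s: servers[s] - load[s])
        if servers[best] - load[best] < predicted:
            raise ValueError(f"no server has headroom for {fname}")
        placement[fname] = best
        load[best] += predicted
    return placement, load
```

Because jobs in such a system are scheduled next to the data they read, deciding where a file lives also decides, implicitly, which server will later run the jobs that heat it up.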
Toward Secure, Distributed Big Data Storage
A duo from the National Institute of Information and Communications Technology in Tokyo has presented a new approach to secure distributed storage for bulk data that emphasizes security in a novel manner.
The researchers suggest that distributed data storage techniques are especially important in cases where data centers are compromised by major natural disasters or malicious users, or where data centers consist of nodes with low security and reliability. They claim that although techniques using secured distribution and Reed-Solomon coding have been proposed to cope with these issues, those techniques are not efficient enough, in terms of return on investment, for dealing with big data in cloud computing.
The approach the team presents maintains higher security levels by using packaging techniques that do not require the key management inherent in AES encryption. The team says that, beyond this, the approach scales out so that it can store large amounts of data safely and securely. The architecture's performance is also evaluated in terms of storage efficiency and security.