November 17, 2012

The Week in Big Data Research

In this mid-November edition of the Week in Big Data Research and Development we touch on a number of interesting projects that take aim at very specific problems with everything from MapReduce, to security and energy efficiency in big data cloud environments, to new SQL query approaches that could alter performance expectations.

The collection this week features research work from a number of universities, with a larger number this edition coming out of China, which is the source of a great deal of MapReduce-related research as of late. In case you missed it, last week's edition can be found here.

Without further delay let’s dive in with:

Self-Adjusting to Big Data

Umut Acar from Carnegie Mellon University and Yan Chen from the Max Planck Institute for Software Systems recently published research on the use of “self-adjusting computation” with large streaming data sets.

The duo argues that many big data computations involve processing data that changes incrementally or dynamically over time, yet existing techniques make such computations impractical. For example, computing the frequency of words in the first ten thousand paragraphs of a publicly available Wikipedia data set in a streaming fashion using MapReduce can take as much as a full day.

Acar and Chen propose an approach based on self-adjusting computation that they claim can dramatically improve the efficiency of such computations. To highlight this, they demonstrate how they were able to perform the aforementioned streaming computation in just a couple of minutes.
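The paper's actual system relies on dependency-tracking machinery well beyond what fits here, but the core idea, reusing prior work and processing only what changed, can be illustrated with a minimal Python sketch (not the authors' code; the class and its methods are hypothetical):

```python
from collections import Counter

class IncrementalWordCount:
    """Maintain word frequencies over a stream of paragraphs,
    updating only the delta instead of recomputing from scratch."""

    def __init__(self):
        self.counts = Counter()

    def add_paragraph(self, text):
        # Only the new paragraph is processed; earlier work is reused.
        self.counts.update(text.lower().split())

    def remove_paragraph(self, text):
        # An incremental deletion subtracts just that paragraph's contribution.
        self.counts.subtract(text.lower().split())

wc = IncrementalWordCount()
wc.add_paragraph("big data meets big compute")
wc.add_paragraph("big ideas")
print(wc.counts["big"])  # 3
```

A batch MapReduce job would rescan all prior paragraphs on every update; the incremental formulation makes each update's cost proportional to the change, which is the source of the speedup the authors report.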

Team Debuts High-Throughput SQL Query System

A group of researchers from the Technology Center of Software Engineering at the Institute of Software within the Chinese Academy of Sciences in Beijing has proposed a new performance-conscious SQL query system designed for high throughput on big data applications.

The researchers note that while relational queries play an important role in data analysis, traditional SQL query systems remain difficult to scale out. To address this bottleneck, the team describes a fast, high-throughput, scalable system that performs read-only SQL queries well while retaining the advantages of NoSQL's distributed architecture.

In explaining the approach, the team details how they adopted HBase as the storage layer and designed a distributed query engine (DQE) that works with it to perform SQL queries. Their system also contains distinctive index and cache mechanisms to accelerate query processing.
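The paper's DQE internals are not spelled out here, but the general pattern, translating a read-only query into a range scan over a sorted key-value store and caching results, can be sketched in a few lines of Python. Everything below is a stand-in: the tiny store mimics an HBase table's sorted row keys, and all class and method names are hypothetical.

```python
import bisect

class TinyKVStore:
    """Sorted key-value store standing in for an HBase table."""
    def __init__(self, rows):
        self.keys = sorted(rows)
        self.rows = rows

    def scan(self, start, stop):
        # Range scan over sorted row keys, like an HBase Scan with
        # a start row (inclusive) and stop row (exclusive).
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        for k in self.keys[lo:hi]:
            yield k, self.rows[k]

class QueryEngine:
    """Hypothetical read-only engine with a simple result cache."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def select_range(self, start, stop):
        key = (start, stop)
        if key not in self.cache:  # a cache hit avoids a rescan
            self.cache[key] = list(self.store.scan(start, stop))
        return self.cache[key]

store = TinyKVStore({"user#001": "alice", "user#002": "bob", "user#010": "carol"})
engine = QueryEngine(store)
print(engine.select_range("user#001", "user#003"))
# [('user#001', 'alice'), ('user#002', 'bob')]
```

Because repeated read-only queries hit the cache rather than the storage layer, throughput improves without any write-consistency concerns, which is presumably why the team restricts the system to read-only SQL.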

To put the concept to a practical test, the team evaluated their system with real-world big data crawled from Sina Weibo, demonstrating that it achieved good performance on nineteen representative SQL queries.

Whittling Down MapReduce

A group of researchers from Renmin University of China in Beijing has taken aim at the top-k query, which they say is one of the most useful queries in MapReduce for working with big data sets.

The team believes that MapReduce is useful, but that it carries a significant risk of leaking users' personal information, especially when the data is sensitive, such as health records or salary information. To counter this, the concept of differential privacy has recently emerged as a new paradigm for preserving private data, making it possible to provide strong theoretical guarantees on the privacy and utility of query results.

Motivated by this, the team proposes an efficient algorithm called DiffMR (Differentially private Top-k query over MapReduce) for processing top-k queries while satisfying differential privacy. To avoid privacy leaks in intermediate steps, the algorithm uses an exponential mechanism to select the top-k records from big data sets according to a score function. When the data set is too large to get a reasonably accurate result in one pass, they can reduce the rejection rate and run several additional MapReduce rounds to obtain a more accurate top-k result.

They demonstrate how, after obtaining a final top-k candidate result, they add Laplace noise to each record and adopt a post-processing technique to improve the accuracy of the query answers. Their experimental study demonstrates that the DiffMR algorithm can answer top-k queries accurately within the MapReduce framework.
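The two standard building blocks named above, the exponential mechanism and Laplace noise, can be illustrated on a single machine. The sketch below is not the authors' DiffMR implementation (it omits the MapReduce machinery entirely, and the budget split across picks is one common convention, not necessarily theirs); it only shows the primitives:

```python
import math
import random

def exponential_mechanism_topk(records, score, k, epsilon, sensitivity=1.0):
    """Draw k records without replacement, each with probability
    proportional to exp(eps_per_pick * score / sensitivity)."""
    chosen, pool = [], list(records)
    eps_per_pick = epsilon / (2 * k)  # split the privacy budget across k picks
    for _ in range(k):
        weights = [math.exp(eps_per_pick * score(r) / sensitivity) for r in pool]
        pick = random.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

def laplace_noise(scale):
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

random.seed(0)
records = [("a", 50), ("b", 48), ("c", 5), ("d", 3)]
top2 = exponential_mechanism_topk(records, score=lambda r: r[1], k=2, epsilon=2.0)
# Noisy counts released alongside the chosen records.
noisy = [(name, count + laplace_noise(scale=1.0)) for name, count in top2]
```

The key property is that the selection is randomized: high-scoring records are chosen with high probability, but no single individual's presence in the data deterministically changes the output, which is what yields the formal privacy guarantee.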

Cooling Off on Big Data Clouds

Rini Kaushik and Klara Nahrstedt from the University of Illinois, Urbana-Champaign prepared research for the annual supercomputing show (SC12) this year that addresses the power and cooling costs of data-intensive computing.

The team says that the explosion in data has led to a surge in extremely large-scale analytics platforms, resulting in burgeoning energy costs. Wider adoption of advanced big data analytics platforms also brings a need for strong data locality for computational performance, which usually means moving computations to the data.

The researchers argue that state-of-the-art cooling energy management techniques rely on thermal-aware computational job placement and migration and are inherently data-placement-agnostic. To counter these problems, they propose a new approach, called T*, which takes a novel, data-centric view of reducing cooling energy costs and ensuring the thermal reliability of the servers.

The team states that T* is cognizant of the uneven thermal profiles and differing thermal-reliability-driven load thresholds of the servers, as well as differences in computational job arrival rates and in the size and evolution life spans of the big data placed in the cluster. Based on this knowledge, coupled with its predictive file models and insights, T* performs proactive, thermal-aware file placement, which implicitly results in thermal-aware job placement in the big data analytics compute model.

To put their approach in context, they present evaluation results with month-long, real-world big data analytics production traces from Yahoo!, which show up to a 42 percent reduction in cooling energy costs with T*, courtesy of its lower and more uniform thermal profile, and 9x better performance than state-of-the-art data-agnostic cooling techniques.
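T*'s predictive file models are the substance of the paper, but the basic placement idea, steering files (and therefore the jobs that follow them) toward cool servers with thermal headroom, admits a toy greedy sketch. Everything below is illustrative only: the server fields, the load model, and the greedy rule are assumptions, not the paper's algorithm.

```python
def place_file(servers, expected_load):
    """Greedy, thermal-aware placement: among servers whose
    thermal-reliability-driven load threshold still has headroom for
    the file's expected computational load, pick the coolest one."""
    candidates = [s for s in servers
                  if s["load"] + expected_load <= s["threshold"]]
    if not candidates:
        raise RuntimeError("no server has thermal headroom")
    target = min(candidates, key=lambda s: s["temp_c"])
    target["load"] += expected_load  # jobs will follow the data here
    return target["name"]

servers = [
    {"name": "rack1-s1", "temp_c": 31.0, "load": 0.7, "threshold": 0.9},
    {"name": "rack1-s2", "temp_c": 24.5, "load": 0.4, "threshold": 0.9},
    {"name": "rack2-s1", "temp_c": 27.0, "load": 0.2, "threshold": 0.8},
]
print(place_file(servers, expected_load=0.3))  # rack1-s2
```

Because data-local schedulers send jobs to the servers holding their input, thermal-aware file placement implicitly becomes thermal-aware job placement, which is the insight the summary above describes.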

Toward Secure, Distributed Big Data Storage

A duo from the National Institute of Information and Communications Technology in Tokyo has presented a new approach to secure distributed storage for bulk data that emphasizes security in a novel manner.

The researchers suggest that distributed data storage techniques are important, especially for cases where data centers are compromised by major natural disasters or malicious users, or where data centers consist of nodes with low security and reliability. While techniques using secured distribution and Reed-Solomon coding have been proposed to cope with this issue, they claim these are not efficient enough for dealing with big data in cloud computing in terms of return on investment.

The approach the team presents maintains higher security levels by using packaging techniques that do not require the key management inherent in AES encryption. Aside from this, the team says, it scales out so that it can store large amounts of data safely and securely. The performance of the architecture is also evaluated in terms of storage efficiency and security.
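The paper's packaging technique is not described in detail here. As a minimal illustration of the general idea of keyless secure distribution, the sketch below splits data into XOR shares, so that each data center holds a share that reveals nothing on its own and no encryption key exists to manage. This is an assumption-laden stand-in, not the authors' scheme, and unlike Reed-Solomon coding it provides no fault tolerance: all shares are required to reconstruct.

```python
import os
from functools import reduce

def xor_bytes(a, b):
    # Bytewise XOR of two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

def split(data, n):
    """n-1 uniformly random shares plus one XOR share; all n are
    needed to reconstruct, and fewer than n reveal nothing."""
    shares = [os.urandom(len(data)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, data))
    return shares

def combine(shares):
    # XOR of all shares cancels the random pads, recovering the data.
    return reduce(xor_bytes, shares)

shares = split(b"replicated across data centers", 4)
print(combine(shares))  # b'replicated across data centers'
```

A production scheme would layer erasure coding on top so that losing a data center does not lose the data, which is presumably where the paper's trade-off between security, storage efficiency, and return on investment arises.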
