February 16, 2013

The Week in Research

Datanami Staff

In this week’s research brief we home in on some specific tweaks and refinements to the Hadoop stack, including at the HDFS level. From creating a new file system for the platform to evaluating how modern architectures influence the course of big data (not to mention results from an actual course on data-intensive apps), we have some interesting research news to share.

Creating a Reliable Hadoop-Based File System

According to a group of researchers from the School of Computer Science at the National University of Defense Technology in China, there is a clear and present need for a trusted file system for Hadoop that addresses overall data security.

The team describes their design, which uses recent cryptography, namely fully homomorphic encryption and authentication agent technology, to ensure reliability and safety at three levels: hardware, data, and users and operations.

According to the researchers, homomorphic encryption allows computation to be carried out directly on encrypted data, protecting the security of the data without sacrificing the efficiency of the application. The authentication agent technology offers a variety of access control rules, combining access control mechanisms, privilege separation and security audit mechanisms, to ensure the safety of the data stored in the Hadoop file system.
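To see what "operating on encrypted data" means in miniature, here is a toy demonstration of a homomorphic property using textbook (unpadded) RSA, where multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. This is only an illustration of the concept; the researchers' system uses fully homomorphic encryption, which supports far more general computation, and the tiny primes below are for readability, not security.

```python
# Toy homomorphic property: with textbook RSA, a server can multiply
# ciphertexts without ever seeing the plaintexts. Illustrative only --
# not the paper's scheme, and far too small to be secure.

p, q = 61, 53                        # small primes for illustration only
n = p * q                            # public modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

m1, m2 = 7, 6
c1, c2 = encrypt(m1), encrypt(m2)

# Multiply the ciphertexts without decrypting either one.
c_prod = (c1 * c2) % n

assert decrypt(c_prod) == (m1 * m2) % n   # the server computed 7 * 6 blindly
```

In a Hadoop setting, the appeal is that a storage or compute node could evaluate such operations over data it is never able to read.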



Big Data on Modern Hardware Architectures

Big data analytics aims to analyze the massive datasets that increasingly arise in web-scale business intelligence problems.

According to analysis presented in a recent book from Michael Saecker and Volker Markl, the common strategy for handling these workloads is to distribute the processing across massively parallel analysis systems or to use big machines able to handle the workload on their own. On this note, the duo discusses massively parallel analysis systems and their programming models.

In addition, they also address the application of modern hardware architectures for database processing, noting that today, many different hardware architectures apart from traditional CPUs can be used to process data.

The team says that GPUs and FPGAs, among other new hardware, are usually employed as co-processors to accelerate query execution. What these architectures have in common is their massive inherent parallelism, along with a programming model that differs from that of classical von Neumann CPUs. They argue that such hardware offers the processing capability to distribute the workload between the CPU and its co-processors, enabling systems to process bigger workloads.
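The partitioning pattern behind this kind of acceleration can be sketched without any special hardware. The following CPU-only analogue (an illustrative assumption, not code from the book) splits a scan-and-aggregate "query" into partitions, fans them out to parallel workers, and merges the partial results; on a real co-processor the same partitioning would be spread across thousands of GPU threads or FPGA pipelines.

```python
# Partition-and-merge: the data-parallel shape of a scan/aggregate query.
# Threads here merely stand in for the massive parallelism of a GPU.

from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker evaluates the predicate and aggregates its partition.
    return sum(x for x in chunk if x % 2 == 0)

def parallel_query(data, workers=4):
    # Split the input into roughly equal partitions, one per worker.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Merge the partial aggregates into the final answer.
        return sum(pool.map(partial_sum, chunks))

data = list(range(100))
assert parallel_query(data) == sum(range(0, 100, 2))
```

The merge step is cheap precisely because each partition is processed independently, which is why this shape maps so well onto massively parallel hardware.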



HPC Computation on Hadoop Storage

A team of researchers from Carnegie Mellon University, including storage pioneer Garth Gibson, discussed the possibilities of running HPC workloads on Hadoop storage using an adapted version of the Parallel Log Structured File System (PLFS).

The team describes how they adapted PLFS to enable HPC applications to read and write data from the HDFS cloud storage subsystem.

They say their enhanced version of PLFS offers HPC applications the ability to write concurrently from multiple compute nodes into a single file stored in HDFS, which allows these applications to checkpoint.
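The trick that makes this possible is log structuring: since HDFS permits only one writer per file, each node appends to its own private log and records where its bytes belong in the shared logical file, and a reader reassembles the logical file from the logs and index. The sketch below is a deliberately simplified, in-memory illustration of that idea under our own naming, not the CMU implementation.

```python
# Minimal sketch of the log-structured remapping idea behind PLFS:
# N writers -> N append-only logs plus an index, read back as one file.

class LogStructuredFile:
    def __init__(self):
        self.logs = {}    # writer_id -> append-only byte log
        self.index = []   # (logical_offset, length, writer_id, log_offset)

    def write(self, writer_id, logical_offset, data):
        # Each writer only ever appends to its own log (HDFS-friendly),
        # while the index remembers where the data logically belongs.
        log = self.logs.setdefault(writer_id, bytearray())
        self.index.append((logical_offset, len(data), writer_id, len(log)))
        log.extend(data)

    def read(self):
        # Reassemble the single logical file by replaying the index.
        size = max(off + n for off, n, _, _ in self.index)
        out = bytearray(size)
        for off, n, writer, log_off in self.index:
            out[off:off + n] = self.logs[writer][log_off:log_off + n]
        return bytes(out)

f = LogStructuredFile()
# Two "nodes" write disjoint ranges of one logical checkpoint file,
# in whatever order they happen to arrive.
f.write("node2", 7, b"world!")
f.write("node1", 0, b"hello, ")
assert f.read() == b"hello, world!"
```

Because every physical write is an append to a private log, the concurrent N-to-1 checkpoint pattern never violates HDFS's single-writer rule.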

The group’s results show that HDFS, when combined with their PLFS HDFS I/O storage module, can handle a concurrent-write checkpoint workload generated by a benchmark with good performance.



On Teaching MapReduce via Clouds

A team from the University of California, Berkeley, described their experiences teaching MapReduce in a large undergraduate lecture course using public cloud services and the standard Hadoop API.

Using the standard API, students directly experienced the quality of industrial big data tools. Using the cloud, every student was presented with the opportunity to carry out scalability benchmarking assignments on realistic hardware, which would have been impossible otherwise.
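To give a flavor of the programming model the students worked in, here is the canonical word-count example with the map, shuffle, and reduce phases simulated locally. The course itself used the standard Hadoop API (Java); this Python sketch is only an illustration of the model's shape.

```python
# Word count in MapReduce style: map emits (key, value) pairs, the
# shuffle groups values by key, and reduce aggregates each group.
# Local simulation only -- Hadoop runs these phases across a cluster.

from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield word, 1                 # emit a (word, 1) pair per token

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:              # group values by key, as the
        groups[key].append(value)         # framework does between phases
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big tools", "data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
assert counts == {"big": 2, "data": 2, "tools": 2}
```

The appeal for teaching is that students write only the map and reduce functions; the framework handles the shuffle, distribution, and fault tolerance.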

The group reports that more than 500 students took the course over two semesters. At that scale, they believe this is the first large-scale demonstration that pay-as-you-go cloud billing is feasible for a large undergraduate course.

The team says that modest instructor effort was sufficient to prevent students from overspending. Average per-pupil expenses in the cloud were under $45. Students were enthusiastic about the assignment: 90 percent said it should be retained in future course offerings.



Making Urban Data Accessible

A group of researchers from Harvard’s and MIT’s urban design and planning departments looked at a number of new technologies that allow for new ways to sense the city.

Thinking of urban data as substance, the team criticizes a certain approach to urban data analysis when it comes to identifying patterns and deriving narratives from them. In this approach, coined here as the ‘big data’ approach, having access to large-volume datasets is considered sufficient to study a phenomenon and the very dynamics the data refers to.

In contrast, the researchers point to the ways in which data can be produced, modified and delivered. At each of these steps there is an ongoing cost-benefit analysis, on the basis of which a series of decisions about the resolution and quality of the data, viewed through the lens of filtering, must be made with the goal of accessing data and making it accessible as effectively as possible. Projects from the MIT SENSEable City Lab are used to illustrate these ideas.