ScaleOut Ships Its Own MapReducer for Hadoop
In the world of big data analytics, vendors of all stripes are angling to improve upon Hadoop. For many, this means bypassing the batch-based framework and enabling Hadoop to work in a real-time manner, which opens the door to many new uses for Hadoop. This is also what ScaleOut Software is attempting to do with hServer, an in-memory data grid for Hadoop MapReduce that it first launched in June.
ScaleOut Software says the new MapReduce engine that it unveiled for its in-memory data grid today will run standard MapReduce applications 20 times faster than out-of-the-box Hadoop can. hServer V2 won’t work on really big data in the petabyte range, but ScaleOut says it is quite effective for analyzing real-time operational data, as well as testing and debugging small slices of full MapReduce workloads.
The big new feature in hServer V2 is the inclusion of ScaleOut’s own MapReduce engine. ScaleOut’s MapReducer is almost a plug-and-play replacement for the standard MapReduce engine, according to ScaleOut CEO Bill Bain. It uses the same open source API, which enables all existing MapReduce applications to run on ScaleOut’s hServer. The only difference is that it requires one line of code to be inserted, Bain says.
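The claim of API compatibility is easier to see with the MapReduce programming model in front of you. The sketch below is a minimal, self-contained word count in Python, not ScaleOut's actual API: the point is that a job is just a map function plus a reduce function, so a drop-in engine that honors the same contract can run existing jobs while swapping out how the work is scheduled and where intermediate data lives.

```python
# Minimal sketch of the MapReduce programming model (word count).
# hServer's real API is not shown; a drop-in engine preserves this
# same map/reduce contract, so the job code itself need not change.

from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reducer: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(records):
    # An alternative engine would replace how these phases are
    # scheduled and stored -- not the user-written functions.
    return reduce_phase(map_phase(records))

counts = run_job(["real time data", "real time analytics"])
```

In a real Hadoop job the same two functions would be supplied as `Mapper` and `Reducer` classes; the engine underneath is what ScaleOut replaces.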
So, what does this get the customer? Why would customers want to run a MapReducer from a small Bellevue, Washington, company instead of the one that came with their Hadoop distribution?
The answers, according to Bain, are speed and flexibility.
“We’re implementing a Hadoop MapReduce engine because that’s what customers need in order to do real-time analytics,” he tells Datanami. “We’ve measured a 20x improvement in performance over the standard Apache Hadoop distribution using this new engine. This enables users to do Hadoop MapReduce analysis on that data while it’s changing and being updated continuously in their operational systems.”
Bain sees hServer being used to implement real-time analytic applications in several areas. Financial services is one of the primary examples. The software could be used to keep track of hedging strategies and positions in a real time manner. It could also be used in an ecommerce setting to perform periodic reconciliation of orders and inventory, and to generate alerts if there are big shortfalls in supply.
Managing sensor data is another potential use for ScaleOut’s technology. “We have a customer who wants to use Hadoop on sensor data that it’s collecting from many sites,” he says. “This is an aeronautics company. They’re stuck with the problem that there’s no place to put that data that’s fast enough to ingest it and analyze it in real time.”
The data sets that hServer V2 will be asked to analyze range in size from, say, 50GB up to perhaps 10TB, Bain says. “They don’t tend to be large data warehouse-size data sets in the petabyte range,” he says. “This is working on operational data. These are orders pending or hedging strategies being evaluated or credit card activity that’s being checked for fraud. This is data that tends to be much smaller in size and is being updated continuously. So typically the data would fit in the grid, and the MapReduce engine would act on data that’s being updated in the grid.”
These are not your big daddy Hadoop implementations. “The way to think of it is a little brother to Hadoop for real-time applications or for rapid prototyping,” Bain continues. “It’s much faster and much lighter weight. We’ve eliminated the batch scheduling, the security checks. We’re not trying to develop a multi-tenant environment. We’re trying to tightly integrate it into a live application that’s hosting and managing the operational data.”
In ScaleOut’s re-imagining of the Hadoop MapReduce architecture, data doesn’t sit still for very long, and customers don’t have the time to wait for a batch of Hadoop functions to finish processing before loading in the next batch of data.
“The problem with Hadoop is that it’s designed from the ground up for static data sets that are disk based, and that’s why HDFS is so carefully designed to do it well,” says Bain, who has a Ph.D. in electrical engineering/parallel computing and worked at Bell Labs, Intel, and Microsoft. “And it does do it very well. But you have overheads that are inherent in that environment, namely the disk I/O overhead, to move data into memory, and secondly the batch scheduling.
“What we have discovered as we peeled back the layers,” Bain continues, “is once we removed the disk-based storage and we eliminated the batch scheduling overhead by re-implementing the scheduler within our product using our PMI engine, we found there are other overheads within Hadoop in the way combining and shuffling are done that really need to be optimized to not present themselves as the new bottlenecks.”
ScaleOut’s MapReduce engine is based on its own proprietary parallel method invocation (PMI) technology, which is simply an implementation of “a very standard computation model that was developed for parallel supercomputing starting with the national labs in the 1980s,” Bain says.
“Our parallel method invocation, which we built into our APIs, is essentially the Hadoop map step with a combiner doing multi-server combining. So in some sense it’s a dialect of Hadoop. But the point is, any application that can be constructed as data parallel would run with either our PMI or Hadoop MapReduce.”
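The pattern Bain describes can be sketched in a few lines. Below is a hedged illustration of data-parallel map with per-server combining followed by a global merge; the function names are invented for the example and are not ScaleOut's PMI API. Each "partition" stands in for the slice of grid data held on one server.

```python
# Illustrative sketch of map with multi-server combining: each server
# maps its own partition and combines locally, then the partial
# results are merged. Names here are hypothetical, not ScaleOut's API.

from collections import Counter

def local_map_and_combine(partition):
    """Per-server step: map each record to word counts and combine
    them in place, so only a small combined result leaves the server."""
    combined = Counter()
    for record in partition:
        combined.update(record.split())
    return combined

def parallel_method_invocation(partitions):
    """Invoke the same method on every partition (in parallel, in a
    real grid), then merge the per-server results into one answer."""
    partials = [local_map_and_combine(p) for p in partitions]
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)

result = parallel_method_invocation([["a b a"], ["b c"]])
```

The appeal of combining on each server first is that only compact partial results cross the network, which is one way a data-parallel engine avoids the shuffle overhead Bain mentions.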
hServer V2, which was written in C for speed and runs on Windows and Linux, can sit on the same cluster of servers that house Hadoop. This approach is recommended, as it eliminates any latencies introduced by having to move data across the network.
hServer V2 is available now for Linux, and will be available for Windows in a couple of weeks. ScaleOut provides a free community edition of hServer, but it’s limited to running on a four-node cluster and a data set that doesn’t exceed 256GB.