Follow Datanami:
April 23, 2014

Crossing the Big Data Stream with DataTorrent

Alex Woodie

Enterprises eager for a competitive edge are turning to in-memory stream processing technologies to help them analyze big data in real time. The Apache Spark and Storm projects have gained lots of momentum in this area, as have some analytic NoSQL databases and in-memory data grids. Another streaming technology worth keeping an eye on is DataTorrent.

DataTorrent was founded in 2012 by two former Yahoo engineers, CEO Phu Hoang and CTO Amol Kekre, who each spent more than 12 years building big data and Hadoop solutions at the Internet giant. While at Yahoo, they saw the growing need for speed in big data processing, and so they set out on their own to build a next-generation in-memory streaming technology that can process huge amounts of data in a reliable, enterprise-worthy manner.

The DataTorrent product is one of a new class of YARN-compatible Hadoop 2.0 applications that seeks to remove MapReduce from the big data equation. There’s a bit of work left to do on the DataTorrent product, which is still in beta. But if it launches as expected later this year and delivers on all the enterprise-level stuff its developers are promising—including fault tolerance, high availability, dynamic scalability, and the capability to handle application updates without downtime–it could put an interesting new wrinkle in the race to build real-time analytic machines for the Internet of Things.

Hoang recently spoke with Datanami about the on-going explosion of data, how companies may leverage big data in the future, and what makes DataTorrent different from other stream processing engines.

“Until we came along I don’t think anybody could be talking about dealing with billions of events per second in a fault-tolerant way,” Hoang says. “But as much as we’re talking about big data today, most of that is human-generated. I think the next phase of big data–where you have massive machine-generated data from appliances, from sensors, from phones and so on–can really dwarf the kind of scale and volume that you’re seeing now.”

As the IoT fire hose ramps up, stream processing engines will become a more essential piece of the big data puzzle. Hoang says that, while the business use cases around stream processing are still being worked out, DataTorrent has the potential to be a disruptive nonetheless.

“Instead of having all these events come in and saving them in a file to be executed in batch, they now get the ability to bring them right in and get the kind of distributed processing on a massive scale that can happen,” he says. “This whole notion of digesting things and see and being able to understand your business in real time and reacting on it is still new. For a lot of enterprises, most of them have just gotten the batch Hadoop stuff to work. And now comes this new thing, and they’re excited about the opportunities.”

DataTorrent Architecture

Conceptually, the DataTorrent product works like other in-memory stream processing products. A user starts by building a data flow graph–a pipeline, essentially—that defines the data inputs, the transformations or operations, and the outputs. DataTorrent uses a collection of 400 operators that users can plug in to perform actions on the data. It works with a variety of message bus products on the data input side–such as Kafka, Rabbit MQ, Apache Flume–as well as output adapters for databases like Redis, MySQL, HBase, and Cassandra.

In September, DataTorrent contributed its library of 400 operators, called Malhar, to the open source community; it’s available at GitHub under an Apache license. Users can also write their own DataTorrent operators in Java, or wrap existing business logic around the operators to accomplish specific tasks.

DataTorrent is focusing on easing the operational aspects of running an enterprise-scale stream processing engine. “We really took the effort to separate the functional specifications from the operability of the application,” he says. “Our claim here is that DataTorrent will handle all the operability for you–how to scale up and out, how to dynamically scale as your nodes dynamically change. How do fault tolerance. How to do all that for you, and all you have to focus on is your functional logic on the top.”

Many of DataTorrent’s early customers are using the software to perform real-time ETL processes that may have previously been MapReduce jobs. That includes customers in the telecom business that are looking to pull information out of call data records (CDRs), perhaps to extract geo-location data from smart phones to make customers offers. Companies in the advertising business are also running DataTorrent to aggregate data on Web-based ads in order to optimize their spending.

In both use cases, getting a “buy” signal in real time is critical to capturing revenue; a batch-oriented system just doesn’t work as well. “It’s one thing when the batch job runs and you get your data 30 hours later, to reduce that to 10 hours. That’s an incremental value,” Hoang says.

Spark and Storm Comparisons

“But all of a sudden, when you’re able to get the processing and insights as it’s live streaming in, and you’re getting that same insight in a second—it’s a profound change for the business, because now they can actually do something about it,” he continues. “Whether it’s fraud detection or cyber security—you’re now finding these anomalies, or the things you want to detect, while the event is still happening, and the ability to take action on it is really the goal that we’re after.”

DataTorrent is facing some stiff competition in this emerging part of the big data market. Apache Spark, which includes Spark Streaming as one of several big data engines available in the open source bundle, has a huge amount of momentum right now, and is perhaps the strongest MapReduce alternative poised to pull Hadoop from version 1 to version 2.

But according to Hoang, DataTorrent retains some key advantages over Spark, in both performance and architecture. On the performance side, Hoang says internal benchmarks show DataTorrent runs 100 times faster than Spark, and 1,000 times faster than Storm.  “[CTO Amol Kekre and I] have over 15 years of streaming operability experience under our belts. We’ve made a lot of mistakes we’ve learned from,” he says. “We have some very strong architects and engineers who have done that for the better part of their lives.”

One of the ways that DataTorrent has such a speed edge over Storm is the fact that Storm issues an acknowledgement of event, Hoang says. “As you get to tens of thousands of events

DataTorrent co-founder and CEO Phu Hoang

per second, you’re spending all your time doing that bookkeeping and you start to max out,” he says.

Spark Streaming, meanwhile, drags around its MapReduce heritage, Hoang says. “The underlying architecture is MapReduce’s scheduling architecture,” he says. “They just sort of shortened the batch window so they can have smaller batches. But in the end they’re still scheduling map and reduce batch jobs, so the scheduler and the control logic is involved in every step of the processing.”

What makes DataTorrent’s architecture unique, Hoang says, is the separation of logic it implements when running on Hadoop. “We have a data flow graph, we map that onto a Hadoop cluster, then each of the operators in each of the containers are now on their own,” he says. “All their focus is to crunch incoming events and produce outgoing events. The controller is not involved. We minimize that controlling logic to keep the bookkeeping down to medium.  So all we do is stream the events in and out.”

While the speedup is good, what Hoang hopes will really differentiate DataTorrent from Spark and Storm is the capability of his product to maintain statefullness and to deliver fault tolerance. “That notion for us is key,” he says. “That concept of saving the memory footprint of each of those operators at regular intervals is a very important concept in being able to come back, self-heal, and pick up where you left off without any loss of data.”

DataTorrent is one of a new class of vendors looking to capitalize on the need for real-time processing of massive amounts of data. The company’s product won’t solve all big data needs, but if the early reports are true, it will have a bright future in the IoT.

Related Items:

Glimpsing Hadoop’s Real-Time Analytic Future

Rethinking Real-Time Hadoop

Apache Spark: 3 Real-World Use Cases