February 22, 2018

DataTorrent Glues Open Source Componentry with ‘Apoxi’


Building an enterprise-grade big data application with open source components is not easy. Anybody who has worked with Apache Hadoop ecosystem technology can tell you that. But the folks at DataTorrent say they’ve found a way to accelerate the delivery of secure and scalable big data applications with Apoxi, a new framework they created to stitch together major open source components like Hadoop, Spark, and Kafka, in an extensible and pluggable fashion.

As the commercial outfit behind the Apache Apex stream processing engine, DataTorrent has a front-row seat to the open source software community. But Guy Churchward, who took over the CEO position from DataTorrent founder Phu Hoang about a year ago, comes from an enterprise software background, and his time at BEA Systems, NetApp, LogLogic, and EMC has taught him a thing or two about what big companies expect out of their IT investments.

While “open source big data software” and “enterprise computing” aren’t mutually exclusive ideas in Churchward’s mind, they definitely face some steep compatibility issues. Churchward discussed his observations of the big data space with Datanami last week.

“One of the things that’s been haunting me is the amount of noise in the space and the amount of componentry, and frankly the difficulty in getting a real application into production,” Churchward said. “That’s kind of punishing to customers.”

The Hadoop and big data application landscape is “fraught with tons of components,” he said. “You end up then working with an engineering organization inside of a company that’s kicking the tires of 15 different types of solutions including 100 different types of components. When you step back, you realize there’s got to be a better way to do it.”

Fast, Cheap, or Good: Pick Two

The benefits of using big data software are too great for enterprises to pass up. Fortune 2000 firms would be foolish to ignore the energy going into the development of Hadoop, Spark, and Kafka, which are driving real innovation in storage, processing, and stream processing, respectively.

Complexity is a productivity killer in big data (Agor2012/Shutterstock)

However, the costs associated with actually building useful applications out of these open source projects are steep enough to cause heartburn in the CIO’s office. That has led many to question the benefits of Hadoop, which has been the standard-bearer for the big data style of computing, even though there’s really nothing comparable to replace it (at least for on-premises computing).

In Churchward’s view, enterprises have three paths to get in on the big data action and partake of the real benefits that it can deliver:

  1. Buy a commercial off-the-shelf system from a software vendor;
  2. Contract with a service provider to build it for them;
  3. Hire a bunch of engineers and build it themselves.

“Each one of these has a really good benefit but also side effects that are a bit painful,” he said. “If I take a custom application, I get enterprise grade capabilities, because the application has been around for a fair amount of time and I’ll get good time to value, but the problem is it’s not reusable because it’s a point solution.

“If I go to somebody like Hortonworks or Cloudera as a general practitioner, they can put an application together really well and I’m getting good time-to-value,” he continued. “But it’s not exactly so good for reusability either and my cost-of-ownership goes through the window.

“If I hire my own people, I can get an application which is reusable, but then enterprise-grade really goes out the window because they’re not used to stitching these components well. Time-to-value just doesn’t exist. It normally takes a couple of years for them to vet it in. That’s why you see a bunch of Hadoop applications kind of fail because each of the components are not designed to fit together. They’re designed as individual pieces.”

DataTorrent hopes to give customers a fourth option with Apoxi, a new application framework unveiled today with the launch of DataTorrent RTS version 3.10.

Apoxi is designed to bind various pre-selected components together so customers can create their own big data applications. The idea is that DataTorrent will do the hard work of integrating major components together in such a way that it creates a stable backplane upon which real-time, bi-directional big data applications can be built.

Loosely Coupled, Tightly Integrated

“Basically, it’s application glue,” Churchward said of Apoxi. “We’re saying, look, you can have different types of micro data services. Spark ML would be a data service. Apex is a data service. If Apex turns out not to be the resulting streaming engine, and maybe Flink turns out to be better, or Spark finally gets real time, or Beam happens in some respects, then you should be able to then plug it in.”
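To make the "plug it in" idea concrete, here is a minimal Python sketch of a pipeline whose streaming engine can be swapped without touching the rest of the application. It is not Apoxi's actual API, which the article does not describe; `StreamEngine`, `ApexLikeEngine`, `FlinkLikeEngine`, and `Pipeline` are hypothetical names invented for illustration.

```python
from typing import Iterable, List, Protocol


class StreamEngine(Protocol):
    """Hypothetical contract that every pluggable engine must satisfy."""

    name: str

    def process(self, events: Iterable[dict]) -> List[dict]:
        ...


class ApexLikeEngine:
    """Stand-in for one streaming engine (illustrative only)."""

    name = "apex-like"

    def process(self, events: Iterable[dict]) -> List[dict]:
        # Tag each event with the engine that handled it.
        return [dict(e, engine=self.name) for e in events]


class FlinkLikeEngine:
    """Stand-in for a replacement engine with the same contract."""

    name = "flink-like"

    def process(self, events: Iterable[dict]) -> List[dict]:
        return [dict(e, engine=self.name) for e in events]


class Pipeline:
    """Application code depends only on the StreamEngine contract."""

    def __init__(self, engine: StreamEngine) -> None:
        self._engine = engine

    def swap_engine(self, engine: StreamEngine) -> None:
        # Replace the engine; callers of run() are unaffected.
        self._engine = engine

    def run(self, events: Iterable[dict]) -> List[dict]:
        return self._engine.process(events)


pipeline = Pipeline(ApexLikeEngine())
first = pipeline.run([{"id": 1}])
pipeline.swap_engine(FlinkLikeEngine())
second = pipeline.run([{"id": 1}])
```

The design choice being illustrated is that the application holds a reference to an interface rather than to a specific engine, so replacing the engine is a one-line change for the caller.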

If another big data technology emerges and DataTorrent RTS customers discover they need it, the onus is on DataTorrent to do the engineering work to make it fit into Apoxi. And while Apoxi would function as the glue holding various pre-selected open source components together, it is not open source itself. “Our job is to basically say to an enterprise, we will grab the open source you like, we’ll encapsulate it, and we’ll drive toward an outcome,” Churchward said.

DataTorrent says Apoxi will abstract away some of the complexity in building big data apps

Apoxi will ship with Kafka, Apex, Spark, and Hadoop, the core “KASH” stack that provides the bulk of the functionality customers will need. On top of that, DataTorrent is supporting several lesser-known open source components in Apoxi, including Druid and Drools.

Druid, you will remember, is a column-oriented in-memory OLAP data store that Yahoo took open source three years ago. Drools, meanwhile, is a Java-based business rules engine developed by Red Hat that has advanced forward and backward chaining.

In addition to this architecture, Apoxi brings streaming capabilities, including store-and-replay functionality that lets users conduct a post-mortem analysis of events. It will also let customers conduct A/B testing with multiple engines in parallel. The software also supports Python and PMML for ingestion of machine learning models.
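The store-and-replay and parallel-engine testing ideas can be sketched in a few lines of Python. This is only an illustration of the general technique, not DataTorrent's implementation: `EventLog`, `compare_engines`, the `amount` field, and the two threshold rules are all assumptions made up for the example.

```python
from typing import Callable, Dict, List


class EventLog:
    """Hypothetical append-only store that keeps events for later replay."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> None:
        # Store a copy so later mutation by the caller cannot alter history.
        self._events.append(dict(event))

    def replay(self, start: int = 0) -> List[dict]:
        # Return copies of the stored events from the given offset onward.
        return [dict(e) for e in self._events[start:]]


def compare_engines(
    log: EventLog, engines: Dict[str, Callable[[dict], bool]]
) -> Dict[str, int]:
    """Replay the same events through each engine and count flagged events."""
    events = log.replay()
    return {
        name: sum(1 for e in events if flags(e))
        for name, flags in engines.items()
    }


log = EventLog()
for amount in (50, 150, 500):
    log.append({"amount": amount})

# Two illustrative rule variants evaluated against identical input.
results = compare_engines(
    log,
    {
        "A": lambda e: e["amount"] > 100,
        "B": lambda e: e["amount"] > 400,
    },
)
```

Because both variants see exactly the same replayed events, any difference in `results` comes from the engines themselves rather than from differences in the input.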

“Our idea is basically to have a universal snap-on architecture that enables these individual components to look like a synthetic application, but I want to make sure that none of the components are bound so tight that they can’t be replaced,” Churchward said. “This is what customers want. It’s the only way this space is actually going to be successful without extreme heavy lifting.”
