Glimpsing Hadoop’s Real-Time Analytic Future
There’s been a lot said about the need to move Hadoop away from the batch paradigm and remake it as a real-time system. But how will that work when it comes to heavy-duty machine learning models and predictive analytics? That area hasn’t been fleshed out entirely, and it is one where Cloudera may help pave the way with its Oryx project.
The need for real-time predictive analytics is front and center in the mind of Sean Owen, Cloudera’s director of data science, who joined Cloudera after it acquired his machine learning startup Myrrix in 2013. The predictive algorithms Owen implemented at Myrrix form the heart of Oryx, which is proof-of-concept level code that Cloudera customers can play around with, but not get support for.
Owen draws a clear line between where the Hadoop world is right now with regard to predictive analytics and where he thinks it needs to go. He starts off a recent discussion about this with Datanami by getting the terms straight. “People talk about machine learning and data science, and it means a lot of things to a lot of people,” he says. “One distinction I like to draw in the conversation is between exploratory analytics and operational analytics.”
Most of the time, people are referring to exploratory analytics, i.e., discovering new insights from one’s data. When it comes time to actually do something with that insight and put it into production, that’s where operational analytics comes into play. “So once I built my model, I need to do something with it. I need to query it,” he says. “That’s where a lot of the activity is in Hadoop these days, because there hasn’t been clear cut answer for how to do operational analytics so far on Hadoop.”
|The Cloudera Oryx architecture|
Other popular machine learning libraries, such as Apache Mahout and Spark’s MLlib, are all about batch-oriented model building in Hadoop. “And a lot of these startups again seem to be about building the model and only some of them are getting into this other question, which is how do I score models? How do I query models at runtime? How do I make that work?” he says.
The lack of real-time operational analytics on Hadoop is a major issue, and one that’s keeping the framework from achieving its potential, Owen says. There is no lack of Hadoop applications built on top of machine learning algorithms. But they lack real-time capabilities, he says.
“To make a lot of these systems relevant to do the kinds of things we need to do today with machine learning, they have to be basically real time and that requires a bit of different architecture and different ways of thinking about separating the model building from the serving,” he says. “That may sound like a trivial detail, but it’s not really. It’s actually in some ways quite the hard part, to do it fast and to do it at scale.”
The asynchronous nature of operational analytics on Hadoop today prevents organizations from implementing all sorts of potential systems. For example, think of what a company could do if they started learning about website visitors immediately upon their first visit to a website, instead of crunching the data after the fact. “When it comes down to it, a lot of people need this and don’t even realize this is a need until they get pretty far along,” Owen says. “You can’t wait a minute or an hour or a day to rebuild the model and compute the results.” In fact, you can’t even wait a second; the response time needs to be measured in milliseconds, he says.
|Cloudera director of data science Sean Owen|
At this point in the story, you might think that Cloudera would be presenting Oryx as the answer for the real-time component of operational analytics. That’s only partially true. Owen is bluntly honest about Oryx and its prospects for the future. The machine learning library, in fact, doesn’t do many things. It does classification, regression, clustering, and collaborative filtering. That, in itself, is not remarkable.
What’s notable about Oryx is that it combines batch-oriented model building with the real-time serving and scoring needed to put the insights gleaned from hard-core data science into production. The idea behind Oryx is to start the conversation around how a real-time operational analytics system on Hadoop might look and behave.
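To make the split between model building and serving concrete, here is a minimal Python sketch. It is not Oryx code and does not use the Oryx API; the names `build_model` and `ServingLayer`, and the simple item co-occurrence model, are hypothetical and chosen only for illustration. The batch step builds a model from historical events, while the serving step keeps that model in memory, answers queries immediately, and folds in new events without waiting for the next rebuild.

```python
from collections import defaultdict


def build_model(events):
    """Batch step (hypothetical): count how often two items share a user."""
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)
    cooccur = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a in items:
            for b in items:
                if a != b:
                    cooccur[a][b] += 1
    return items_by_user, cooccur


class ServingLayer:
    """Serving step (hypothetical): query and update the in-memory model."""

    def __init__(self, model):
        self.items_by_user, self.cooccur = model

    def update(self, user, item):
        """Fold one new event into the model without a batch rebuild."""
        seen = self.items_by_user[user]
        if item in seen:
            return
        for other in seen:
            self.cooccur[item][other] += 1
            self.cooccur[other][item] += 1
        seen.add(item)

    def recommend(self, user, n=3):
        """Rank unseen items by co-occurrence with the user's items."""
        seen = self.items_by_user.get(user, set())
        scores = defaultdict(int)
        for item in seen:
            for other, count in self.cooccur.get(item, {}).items():
                if other not in seen:
                    scores[other] += count
        # Highest score first; ties broken by item name for stable output.
        return sorted(scores, key=lambda i: (-scores[i], i))[:n]


# Batch build from historical events, then serve queries from memory.
events = [("ann", "a"), ("ann", "b"), ("bob", "a"), ("bob", "b"), ("bob", "c")]
serving = ServingLayer(build_model(events))
first_visit = serving.recommend("dan")   # nothing known about a new visitor yet
serving.update("dan", "c")               # one click arrives in real time
after_click = serving.recommend("dan")   # recommendations change immediately
```

In a production system the batch step would run on Hadoop and the serving step would sit behind a low-latency API; the sketch only shows why the two have different performance requirements.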
Whether Oryx continues in its current form is not certain. The jury is still out as to whether Oryx gets subsumed into another open source project, goes away entirely, or perhaps becomes the Apache-level focal point for the work that Owen says needs to happen to give Hadoop the kind of real-time operational analytics that he says is required.
Whatever the future holds, chances seem good at this point that much of the operational analytics will be running within Apache Spark. Spark has its own suite of algorithms, the MLlib library, although it’s a bit green still. Owen says he’d love to see Oryx rewritten to run atop the Spark framework.
“We do tend to suggest people look at Spark going forward for a number of reasons,” Owen says. “It’s better for some kinds of machine learning…It’s a more general platform for doing a lot of related tasks around machine learning, like data transformation, and even some of the real-time stuff as well. Spark is going to be a good base for the model building. I certainly would rather stop putting energy into a separate project for model building, and put it into Spark, and then rewrite whatever we have in terms of Spark.”
Oryx is currently hosted on GitHub at github.com/cloudera/oryx.