February 9, 2015

The Rise of Predictive Modeling Factories

Alex Woodie

So you installed Hadoop and built a data lake that can store petabytes of data. Now what? According to leaders in predictive analytics, the best thing you can do is to build a predictive model factory that takes much of the drudgery out of running machine learning algorithms at scale.

“Every data lake needs a predictive modeling factory,” says SriSatish Ambati, the co-founder and CEO of H2O, a developer of in-memory machine learning technology. “Predictive analytics as a whole is burgeoning. It’s beginning to dawn on the world that we can actually use this data to predict the future. And the next level of that is how do I automate [predictive analytics] without having to do a bunch of manual stuff.”

In Ambati’s view, the rise of predictive modeling factories will eliminate the need for data scientists to do the boring management of data and models, and instead let them focus their effort on what really matters: figuring out the right questions to ask the predictive models.

“The model factories bring real automation to a space that was historically manual,” says Ambati, who also co-founded Platfora. “I think today the vast bulk of data science work is trying to move files from one tool to another, and trying to shape different data sets to manually run them through models. The first priority is to remove the human error, but it’s also really boring and painful stuff. The really hard stuff, which is asking good business questions, is still a human endeavor.”

Data scientists today often struggle to manage predictive models that number in the dozens or hundreds. But in the future of model factories, software will manage the models for the data scientist, and thereby allow them to run ensembles of models numbering in the tens of thousands. That will allow organizations to iterate more quickly and get better and more detailed predictions, just like Google and Facebook, but without the army of engineers.

“We need to push for higher accuracy, for more models than what’s possible running them manually,” Ambati tells Datanami. “I think generating thousands of these models and then triangulating the true prediction is what we envision in this space.”
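The idea of generating many models and combining their outputs can be illustrated with a small sketch. The code below is not H2O's method. It is a minimal bagging example that assumes a toy one-variable data set, fits each model on a bootstrap resample, and averages the predictions. The helper names (fit_line, train_ensemble, predict) and the synthetic data are hypothetical, chosen only for illustration.

```python
import random
import statistics


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x identical): fall back to a flat line.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def train_ensemble(points, n_models, rng):
    """Fit one model per bootstrap resample (sampling with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_line(sample))
    return models


def predict(models, x):
    """Combine the ensemble by averaging the individual predictions."""
    return statistics.fmean([a + b * x for a, b in models])


# Hypothetical toy data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(20)]

models = train_ensemble(data, n_models=1000, rng=rng)
ensemble_prediction = predict(models, 10)
print(f"{len(models)} models, averaged prediction at x=10: {ensemble_prediction:.2f}")
```

Averaging is only one way to combine models. A production model factory would also need to track, version, and score each model, which this sketch leaves out.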

Modeling Monster Ensembles

Alex Gray, the CTO and co-founder of Skytree, agrees with the general concept of a predictive modeling factory. “People have been inching along with the previous generation of tools, kind of keeping it all in their heads, or often using a notebook,” he says. “Once you get to 100 [models], that’s where the whole system breaks.”

Later this week, Skytree will unveil a new release of its in-memory predictive analytics technology that brings new features in the area of model management. The upgrade will automate many of the non-glamorous tasks that consume a lot of data scientists’ time, including tuning parameters and selecting the best modeling methods.

Gray, who is also an associate professor at Georgia Institute of Technology, is a bit befuddled why there hasn’t been more progress in the area of ensemble modeling or managing “monster” models. “It’s considered well-understood mathematically,” Gray says. “There’s a number of techniques for doing it. But I would say it’s not mature from a software perspective, oddly enough. You can always do it manually. But it’s not actually mature from a software standpoint. If I want to make a big ensemble, I have to do a lot of manual work to do that.”

Skytree brings its own in-memory predictive analytics layer and proprietary machine learning algorithms to Hadoop, which the company is committed to as the platform for the big data stack. With the upcoming release, the company will be encouraging data scientists to re-think how they perceive predictive modeling in Hadoop.

“Right now data scientists have to think of many conceptual layers because there aren’t adequate tools that abstract away the lower layers,” Gray says. “Nobody really, at a machine learning level, wants to think about these low Hadoop layers. You do because you have to. Our approach is to abstract that so it just works. It does what you expect and hope what it does, which is simply to supply good computation performance at the right time. But other than that, you shouldn’t have to worry about where your data is and all that stuff. You should be thinking of the modeling layer, model aggregation, and model management. That’s the focus of our upcoming UI.”

Throwing Models at the Data

The rise of predictive model factories is not a foregone conclusion, but it appears to be taking shape alongside the rise of Hadoop and Spark, and the overall simplification of big data analytic tooling. As predictive analytics moves beyond early adopters, such as Wall Street firms doing algorithmic trading and credit card companies trying to gauge risk, we’re seeing the power of machine learning algorithms getting used in new ways.

Tye Rattenbury, a data scientist at Trifacta, doesn’t believe that every step in the predictive analytics process can be automated. Many of the transformation steps on big data can be automated, of course–that’s where Trifacta’s tools come in. But in Rattenbury’s view, there is still a big need for human eyes to oversee the overall construction and debugging of modeling environments, especially the larger, more complex ones involving millions of features and variables.

But he does see machine learning technology becoming more pervasive. “You probably can’t automate the whole thing, but what we are seeing is these kinds of predictive models being applied much earlier in the process and that workload is shifting,” Rattenbury says. “That is something we are seeing, where people are analyzing the results more than they are thinking about the models to build up front.”

These days, it’s not uncommon for data scientists to use K-means and generalized linear modeling (GLM) algorithms just to pull structure out of a data set.

“Those kinds of algorithms are just going to be thrown immediately at data sets and people are going to sit around and try to figure out what came out after running them,” he says. “You’re seeing companies like 0xdata [now known as H2O] and GraphLab [now known as Dato], who are basically building these environments where you can just throw a bunch of models at a data set.”
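As a rough illustration of what it means to throw K-means at a data set to pull out structure, here is a minimal sketch of the standard Lloyd's iteration in plain Python. It assumes two synthetic, well-separated clusters of 2-D points. The kmeans helper and the data are hypothetical and do not reflect any vendor's implementation.

```python
import math
import random


def kmeans(points, k, rng, iterations=20):
    """Lloyd's algorithm: assign points to the nearest center, then re-average."""
    centers = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


# Hypothetical data: two Gaussian blobs centered near (0, 0) and (5, 5).
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(50)]

centers = sorted(kmeans(blob_a + blob_b, k=2, rng=rng))
print("recovered cluster centers:", centers)
```

The recovered centers land close to the two blob means without any labels being supplied, which is the kind of structure an analyst would then inspect and interpret.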

What’s In a Name

The term predictive modeling factory may or may not take off, but it’s clear to Skytree’s Gray that something like it is definitely on the horizon. “It goes hand in hand with just using machine learning to try to do something critical in the business. You are going to generate many, many models,” he says. “Whatever you call it, it’s basically about getting serious and starting to execute on what a lot of companies come to us for, which is, in some sense, very explicitly, to transform the whole company to become data driven.”

In the mind of H2O’s Ambati, the concept of a predictive modeling factory goes well beyond data exploration and firmly into the world of production. Being able to iterate quickly on thousands of models will require being able to train and score models simultaneously. This approach allows Cisco (an H2O customer) to run 60,000 propensity-to-buy models every three months, and allows Google not only to have a model for every individual, but to have multiple models for every person based on the time of day.

“Machine learning is the new SQL,” H2O’s Ambati says. “With large data, we’re able to pick up signals a lot more loudly. There’s more noise in the data…but large data sets are necessary for getting realistic results out of algorithms. Machine learning is now mostly a side effect of having enough data, enough dimensions that you can actually pick up signals from noise.”

The rapid maturation of core machine learning technology, the explosion of data, and the rise of new computer processing architectures like Hadoop and Spark are all coming together at this moment, and the possibilities are very exciting for folks like Ambati.

“There are substantial new innovations in the last 15 years that have led to real improving in the underlying mathematics [of machine learning], and the hardware and software innovation from the computer science side, which allows us to process large data, in relatively faster times, whether it’s in the form of in-memory or faster processors or faster networks, where we can connect hundreds or thousands of machines,” Ambati says. “That’s the bigger underlying story of machine learning coming into prominence, is now we have the math and the data to go after real analysis.”
