March 3, 2015

The 3 Key Steps to Building a Predictive App with Machine Learning

Alice Zheng

Machine learning is the technology that allows businesses to make sense of vast quantities of data, make better decisions, and ultimately bring better services to consumers. From personalized recommendations to fraud detection, from sentiment analysis to personalized medicine, machine learning provides the technology to adapt services to individual needs.

For all the value that it brings, machine learning technology has a high cost. Building a predictive application is a multi-stage, iterative process that requires a plethora of people, systems and skill sets. It can take months of collaboration among data scientists, software engineers and systems architects to build, deploy, and maintain a machine learning system that supports predictive applications. The relevant question today is, “Can we make the process of building a predictive application much faster and less painful?”

Building a Predictive Application

A predictive application is an app that can make predictions about the future via learned patterns from past data. Machine learning techniques and models are often employed to learn those patterns. For instance, a personalized recommender might involve a collaborative filtering model that learns “users who like ‘The Hobbit’ often watch ‘Star Wars’,” and recommends the other movie to users who have only watched one.
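To make the recommender example concrete, here is a minimal sketch that counts which movies appear together in the same user's history and recommends the most frequent companion. It uses plain item co-occurrence as a simple stand-in for a full collaborative filtering model, and the viewing histories and function names are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(histories):
    """For each movie, count how often every other movie shares a user's history."""
    co = defaultdict(Counter)
    for movies in histories.values():
        for a, b in permutations(set(movies), 2):
            co[a][b] += 1
    return co


def recommend(co, watched, k=1):
    """Return up to k unwatched movies that co-occur most with the watched ones."""
    scores = Counter()
    for movie in watched:
        for other, count in co.get(movie, {}).items():
            if other not in watched:
                scores[other] += count
    return [movie for movie, _ in scores.most_common(k)]


# Hypothetical viewing histories, invented for illustration.
histories = {
    "user_1": ["The Hobbit", "Star Wars"],
    "user_2": ["The Hobbit", "Star Wars"],
    "user_3": ["The Hobbit", "Alien"],
    "user_4": ["Star Wars"],
}

co = build_cooccurrence(histories)
print(recommend(co, {"The Hobbit"}))  # ['Star Wars']
```

A production recommender would also normalize for item popularity and handle users with no history, which this sketch leaves out.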

Building a predictive application is like driving a car. Black box APIs can get you to 40 mph in very little time: one can get a fully functional app quickly, but because it cannot be customized to specific datasets, it is difficult to push beyond 40 mph.

In order to get to 80 mph, one has to put in a lot more effort. For instance, in order to build an accurate recommendation engine, the machine learning model needs to be tailored to the specific characteristics of the end user and the business. Some models are faster but less accurate; others are fast but produce recommendations whose rationale is difficult to understand; still other models may be fast and easy to interpret but don’t work as well for rare items or users. Picking the right model and customizing it for a dataset requires time, expertise and computational power.

The good news, however, is that a new generation of machine learning platforms is on the horizon. These systems promise to take the user from 0 to 60 mph, with minimal effort and expertise. They are easy to use, require less machine learning expertise to get started, are system independent, and are tightly integrated with popular data stores and production environments.

There are roughly three stages in building a predictive application: data engineering, data intelligence, and deployment. Each stage has a different set of technical challenges and requires different tools.

Data Engineering

This stage encompasses all operations between raw data ingestion and predictive model building. It includes cleaning the data and transforming raw data into what are known as “features” – numeric or categorical values that describe useful attributes of the data. For instance, if the task is to recommend movies, a useful feature might be the name of the movie director mapped to a unique ID.
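As a rough illustration of the director example, the sketch below maps each distinct director name to an integer ID and emits one feature record per movie. The helper name `build_id_map` and the three sample records are assumptions made for this example, not part of any particular library.

```python
def build_id_map(values):
    """Assign each distinct value a unique integer ID in order of first appearance."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids))
    return ids


# Illustrative raw records.
movies = [
    {"title": "The Hobbit", "director": "Peter Jackson", "year": 2012},
    {"title": "Star Wars", "director": "George Lucas", "year": 1977},
    {"title": "King Kong", "director": "Peter Jackson", "year": 2005},
]

director_ids = build_id_map(m["director"] for m in movies)

# One feature record per movie: a categorical ID plus a numeric value.
features = [
    {"director_id": director_ids[m["director"]], "year": m["year"]}
    for m in movies
]
print(features)
```

Both movies by the same director receive the same ID, which is what lets a model learn a per-director effect.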

Data engineering often involves database operations such as join, groupby, sort, or indexing as in the example above. But it doesn’t stop there. Modern machine learning methods often require statistical features that are beyond simple database operations. For instance, a text document is often converted into a collection of frequency counts of the unique words that appear in the document–what is known as the bag-of-words representation.
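A bag-of-words representation can be sketched in a few lines. The version below lowercases the text, splits it into word tokens with a deliberately simple regular expression (an assumption; real tokenizers are more careful), and counts each unique word.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a mapping from each unique lowercase word to its frequency."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("The hobbit, the ring.")
print(counts["the"], counts["hobbit"], counts["ring"])  # 2 1 1
```

Word order is discarded entirely; only the counts survive, which is why the representation is called a "bag."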

The computational challenge with data engineering lies with the large amount of raw input data, often distributed across many nodes. While some data engineering operations are highly data parallel (e.g., computing the total number of click-throughs for an ad), others require communication between all nodes (e.g., sorting user IDs by their creation time). Hence, the data engineering stage must efficiently sift through large input data, often in a distributed fashion.
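The difference between the two kinds of operations can be sketched on a single machine by treating each inner list as the data held by one node. Counting click-throughs needs only one number from each node, while a global sort needs a merge step that looks across all nodes. The record layout and field names here are hypothetical.

```python
import heapq

# Each inner list stands in for the rows stored on one node (hypothetical data).
partitions = [
    [{"user_id": "u3", "created": 30, "clicked": True},
     {"user_id": "u1", "created": 10, "clicked": False}],
    [{"user_id": "u4", "created": 40, "clicked": True},
     {"user_id": "u2", "created": 20, "clicked": True}],
]


def local_click_count(partition):
    """Data-parallel step: each node counts its own clicks independently."""
    return sum(1 for row in partition if row["clicked"])


# Only one integer per node has to be communicated and summed.
total_clicks = sum(local_click_count(p) for p in partitions)


def by_creation_time(row):
    return row["created"]


# Sorting is different: each node can sort locally, but producing the
# global order requires merging rows from every node.
locally_sorted = [sorted(p, key=by_creation_time) for p in partitions]
globally_sorted = list(heapq.merge(*locally_sorted, key=by_creation_time))
order = [row["user_id"] for row in globally_sorted]

print(total_clicks, order)  # 3 ['u1', 'u2', 'u3', 'u4']
```

In a real cluster the merge step moves whole rows over the network, which is why sort-like operations tend to dominate the cost of distributed data engineering.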

Many tools exist for data engineering. Databases such as SQL Server and Apache Hive provide common database operations. For more advanced statistical operations, one can use Pandas dataframes, Scikit-learn feature transformation pipelines, or numerous packages in R. A new machine learning platform from my company handles medium to large amounts of data and gracefully extends from in-memory to on-disk computation. Apache Spark’s Resilient Distributed Dataset (RDD) data structure offers fully distributed, large-scale data engineering.

Data Intelligence

Once raw data is transformed into features, it is ready for the next stage: building predictive models, which we are calling “data intelligence.” There are many kinds of machine learning models, and there are different ways of learning or training those models. Different models and model training methods have different computational demands and data access patterns. Some methods are easier to parallelize than others. Some scan through the data linearly, while others require full random access. No matter what model or training method one uses, the key to this stage of the game is fast iteration: trying out many models quickly, then settling on one that best represents the data and makes accurate predictions.
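The iteration loop can be sketched as: fit every candidate on a training split, score each on held-out data, and keep the best. The two candidate models below (a constant mean predictor and a least-squares line) and the noiseless synthetic data are assumptions chosen to keep the example small.

```python
def fit_mean(train):
    """Baseline model: always predict the mean target seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, data):
    """Mean squared error of a fitted model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Synthetic, noiseless data for illustration: y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(20)]
train, valid = data[::2], data[1::2]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: mse(fit(train), valid) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Adding another candidate is a one-line change to `candidates`, which is the property that makes fast iteration possible.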

The task of training many different machine learning models on the same dataset is highly parallelizable, and is often done in a distributed fashion. Training a single machine learning model, however, often involves computations that are not data parallel, which means it is difficult to distribute and is much better done on a single machine. Fortunately, featurized data is much smaller than raw data, and therefore much more feasible to process on a single machine. Even when that is not the case, one can resort to taking a random subsample of the data. Thus the basic computational paradigm in the data intelligence phase is Medium Compute, not Big Compute.
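The pattern described here, subsample first and then train several models side by side, might look like the sketch below. The two "models" are trivial constant predictors standing in for real ones, and a thread pool is used only to show the structure; in CPython, CPU-bound training would normally run in separate processes or on separate machines.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def subsample(rows, fraction, seed=0):
    """Take a reproducible random subsample so the data fits on one machine."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


# Stand-in "training" functions: each returns a fitted constant predictor.
TRAINERS = {"mean": statistics.mean, "median": statistics.median}


def train_all(rows):
    """Train every candidate independently; no trainer depends on another."""
    with ThreadPoolExecutor(max_workers=len(TRAINERS)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in TRAINERS.items()}
        return {name: future.result() for name, future in futures.items()}


rows = list(range(1000))        # hypothetical featurized target values
sample = subsample(rows, 0.1)   # keep 10 percent
models = train_all(sample)
print(len(sample), sorted(models))
```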

A variety of tools are available for building machine learning models. R and Scikit-learn both offer a wide variety of machine learning models. Vowpal Wabbit is an open source command line program for a subclass of models known as “online learning.” GraphLab Create, Azure ML and H2O are newer platforms that offer a variety of models. GraphLab Create has a multi-tiered API that spans from the expert user to the novice user. H2O offers a web interface for model building. All of the above packages also offer multiple model evaluation methods that allow the user to understand how well the model is performing.
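As one example of a model evaluation method, precision and recall for a binary classifier can be computed directly from true and predicted labels. This is a minimal sketch with made-up labels; the packages above provide their own, more complete evaluation functions.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall, treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```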


Deployment

Deployment involves making the model available to serve incoming requests. The challenges here revolve around scale, latency, performance monitoring and incorporating real-time feedback. Predictions need to be served quickly to many end users. The app needs to be monitored during usage for unexpected drops in accuracy or throughput and for spikes in latency. The models also need to track end-user responses so that they can be improved over time; this requires incorporating feedback and updating the deployed models.
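The monitoring and feedback requirements can be illustrated with a thin wrapper around a model. The sketch below records per-request latency, logs whether each prediction matched the eventual outcome, and flags the model for retraining when rolling accuracy falls below a threshold. The class name, the window size, and the threshold are all assumptions for this example; a real service would sit behind a network API.

```python
import time
from collections import deque


class MonitoredModel:
    """Wrap a prediction function with latency and accuracy tracking."""

    def __init__(self, model, window=100, min_accuracy=0.8):
        self.model = model
        self.min_accuracy = min_accuracy
        self.latencies = deque(maxlen=window)  # seconds per recent request
        self.outcomes = deque(maxlen=window)   # True where prediction was right

    def predict(self, features):
        start = time.perf_counter()
        prediction = self.model(features)
        self.latencies.append(time.perf_counter() - start)
        return prediction

    def record_feedback(self, prediction, actual):
        """Store whether the end user's actual response matched the prediction."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        accuracy = self.rolling_accuracy()
        return accuracy is not None and accuracy < self.min_accuracy


# Hypothetical model: predict a click when the score exceeds 0.5.
service = MonitoredModel(lambda f: f["score"] > 0.5, window=4, min_accuracy=0.75)
for score, actual in [(0.9, True), (0.2, False), (0.7, False), (0.6, False)]:
    prediction = service.predict({"score": score})
    service.record_feedback(prediction, actual)

print(service.rolling_accuracy(), service.needs_retraining())  # 0.5 True
```

The fixed-size windows keep memory bounded and make the accuracy figure reflect recent traffic, so a sudden drop shows up quickly.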

Microsoft’s Azure ML builds and deploys machine learning models on the Azure cloud. In the Python ecosystem, yHat offers deployment of Scikit-learn models with a RESTful API. Dato’s GraphLab Create offers a RESTful, distributed predictive service on the Amazon EC2 cloud. Revolution R, from Revolution Analytics (recently acquired by Microsoft), offers model deployment in the R ecosystem.

Where do we go from here?

As more businesses turn to predictive applications as the next frontier, next-generation machine learning platforms strike a balance between ease of use and customizability. From data engineering to data intelligence and deployment, these tools promise to bridge the gap between prototyping and production, and gracefully transition from Big Data and Big Compute to Medium Data and fast iteration. Get ready for the future.

About the author: Alice Zheng is Director of Data Science at Dato (formerly known as GraphLab), where she helps the company build a fast and scalable machine learning platform for predictive applications. Prior to Dato, Alice was a researcher in the Machine Learning Group at Microsoft Research, Redmond. Before joining Microsoft, Alice was a postdoc at Carnegie Mellon University. She received her B.A. and Ph.D. degrees from U.C. Berkeley.
