How To Avoid the Technical Debt of Machine Learning
Machine learning provides us with an extremely powerful mechanism for building personalized, data-driven applications. However, that power doesn’t come without costs or risks. Unfortunately, the downsides of long-running machine learning pipelines are all too often hidden from view.
Google (NASDAQ: GOOG) recently explored these topics in a paper titled Machine Learning: The High-Interest Credit Card of Technical Debt, which has generated some interesting discussion among users of machine learning.
As practitioners, we build tools and models to solve business problems, often without thinking much about the long-term maintenance of those models. Those of us who have been running machine learning pipelines in production for a few years have certainly encountered the problems the paper outlines. The Google paper provides some excellent recommendations for building production pipelines without accumulating technical debt. Here are a few additional solutions to the debt problem, along with general best practices for implementing machine learning models.
Use Source Control
The first step of any software development project is to set up a source code repository, and the same applies to machine learning projects. It is incredibly easy to get started with an interactive notebook today. Using Apache Zeppelin or Spark Notebook, a data scientist can train a model right in a web browser.
The problem is that the code used to generate the model, along with all of its assumptions, remains locked inside the notebook. All changes to the model beyond simple prototypes need to be checked in to a source code repository and, ideally, peer reviewed by the team.
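As a minimal sketch of this practice, the script below (all names are illustrative, not from any particular project) stamps each saved model with the commit hash of the code that produced it, so a model running in production can always be traced back to a reviewed revision. Calling `current_commit` assumes the script runs from inside a git checkout; callers such as a CI job can also pass the commit explicitly.

```python
import json
import subprocess

def current_commit():
    """Hash of the checked-in training code; assumes this script runs
    from inside a git checkout."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def save_model(params, path, commit=None):
    """Persist model parameters alongside the code revision that
    produced them, so any production model traces back to a reviewed
    commit."""
    record = {"params": params, "code_rev": commit or current_commit()}
    with open(path, "w") as f:
        json.dump(record, f)
    return record
```

A notebook is fine for exploration, but once this script lives in the repository, every change to it goes through the normal review workflow.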
Operate Over Immutable Data
It’s great to have the code in source control, but if the data the model is trained on is mutable, the code can be misleading. Mutable data could be, for example, a view generated from source data by some external ETL process. Systems for maintaining version control over data are not commonplace yet, so we have to rely on conventions: the machine learning project should operate over source log data that is agreed to be append-only.
This way a model can be retrained at any time in the future from the source code and source data, and the result will be exactly the same as the original (unless training relies on randomization, in which case the result will be slightly different but should maintain the same properties). Any feature generation or transformation steps should be included in the pipeline code as dependencies.
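A toy sketch of this reproducibility property, with a purely illustrative “training” step rather than a real learning algorithm: given the same append-only records and the same seed, retraining yields an identical model.

```python
import random

def train_weights(log_records, seed=42):
    """Toy stand-in for model training over append-only log records.
    Deterministic given the same data and seed; record and function
    names are illustrative."""
    rng = random.Random(seed)                  # pin the randomized part
    features = [len(r) for r in log_records]   # feature generation lives in pipeline code
    weight = rng.random()                      # e.g. a randomized initialization
    return [weight * f for f in features]

# Retraining from the same source data reproduces the original model.
log = ["user=1 click", "user=2 view", "user=1 purchase"]
assert train_weights(log) == train_weights(log)
```

If the randomization were not pinned (or the underlying data could change), two runs of the same committed code could silently produce different models.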
Once the model has been trained and put into production, labeled data (in the case of supervised learning) may keep getting added to the system. One can set up a process to rerun the validation step on the new data and report typical metrics (accuracy, precision, and recall), as well as model-specific metrics. Any significant deviation from expected results could indicate a problem with the data or the model, and should trigger further investigation.
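A minimal sketch of such a check in plain Python, independent of any particular ML library (the metric names match those above; the deviation tolerance is an illustrative choice):

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute the typical validation metrics over binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(y_true)

def check_drift(current, baseline, tolerance=0.05):
    """Return names of metrics that fell more than `tolerance` below
    baseline (tuples ordered as precision, recall, accuracy)."""
    names = ("precision", "recall", "accuracy")
    return [n for n, cur, base in zip(names, current, baseline)
            if base - cur > tolerance]
```

Any metric name returned by `check_drift` is the signal to start the investigation described above.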
Data is always changing – new product features are added, old ones removed, marketing campaigns bring in different groups of users to the product, and so on. A static model trained once is not going to capture these changes. Therefore it is important to retrain the model on new data on a regular basis – assuming you’re not using an online model – while keeping an eye on the model metrics.
If retraining yields a lower accuracy, precision, or recall, perhaps some assumptions or parameters need to be revised.
Data scientists want to work on new and exciting problems, not rehash the same model over and over. So it is possible, and even healthy, to automate retraining, publishing, and monitoring of the pipeline as much as possible. Typical software engineering best practices, such as continuous integration and continuous delivery with automated testing, apply here. A single parameter change committed to version control should trigger a deployment pipeline that retrains the model, verifies the results, uploads the model to production, and rolls back if the metrics show a problem.
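The deployment gate just described can be sketched as follows. The `train`, `evaluate`, `publish`, and `rollback` callables are hypothetical hooks into your own infrastructure, and the regression tolerance is an illustrative choice:

```python
def run_pipeline(train, evaluate, publish, rollback, baseline, tolerance=0.02):
    """Automated deploy gate: retrain, verify metrics against the
    previous model's baseline, then publish, or roll back on a
    regression. Callables are placeholders for real infrastructure."""
    model = train()
    metrics = evaluate(model)
    regressed = [k for k in baseline
                 if baseline[k] - metrics.get(k, 0.0) > tolerance]
    if regressed:
        rollback()
        return "rolled_back", regressed
    publish(model)
    return "published", []
```

Wiring this into a CI/CD system means a committed parameter change either ships a verified model or leaves production untouched, with no manual babysitting in between.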
Using these techniques, along with the ones described in the paper, should increase the quality of the machine learning pipelines you put into production and leave you with far fewer head-scratchers when you suddenly discover a pipeline has been producing wrong predictions for several weeks.
About the author: Dan Osipov is a principal consultant for Applicative LLC, focused on helping companies tackle data engineering challenges. His expertise includes building pipelines and streaming systems. He sometimes tweets at @danosipov.