How To Avoid the Technical Debt of Machine Learning
Machine learning provides us with an extremely powerful mechanism for building personalized, data-driven applications. However, that power doesn’t come without costs or risks. Unfortunately, the downsides of long-running machine learning pipelines are all too often hidden from view.
Google (NASDAQ: GOOG) recently explored these topics in a paper titled Machine Learning: The High-Interest Credit Card of Technical Debt, which has generated some interesting discussion among users of machine learning.
As practitioners, we build tools and models to solve business problems, often without thinking much about the long-term maintenance of those models. Those of us who have been running machine learning pipelines in production for a few years have certainly encountered the problems the paper outlines. The Google paper provides some excellent recommendations for building production pipelines without accumulating technical debt. Here are a few additional solutions to the debt problem, along with general best practices for implementing machine learning models.
Use Source Control
The first step of any software development project is to set up a source code repository, and the same applies to machine learning projects. It is incredibly easy to get started with an interactive notebook today. Using Apache Zeppelin or Spark Notebook, a data scientist can train a model right in a web browser.
The problem is that the code used to generate the model, along with all of its assumptions, remains locked inside the notebook. All changes to the model beyond simple prototypes need to be checked in to a source code repository and, ideally, peer reviewed by the team.
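As a minimal sketch of this practice, the script below (all names are illustrative, not from any particular project) stamps each saved model with the commit hash of the code that produced it, so a model running in production can always be traced back to a reviewed revision. Calling `current_commit` assumes the script runs from inside a git checkout; callers such as a CI job can also pass the commit explicitly.

```python
import json
import subprocess

def current_commit():
    """Hash of the checked-in training code; assumes this script runs
    from inside a git checkout."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def save_model(params, path, commit=None):
    """Persist model parameters alongside the code revision that
    produced them, so any production model traces back to a reviewed
    commit."""
    record = {"params": params, "code_rev": commit or current_commit()}
    with open(path, "w") as f:
        json.dump(record, f)
    return record
```

A notebook is fine for exploration, but once this script lives in the repository, every change to it goes through the normal review workflow.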
Operate Over Immutable Data
It’s great to have the code in source control, but if the data the model is trained on is mutable, the code can be misleading. Mutable data could be, for example, a view generated from source data by some external ETL process. Systems for maintaining version control over data are not commonplace yet, so we have to rely on conventions: the machine learning project should operate over source log data that is agreed to be append-only.
This way a model can be retrained at any time in the future from the source code and source data, and the result will be exactly the same as the original (unless training relies on randomization, in which case the result will be slightly different but should maintain the same properties). Any feature generation or transformation steps should be included in the pipeline code as dependencies.
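A toy sketch of this reproducibility property, with a purely illustrative “training” step rather than a real learning algorithm: given the same append-only records and the same seed, retraining yields an identical model.

```python
import random

def train_weights(log_records, seed=42):
    """Toy stand-in for model training over append-only log records.
    Deterministic given the same data and seed; record and function
    names are illustrative."""
    rng = random.Random(seed)                  # pin the randomized part
    features = [len(r) for r in log_records]   # feature generation lives in pipeline code
    weight = rng.random()                      # e.g. a randomized initialization
    return [weight * f for f in features]

# Retraining from the same source data reproduces the original model.
log = ["user=1 click", "user=2 view", "user=1 purchase"]
assert train_weights(log) == train_weights(log)
```

If the randomization were not pinned (or the underlying data could change), two runs of the same committed code could silently produce different models.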
Once the model has been trained and put into production, labeled data (in the case of supervised learning) may keep getting added to the system. One can set up a process to rerun the validation step on the new data and report typical metrics (accuracy, precision, and recall), as well as model-specific metrics. Any significant deviation from expected results could indicate a problem with the data or the model, and should trigger further investigation.
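A minimal sketch of such a check in plain Python, independent of any particular ML library (the metric names match those above; the deviation tolerance is an illustrative choice):

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute the typical validation metrics over binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(y_true)

def check_drift(current, baseline, tolerance=0.05):
    """Return names of metrics that fell more than `tolerance` below
    baseline (tuples ordered as precision, recall, accuracy)."""
    names = ("precision", "recall", "accuracy")
    return [n for n, cur, base in zip(names, current, baseline)
            if base - cur > tolerance]
```

Any metric name returned by `check_drift` is the signal to start the investigation described above.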
Data is always changing – new product features are added, old ones removed, marketing campaigns bring in different groups of users to the product, and so on. A static model trained once is not going to capture these changes. Therefore it is important to retrain the model on new data on a regular basis – assuming you’re not using an online model – while keeping an eye on the model metrics.
If retraining yields a lower accuracy, precision, or recall, perhaps some assumptions or parameters need to be revised.
Data scientists want to work on new and exciting problems, not rehash the same model over and over. So it is possible, and even healthy, to automate retraining, publishing, and monitoring of the pipeline as much as possible. Typical software engineering best practices, such as continuous integration and continuous delivery with automated testing, apply here. A single parameter change committed to version control should trigger a deployment pipeline that retrains the model, verifies the results, uploads the model to production, and rolls back if the metrics show a problem.
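The deployment gate just described can be sketched as follows. The `train`, `evaluate`, `publish`, and `rollback` callables are hypothetical hooks into your own infrastructure, and the regression tolerance is an illustrative choice:

```python
def run_pipeline(train, evaluate, publish, rollback, baseline, tolerance=0.02):
    """Automated deploy gate: retrain, verify metrics against the
    previous model's baseline, then publish, or roll back on a
    regression. Callables are placeholders for real infrastructure."""
    model = train()
    metrics = evaluate(model)
    regressed = [k for k in baseline
                 if baseline[k] - metrics.get(k, 0.0) > tolerance]
    if regressed:
        rollback()
        return "rolled_back", regressed
    publish(model)
    return "published", []
```

Wiring this into a CI/CD system means a committed parameter change either ships a verified model or leaves production untouched, with no manual babysitting in between.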
Using these techniques, along with the ones described in the paper, should increase the quality of the machine learning pipelines you put into production and leave you with far fewer head-scratchers when you suddenly discover a pipeline has been producing wrong predictions for several weeks.
About the author: Dan Osipov is a principal consultant for Applicative LLC, focused on helping companies tackle data engineering challenges. His expertise includes building pipelines and streaming systems. He sometimes tweets at @danosipov.