Staying On Top of ML Model and Data Drift
A lot of things can go wrong when developing machine learning models. You can use poor-quality data, mistake correlation for causation, or overfit your model to the training data, just to name a few. But there are also gotchas that data scientists need to look out for after the models have been deployed into production, specifically around model and data drift.
Data scientists pay close attention to the data they use to train their machine learning models, as they should. Machine learning models, after all, are simply functions of data. But the work is not over once the models are put into production, as data scientists must monitor the models to be sure they’re not drifting.
There are a few forms of drift that can throw a wrench into predictive analytics projects. Data scientists should be on the lookout for each of them to reduce the odds that drift undermines their work.
For starters, there is concept drift. This happens when the thing a model is trying to predict, or its relationship to the input data, materially changes. Think of fraudsters who change their techniques in order to evade detection. Data scientists need to change their models to account for the fraudsters’ new techniques.
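One way to watch for concept drift, assuming ground-truth labels eventually arrive for production predictions (confirmed fraud cases, for instance), is to compare recent accuracy against the accuracy measured at deployment time. The sketch below is illustrative only. The class name, window size, and tolerance are assumptions made for the example, not something described by any source in this article.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent labeled predictions.

    Hypothetical helper for illustration: baseline_accuracy is the accuracy
    measured when the model was deployed, and tolerance is how far recent
    accuracy may fall below it before we flag possible concept drift.
    """

    def __init__(self, baseline_accuracy, window=100, tolerance=0.10):
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # deque with maxlen silently discards the oldest outcome when full
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether a single prediction matched its eventual label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drift_suspected(self):
        """True when a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.accuracy() < self.baseline_accuracy - self.tolerance
```

A fixed tolerance is a blunt instrument. It says nothing about why accuracy fell, which is why a flag like this is better treated as a prompt to investigate than as an automatic retraining trigger.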
The data itself can also drift. For example, during the COVID-19 lockdown, customer buying patterns changed dramatically. The buying signals that companies in the consumer goods supply chain rely on to increase or decrease supply and to set prices were in disarray, leading to what many call “the new normal.” Physical sensors, such as thermometers, also periodically need to be recalibrated to ensure accuracy.
There can also be technical glitches in the data collection process that can cause the data to become skewed. Perhaps the sampling frequency changed suddenly, or somebody introduced a new way to measure something and forgot to tell their friendly neighborhood data scientists (big mistake).
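Shifts like these show up as a change in the distribution of an input feature, whatever the cause. One widely used summary of such a change is the population stability index (PSI), which compares binned frequencies from the training data with binned frequencies from live data. The sketch below assumes a single numeric feature and equal-width bins. The function name and defaults are assumptions for illustration.

```python
import math


def population_stability_index(reference, live, bins=10, eps=1e-6):
    """Return the PSI between a reference sample and a live sample.

    Bin edges are taken from the reference sample. Live values outside
    that range are clamped into the first or last bin.
    """
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp into a valid bin
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))
```

A commonly cited rule of thumb treats a PSI below roughly 0.1 as stable and above roughly 0.25 as a substantial shift, though those cutoffs are conventions rather than guarantees and depend on the binning.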
Generally, data scientists are aware that their models will become stale over time, and that they need to be retrained on fresher data to maintain accuracy levels. All models need to be periodically retrained, and in general, the more often a model is retrained, the better.
But relying on more frequent model retraining can’t entirely eliminate the problem of model or data drift, says Nick Elprin, CEO and co-founder of Domino Data Lab, which develops a platform that helps teams of data scientists collaborate on the development and deployment of machine learning models.
“I think just saying retraining frequently is good or better is a bit of an oversimplification,” Elprin says. “I think what’s really valuable is data scientists having an informed and contextual feedback loop about what is happening with their models in the real world. So just doing a blind retrain without knowing how have our features changed and shifted? Maybe there’s new and important information that needs to be folded into our training sets.”
Domino recently rolled out a new component of its platform aimed at helping users spot drift and respond to it. Called the Domino Model Monitor, or DMM, the software automatically keeps watch on the behavior of models and on certain qualities of the live data used to generate predictions in production.
The Domino platform has always kept logs of model behavior, which customers could use to build their own monitoring tool, but it hasn’t automated the monitoring on behalf of customers until it developed DMM. According to Elprin, the software uses a variety of statistical checks to detect if drift is happening, and if it is, to tell if it’s statistically significant enough to impact predictions in a meaningful way.
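The article does not say which statistical checks DMM runs, so the following is not a description of the product. As one common example of this kind of check, a two-sample Kolmogorov-Smirnov test compares the distribution of a feature in training data with its distribution in live data and yields a p-value for the difference. The sketch below is a plain-Python version that uses the standard asymptotic approximation for the p-value. The function name and the 0.3 cutoff for the near-zero case are assumptions for illustration.

```python
import math


def ks_two_sample(reference, live):
    """Return (statistic, approximate p-value) for a two-sample KS test.

    The statistic is the largest gap between the two empirical CDFs.
    The p-value uses the asymptotic Kolmogorov series with a small-sample
    correction, so treat it as approximate for small samples.
    """
    ref, cur = sorted(reference), sorted(live)
    n, m = len(ref), len(cur)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # advance both pointers past every value <= x, then compare CDFs
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))

    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # the series converges poorly near zero, where the true value is ~1
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)
```

A small p-value indicates that a difference is unlikely to be due to chance, not that it matters to predictions. With large samples, even trivial shifts become statistically significant, so the size of the statistic deserves as much attention as the p-value.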
It also provides an exploratory visualization tool that allows data scientists to drill into the specific features used in the model to determine how the drift is happening. It’s all about helping data scientists to develop a better understanding of what’s going on with the data and with their model or models, Elprin says.
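To give a rough sense of what drilling into individual features can involve, the sketch below ranks features by how far their live mean has moved from the training mean, measured in training standard deviations. This is a deliberately simple stand-in and not a description of how DMM's visualization works. The function name and the list-of-dicts input format are assumptions for illustration.

```python
from statistics import mean, pstdev


def rank_features_by_shift(reference_rows, live_rows):
    """Rank numeric features by standardized mean shift, largest first.

    Each row is a dict mapping feature name to a numeric value. The score
    is |live mean - reference mean| divided by the reference std deviation.
    """
    if not reference_rows or not live_rows:
        raise ValueError("both row sets must be non-empty")
    scores = {}
    for name in reference_rows[0]:
        ref = [row[name] for row in reference_rows]
        cur = [row[name] for row in live_rows]
        spread = pstdev(ref)
        shift = abs(mean(cur) - mean(ref))
        if spread == 0:
            # a constant feature that moved at all is an unbounded shift
            scores[name] = 0.0 if shift == 0 else float("inf")
        else:
            scores[name] = shift / spread
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A mean shift misses changes in shape or variance, so a ranking like this is best used to decide which features to inspect more closely rather than as a verdict on its own.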
“This is the thing about models: Even just one of them can be extremely important and extremely valuable, so even just one model that’s drifting can cause a massive impact to a business,” he says.
Domino Data Lab focuses on “code first” data scientists. Its platform provides structure that allows them to use their favorite tools, like Python or R frameworks, in a managed and collaborative environment. About 20% of the Fortune 100 are Domino customers, according to Elprin, and its largest customer has close to 1,000 data scientists using the product. DMM works with models that were developed and deployed in the Domino platform, but it also works with models developed with other tools.
The biggest and most advanced customers have the biggest concerns about model and data drift and the impact that can have on their businesses, Elprin says.
“The customers farthest along the journey and the adoption curve have a bunch of models in production, driving mission-critical use case and process,” he says. “That creates a whole bunch of risk, because models are probabilistic. Their behavior can change simply if the world around them evolves and changes.”
“Being aware of and managing model drift and detecting model performance degradation is a critical problem that a growing number of companies are starting to become aware of,” he continues. “Because it’s a fairly new problem for the world and the market, there haven’t been solutions to deal with this in the past. That’s what Domino Model Monitor is all about: helping companies early on detect model drift before it creates financial loss or degraded customer experience as the models change performance.”
The San Francisco company has been doing well lately, according to Elprin. While the COVID-19 pandemic has forced the company’s employees to work from home, that hasn’t stopped the company from completing a $43 million Series E round, which was announced last week along with DMM. Among the new investors was Dell Technologies, which is both an investor and a customer.
As if DMM and a significant round of funding were not enough news for one day, the company also announced version 4.2 of the Domino platform. In addition to DMM, the new release brings support for on-demand Spark clusters; new data science project management; and support for additional Kubernetes distributions.
“It feels like we’re on a tear,” Elprin said in a Zoom call last week. “We are, as far as I know, the only open, enterprise grade data science platform that really serves the need of really large organizations.”