“Those who cannot remember the past are condemned to repeat it,” argued George Santayana in his work The Life of Reason. If you cannot learn from your past mistakes, you will be ill-prepared to make good decisions going forward. To make better decisions in the future, it’s important to understand the historical context and conditions leading up to the decisions that were made in the past.
When it comes to machine learning, learning from the past means understanding the temporal context your models will exist in: when they will make predictions, and what information is available to them at that moment. I refer to this as “prediction time,” and understanding what it is, how it’s used, and how it can be misused is vital when building machine learning models for production. Understanding when your models make predictions, and how best to support that, can help you and your team build more effective machine learning models the first time through.
Machine learning models don’t exist in a vacuum. Models exist to solve problems and to provide actionable intelligence. In order for a model to be useful, it has to provide solutions or answers in time for those answers to be used. This is what “prediction time” is all about.
A model’s prediction time is when the model will be expected to make a prediction, relative to the event it’s predicting. Understanding when things have occurred and when the predictions are being made is vitally important for systems that change dynamically over time. The closer in time your prediction is to the outcome it’s predicting, the more accurate the prediction will be. At the same time, the further in advance your model makes its predictions, the more useful those predictions can be.
To examine this, think about using a trip planner to predict how long it’ll take to get somewhere. A prediction made right now can take in current traffic and weather conditions and can be quite accurate. If you made your prediction yesterday, you wouldn’t be able to use the exact traffic or weather conditions, but you’d be better able to plan your day in advance, for instance by planning to leave earlier. The former has a prediction time that is the same as the time of the trip, whereas the latter has a prediction time that is one day prior. These are two different models, solving two different (yet similar) use cases.
To train these models, we need to calculate historic feature values: data representing what the conditions were in the past. For each model, we need to understand what information would have been available when we made those predictions in the past. If we built our models purely from our understanding of the “time of departure,” we might be tempted to use the actual traffic conditions in the model whose prediction time is one day in advance. While training, that model would appear just as accurate as its counterpart.
Unfortunately, if we took that model to production and used it to make actual predictions, those predictions would be unreliable and wildly off (if it could make predictions at all): at prediction time, the actual traffic conditions either wouldn’t be known yet, or would be filled in with incorrect values.
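One way to keep training data honest about prediction time is a point-in-time (“as-of”) join: for each historical trip, look up only the features that were observable at that trip’s prediction time. Here’s a minimal sketch using pandas, with hypothetical table and column names (`trips`, `traffic`, `congestion_index`) standing in for real data:

```python
import pandas as pd

# Hypothetical trip records: when each trip departed and how long it took.
trips = pd.DataFrame({
    "departure_time": pd.to_datetime([
        "2023-05-02 08:00", "2023-05-03 08:00", "2023-05-04 08:00",
    ]),
    "duration_minutes": [42, 55, 38],
})

# Hypothetical traffic snapshots, recorded as they became known.
traffic = pd.DataFrame({
    "observed_at": pd.to_datetime([
        "2023-05-01 08:00", "2023-05-02 08:00",
        "2023-05-03 08:00", "2023-05-04 08:00",
    ]),
    "congestion_index": [0.3, 0.9, 0.5, 0.7],
})

# This model predicts one day in advance, so each training row may only
# see traffic observed on or before (departure_time - 1 day).
trips["prediction_time"] = trips["departure_time"] - pd.Timedelta(days=1)

training = pd.merge_asof(
    trips.sort_values("prediction_time"),
    traffic.sort_values("observed_at"),
    left_on="prediction_time",
    right_on="observed_at",
    direction="backward",  # never look forward past prediction time
)
print(training[["departure_time", "congestion_index"]])
```

Joining on `departure_time` instead of `prediction_time` would silently hand the model traffic conditions it couldn’t have known a day in advance, which is exactly the failure mode described above.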
It’s important to note that prediction time is a model specification; it’s something that is, to an extent, under our control as data scientists. It reflects how we’re choosing to frame the problem.
That’s part of what makes this particular problem so difficult: there is no straightforward method for finding a solution, and one can’t be derived from a mathematical formula or an engineering decision. Solving these problems requires product, business, and domain understanding. To be successful, we need to know how and where the models will be used, who is using them, and what problem they’re trying to solve. These are questions of product design and product thinking, and as data scientists we often aren’t taught to think in those terms.
Taking the time to understand the problem and its context is an important step in building the right models the first time. Once we’ve made the product decision about what kind of model we’re building, the next challenge is understanding what information is actually available to us at that time and in that context.
Understanding what information is available to us when making predictions is often trickier than it seems. Seldom do features or data sources come with warnings about when they are valid. Many databases aren’t designed for looking up and calculating historic values (particularly production databases), so they end up providing anachronistic feature values that taint the training of our models. Sometimes the leak is obvious (such as using current traffic conditions in a model that predicts a day in advance), but other times we leak information in the aggregate. This typically happens either by normalizing across an entire dataset, or by failing to split training and evaluation sets temporally.
If I were predicting the value of a stock but chose to represent each value as a percentage of the dataset’s maximum value, I’d be signaling to each data point where it sits in relation to everything else. That is, when a stock’s value is at its peak, the model would know it’s the peak. Conversely, the minimum would know it’s the minimum, which is not information our model would have in the real world.
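The fix is to make every transformation point-in-time safe: normalize each observation using only the data available up to that moment. A small sketch with made-up prices, contrasting the leaky global normalization with an expanding (running) maximum:

```python
import pandas as pd

# Hypothetical daily closing prices for one stock.
prices = pd.Series([10.0, 12.0, 15.0, 11.0, 9.0])

# Leaky: dividing by the maximum of the whole series tells every row
# where it sits relative to the future peak (the peak row equals 1.0).
leaky = prices / prices.max()

# Point-in-time safe: divide each row by the maximum seen *so far*,
# which is all the model could have known at prediction time.
safe = prices / prices.expanding().max()

print(leaky.tolist())
print(safe.tolist())
```

Note that in the safe version a row can still equal 1.0 when it sets a new running high, but no row ever encodes knowledge of a maximum that hasn’t happened yet.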
The other issue arises when you don’t respect time when separating your datasets. Typically, you randomly split your data into training and test sets for training and evaluation. Unfortunately, once time gets involved, this becomes trickier: a train/test split assumes the two datasets are completely independent of one another.
In the context of time, however, that independence breaks down. If I have repeated observations of the same entity, I can’t train on an example from Wednesday and then test on an example from Tuesday. The Wednesday example retains some information about Tuesday, effectively allowing my prediction for Tuesday to cheat by having already learned the answer.
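The usual remedy is to split on time rather than at random: hold out everything after a cutoff date for evaluation, so the model is always tested on data from after its training window. A minimal sketch with a hypothetical daily dataset:

```python
import pandas as pd

# Hypothetical daily observations of the same entity.
data = pd.DataFrame({
    "date": pd.date_range("2023-05-01", periods=10, freq="D"),
    "value": range(10),
})

# A random split would scatter later days into training and earlier
# days into test; instead, everything after the cutoff is held out.
cutoff = pd.Timestamp("2023-05-08")
train = data[data["date"] < cutoff]
test = data[data["date"] >= cutoff]

print(len(train), len(test))  # 7 training days, 3 test days
```

For repeated evaluation, rolling this cutoff forward (as scikit-learn’s `TimeSeriesSplit` does) gives several train/test folds while still keeping every test set strictly in the future of its training set.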
Real-time models not only have to account for the time at which they make their predictions; they also have to be protected from contamination by knowledge of the future smuggled in through aggregates.
This is where the other major challenge comes into play. Many databases are designed to hold only current information: records are updated, edited, added, and dropped in place. Tracking each of those changes in a way that lets you recreate the values at any given point in time is a lot of extra work.
Additionally, retaining all of this history can slow systems down, which is especially problematic for production systems serving web pages and other transactional content. Since these systems weren’t built with ML in mind, the ability to calculate historic feature values is often either painful or non-existent. Complicated queries and transformations have to be written to reconstruct the historic values, if the data itself hasn’t been lost. Alternatively, if a new data source is needed, it often takes months of fresh data collection, since backfilling is usually too expensive to be feasible (again, if it’s possible at all).
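One common pattern for keeping history recoverable is an append-only table with validity timestamps: instead of overwriting a record, each change is stored as a new row with the time it took effect, so values can be reconstructed “as of” any moment. A small sketch in pandas, with a hypothetical `customer_id`/`plan` history standing in for real records:

```python
import pandas as pd

# Hypothetical append-only history table: rather than overwriting a
# customer's plan, each change is recorded with its effective time.
history = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "plan": ["free", "pro", "enterprise"],
    "effective_from": pd.to_datetime(
        ["2023-01-01", "2023-03-15", "2023-06-01"]
    ),
})

def plan_as_of(customer_id, when):
    """Return the plan a customer had at a given point in time."""
    rows = history[
        (history["customer_id"] == customer_id)
        & (history["effective_from"] <= pd.Timestamp(when))
    ]
    # The most recent change at or before `when` is the value in effect.
    return rows.sort_values("effective_from")["plan"].iloc[-1]

print(plan_as_of(1, "2023-04-01"))  # → "pro"
```

This is essentially a slowly-changing-dimension design; the storage and query cost it adds is exactly the trade-off described above, but it makes computing historically correct training features possible at all.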
Building models for systems that exist in a constantly changing world is difficult, partly because of gaps in how data scientists are trained, and partly because our tools and systems weren’t built to handle data in a properly historic fashion. Before you dive too far into model building, make sure you’re asking your team: “Are the models we’re building taking into account when they will be used in the real world?” and “Have we properly recreated the historic context?” Finally: “Are we learning from history, or are we preparing to repeat it?”
About the author: Max Boyd is a Data Science Lead at Kaskada, helping guide the development of a commercially available feature engineering platform. A Seattle native, he earned degrees in Statistics and Applied Mathematics from the University of Washington in Seattle. He has built and deployed models as a Data Scientist and Machine Learning Engineer at several Seattle-area tech startups in HR, finance and real estate.