What’s the right time to train ML models, again?

Aug 27, 2021

By Davide Fiacconi — Data Scientist in Radicalbit

Datasets change over time, and models should adapt too

Building a well-performing Machine Learning model requires a significant amount of experimentation. A data scientist tries different algorithms and different feature engineering strategies before getting the model right. Once the model has been tuned to its best, it may be time to serve it in the production environment.

However, when data scientists optimise their models, experiments are run against frozen datasets that do not evolve over time. The key assumption is that such a dataset is representative of the data distribution that the model will encounter at prediction time. Unfortunately, this assumption is not always guaranteed to hold once the model is moved to production. Its validity depends on the data generation process and on whether that process evolves over time. If the data comes from users interacting with a UI, for instance, changes in the UI itself or in the interaction patterns may drive the evolution of the data distribution.

Therefore, monitoring the performance of a model while also identifying potential drifts in the data is of paramount importance for checking the health of a model in production, and it is a pillar of MLOps best practices.

Drift modes and Detection Strategies

We call drift the general evolution over time of a data distribution. It is possible to identify three root causes of drift.

Data drift: data drift happens when the distribution of the input data fed to the model changes over time.
Concept drift: concept drift occurs when the definition of the target value the model is trying to predict changes over time.
Upstream changes: upstream changes are not conceptual data changes; the term refers mostly to technical issues that might occur in the upstream phases of data handling (data collection, data preprocessing) and may lead to changes in the data schema or types, or to some of the expected features going missing.

Neglecting upstream changes, the real challenge for an efficient Machine Learning system in production is to cope with data and concept drifts. Typically, the model is “static” at prediction time. By static, we mean that its internal parameters, learned during the training process, do not change. Therefore, in case of data drift, the model performance is likely to drop whenever the input data has changed so much that it looks like something the model has never seen before. Moreover, the prediction concept a model has learnt during training is tied to the model’s parameters. Therefore, when concept drift occurs, the model will keep providing predictions bound to the original concept, and hence in disagreement with the new expectations. In real-world applications, nothing prevents data and concept drift from occurring at the same time, which makes them non-trivial to disentangle.
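The difference between the two drift modes can be illustrated with a toy example. The sketch below is not taken from the experiments in this post: the Gaussian input, the threshold-based labels and the frozen model are all assumptions made for illustration. Data drift is simulated by shifting the input distribution, concept drift by moving the labelling threshold while the model stays untouched.

```python
import random

def make_stream(n, x_mean, threshold, seed):
    """Return n (x, y) pairs with x ~ Normal(x_mean, 1) and y = 1 if x > threshold."""
    rng = random.Random(seed)
    xs = [rng.gauss(x_mean, 1.0) for _ in range(n)]
    return [(x, int(x > threshold)) for x in xs]

def static_model(x):
    """A frozen model that learned the original concept: y = 1 if x > 0."""
    return int(x > 0.0)

def error_rate(stream):
    """Fraction of records on which the frozen model disagrees with the label."""
    return sum(static_model(x) != y for x, y in stream) / len(stream)

def mean_x(stream):
    """Mean of the input feature, a statistic an input-only monitor could track."""
    return sum(x for x, _ in stream) / len(stream)

# Hypothetical scenarios (all numbers are illustrative assumptions).
reference = make_stream(5000, x_mean=0.0, threshold=0.0, seed=1)
data_drift = make_stream(5000, x_mean=2.0, threshold=0.0, seed=2)     # inputs move
concept_drift = make_stream(5000, x_mean=0.0, threshold=1.0, seed=3)  # labels move

for name, stream in [("reference", reference),
                     ("data drift", data_drift),
                     ("concept drift", concept_drift)]:
    print(f"{name}: mean x = {mean_x(stream):.2f}, error rate = {error_rate(stream):.2f}")
```

In this idealised setup the frozen model matches the original concept everywhere, so shifting the inputs alone moves the input mean without causing errors, while the concept change leaves the inputs untouched and shows up only in the error rate. A real model is usually accurate only where it has seen training data, which is why data drift tends to hurt performance in practice.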

In order to tackle the drift problem, various algorithms have been devised. Each of them has its own advantages, but they are all generally tailored to identify variations over time in the data distribution or in some associated statistic (e.g. the mean). The idea is to look at successive data windows, defined either by a number of datapoints or by time intervals, in order to have an “old” and a “recent” data sample to compare.
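The window idea can be sketched in its simplest form as follows. This is only an illustration and not one of the algorithms discussed in this post: the fixed window size, the comparison of plain means and the threshold value are all assumptions.

```python
from collections import deque

def window_mean_shift(stream, window=100, threshold=0.2):
    """Yield the indices at which the mean of the most recent `window` values
    differs from the mean of the `window` values before them by more than
    `threshold`. Window size and threshold are illustrative assumptions."""
    old, recent = deque(maxlen=window), deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) == window:
            old.append(recent[0])   # the oldest "recent" point becomes "old"
        recent.append(value)        # deque drops its oldest element when full
        if len(old) == window:
            gap = abs(sum(recent) / window - sum(old) / window)
            if gap > threshold:
                yield i

# A stream whose value jumps from 0 to 1 at record 300.
alarms = list(window_mean_shift([0.0] * 300 + [1.0] * 300))
print(alarms[0] if alarms else "no drift flagged")
```

A detector like this keeps raising alarms for as long as the two windows straddle the change; the algorithms discussed in this post handle window sizes and thresholds in a more principled way.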

But how do such algorithms work? We can distinguish two different approaches.

Looking at the input data only. The idea is to look directly at the stream of input data and try to detect changes by following the evolution of some statistics. Methods of this kind directly address data drift and include techniques such as the ADaptive WINdowing (ADWIN) algorithm.
Evaluating model predictions. The idea is to look for drifts in the stream of evaluations of the model predictions. In other words, whenever a model receives an input and makes a prediction, the observed ground truth for that prediction is provided in order to build the corresponding stream of model errors. This latter stream is then inspected for drifts, with the underlying expectation that a data or a concept drift would likely induce a drop in model performance, hence more errors. The Drift Detection Method (DDM) algorithm and its variations rely on this approach.

Simulated scenarios to test drift detection methods

Let us consider the ADWIN and the DDM algorithms as representatives of the first and second category, respectively. It is interesting to test how well these approaches detect drifts and what their best use cases are. In order to do so, we have simulated a few synthetic univariate datasets that represent different ways in which drifts may occur and manifest, and we have run the two algorithms against each of them to see whether, and after how long, they could identify the drifts.

Testing ADWIN algorithm

We start by describing the test cases for the ADWIN algorithm. The figure below shows the first synthetic dataset, plotting the value of the variable as a function of the number of records in the stream.

It depicts a scenario where a variable x has values oscillating around a constant reference, and both the reference and the amplitude of the oscillation change at the green dashed lines, which mark the data drifts.
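A dataset of this kind could be generated along the following lines. This is a sketch and not the code used to produce the figure: the segment lengths, reference values, amplitudes and the use of uniform noise for the oscillation are all assumptions.

```python
import random

def piecewise_constant_stream(segments, seed=0):
    """Build a stream from (length, reference, amplitude) segments.

    Within a segment the values scatter uniformly in
    [reference - amplitude, reference + amplitude]. Returns the values and
    the indices where a new segment starts, i.e. the true drift points.
    """
    rng = random.Random(seed)
    values, drift_points = [], []
    for length, reference, amplitude in segments:
        if values:
            drift_points.append(len(values))
        values.extend(reference + amplitude * rng.uniform(-1, 1)
                      for _ in range(length))
    return values, drift_points

# Hypothetical segments: both the reference and the amplitude change at each drift.
SEGMENTS = [(1000, 0.3, 0.05), (1000, 0.6, 0.10), (1000, 0.4, 0.02)]
values, drift_points = piecewise_constant_stream(SEGMENTS)
print(len(values), drift_points)
```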

A second synthetic dataset is shown below. The main difference with respect to the first one is that the reference value has a trend that changes over time, idealizing a scenario where the x variable represents for instance a time series with varying trends. As before, the vertical green lines mark the data drift moments. Finally, note that since ADWIN works on the input data, there are no explicit constraints on the range of values, although in these examples we kept the numbers between 0 and 1 for simplicity.
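A trending dataset of this kind could be sketched as follows. Again, this is not the code behind the figure: the slopes, the noise level and the starting value are assumptions, chosen so that the values stay between 0 and 1.

```python
import random

def piecewise_trend_stream(segments, start=0.2, noise=0.02, seed=0):
    """Build a stream from (length, slope_per_record) segments.

    The reference level changes linearly within each segment, and the slope
    changes at the segment boundaries, which are returned as drift points.
    """
    rng = random.Random(seed)
    values, drift_points, level = [], [], start
    for length, slope in segments:
        if values:
            drift_points.append(len(values))
        for _ in range(length):
            level += slope
            values.append(level + noise * rng.uniform(-1, 1))
    return values, drift_points

# Hypothetical trends: rising, then falling, then rising more slowly.
TREND_SEGMENTS = [(1000, 0.0005), (1000, -0.0003), (1000, 0.0002)]
trend_values, trend_drift_points = piecewise_trend_stream(TREND_SEGMENTS)
print(len(trend_values), trend_drift_points)
```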

We can then apply the ADWIN algorithm to these two test cases. The results are shown in the image below, where the red vertical lines mark the detected drifts. Without going into the details of how to set the “sensitivity” parameter of the algorithm, it is still instructive to compare the results on the two test cases, since they are mostly driven by the characteristics of the two datasets themselves.

We can immediately notice that the algorithm works very well on the first dataset, promptly identifying the drifts, while it comes up with too many putative drifts on the second dataset. This difference can be understood by recalling that ADWIN looks for variations over time in the mean of the data. In the first dataset, the mean of the data is constant despite the oscillations of the values, and it changes only at the drifts. In the second dataset, on the other hand, the trends are such that the mean changes continuously over time.

In a sense, the algorithm is picking up the change correctly; however, the two datasets depict two conceptually different scenarios. In the first case, the sequential order of the data (in between drifts) is not relevant, while in the second case the ordering carries part of the information (i.e. the underlying trend), and it does make a difference if the data are shuffled. To give an example, the first case is akin to a sequence of customer profiles to be evaluated for churn probability, where it makes no difference whether customer A is evaluated before customer B or vice versa. The second case is like stock market data, which are time-series data, so the order determines whether the value is rising or falling. Therefore, the underlying trend in the second scenario should not be interpreted as a data drift.
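To make this behaviour concrete, here is a naive ADWIN-style detector. It is a simplified sketch, not the implementation used for the figures: it stores every value instead of using ADWIN's compressed buckets, checks the window only every few records, and cuts the window at the single most significant split. The parameter values and the two small test streams are assumptions.

```python
import math
import random

class SimpleAdwin:
    """Naive ADWIN-style mean-change detector for values in [0, 1]."""

    def __init__(self, delta=0.002, min_side=30, check_every=10):
        self.delta = delta              # confidence ("sensitivity") parameter
        self.min_side = min_side        # minimum size of each sub-window
        self.check_every = check_every  # how often the window is inspected
        self.window = []
        self.seen = 0

    def update(self, value):
        """Add one value; return True if a change in the mean was detected."""
        self.window.append(value)
        self.seen += 1
        if self.seen % self.check_every != 0:
            return False
        n = len(self.window)
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)
        best_split, best_excess = None, 0.0
        left_sum = 0.0
        for split in range(1, n):
            left_sum += self.window[split - 1]
            n0, n1 = split, n - split
            if n0 < self.min_side or n1 < self.min_side:
                continue
            m = 1.0 / (1.0 / n0 + 1.0 / n1)        # harmonic size of the two parts
            eps = math.sqrt(log_term / (2.0 * m))  # Hoeffding-style threshold
            gap = abs(left_sum / n0 - (total - left_sum) / n1)
            if gap - eps > best_excess:
                best_split, best_excess = split, gap - eps
        if best_split is None:
            return False
        self.window = self.window[best_split:]     # forget the "old" part
        return True

def detect(values):
    """Return the record indices at which the detector raises an alarm."""
    detector = SimpleAdwin()
    return [i for i, v in enumerate(values) if detector.update(v)]

rng = random.Random(0)
# Stationary between drifts: the mean jumps from 0.3 to 0.6 at record 1000.
step = ([0.3 + 0.05 * rng.uniform(-1, 1) for _ in range(1000)]
        + [0.6 + 0.10 * rng.uniform(-1, 1) for _ in range(1000)])
# No injected drift at all, just a steady upward trend from 0.1 to 0.9.
trend = [0.1 + 0.8 * i / 3000 + 0.02 * rng.uniform(-1, 1) for i in range(3000)]

step_alarms = detect(step)
trend_alarms = detect(trend)
print("step stream alarms:", step_alarms)
print("trend stream alarms:", trend_alarms)
```

On the stationary stream the alarm comes shortly after the jump in the mean, whereas the steady trend keeps triggering alarms although no drift was injected, because the mean of the window never stops changing.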

Testing DDM algorithm

The figure below shows the test dataset used for the DDM algorithm. Since DDM does not evaluate the data directly, but rather the corresponding stream of errors made by a putative model that is doing inference and receiving ground-truth feedback, the structure of the dataset is different from the ADWIN case.

The thick green line shows the error rate of the putative model. A drift corresponds to an increase in the error rate. At the beginning, the model has a 5% error rate (a good model); then a drift occurs, and the error rate suddenly jumps to 40%; then a new drift occurs, and the error rate goes up again to 60%; finally, the last drift occurs, and the error rate becomes as bad as 95% (the model is wrong most of the time). The blue dots are the stream of the model results, namely a value of 1 when the model makes a mistake and a value of 0 when the hypothetical prediction is correct. They have been randomly sampled in proportions consistent with the error rate in between drifts (e.g. 5% of 1’s and 95% of 0’s when the error rate is 5%), as confirmed by the orange line, which is the a-posteriori error rate inferred from the sampled data and is consistent with the true underlying one.
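An error stream of this kind can be sampled as in the sketch below. The error rates are the ones quoted above, while the segment length and the random seed are assumptions, since the size of the original dataset is not stated here.

```python
import random

def sample_error_stream(segments, seed=0):
    """Sample a 0/1 error stream from (length, error_rate) segments.

    Each record is 1 (model wrong) with probability error_rate, else 0.
    """
    rng = random.Random(seed)
    return [int(rng.random() < rate)
            for length, rate in segments
            for _ in range(length)]

def observed_error_rates(errors, segments):
    """A-posteriori error rate measured within each segment."""
    rates, start = [], 0
    for length, _ in segments:
        rates.append(sum(errors[start:start + length]) / length)
        start += length
    return rates

# Error rates from the text; 2000 records per segment is an assumption.
ERROR_SEGMENTS = [(2000, 0.05), (2000, 0.40), (2000, 0.60), (2000, 0.95)]
errors = sample_error_stream(ERROR_SEGMENTS)
observed = observed_error_rates(errors, ERROR_SEGMENTS)
print([round(rate, 3) for rate in observed])
```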

So, with the dataset at hand, we can evaluate the performance of the DDM algorithm. This is shown in the image below.

As before, the detected drifts are marked with red vertical lines. The algorithm promptly detects the drifts in the underlying model performance. Again, the scenario here is the following: there is a putative model that receives a stream of data and creates a corresponding stream of predictions. Those predictions are checked against the ground truth, i.e. there must be a way to validate their correctness. This gives us the stream of model errors that the DDM algorithm uses to look for any drift in the model performance.
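The core of this approach can be sketched as follows. This is a minimal DDM-style detector written for illustration, not the implementation used for the figure: it tracks the running error rate p and its standard deviation s, remembers the point where p + s was lowest, and signals a drift when p + s exceeds that minimum by three standard deviations. The warm-up length, the omission of the warning level and the guard that waits for the first error are simplifying assumptions.

```python
import math
import random

class SimpleDDM:
    """Minimal DDM-style detector fed with 0/1 errors (1 = wrong prediction)."""

    def __init__(self, min_samples=30, drift_level=3.0):
        self.min_samples = min_samples  # warm-up before any decision
        self.drift_level = drift_level  # number of standard deviations
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Add one 0/1 error; return True if a drift was detected."""
        self.n += 1
        self.errors += error
        # Guard added for this sketch: wait for the warm-up and the first error.
        if self.n < self.min_samples or self.errors == 0:
            return False
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its standard deviation
        if p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = p, s      # best state seen so far
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()                       # start over after a drift
            return True
        return False

def detect_drifts(errors):
    """Return the record indices at which the detector signals a drift."""
    detector = SimpleDDM()
    return [i for i, e in enumerate(errors) if detector.update(e)]

# Randomly sampled stream with the error rates from the text
# (1000 records per segment is an assumption).
rng = random.Random(0)
stream = [int(rng.random() < rate)
          for rate in (0.05, 0.40, 0.60, 0.95)
          for _ in range(1000)]
print(detect_drifts(stream))
```

Because the reference minimum is itself a lucky fluctuation of a random stream, a detector of this kind can occasionally raise an alarm where no drift was injected, which is one reason variations of the basic method exist.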

What is Radicalbit doing about it?

So, what can we conclude? Given that model monitoring is fundamental for any machine learning system in production, we have to realise that no single technique can deal with every circumstance. Looking at the two types of approaches analysed above, we can see that both have advantages and disadvantages.

Looking at the input data only. This monitoring strategy is easy to implement, because it requires only the input data of the model and does not rely on building a separate process to collect assessments of the model predictions. It is therefore useful even in cases where the model inference is hard to verify directly. On the other hand, it might be hard to adapt to every kind of input data, and it works better with time-unaware data, i.e. data whose ordering does not carry information, than with e.g. time series.
Evaluating model predictions. This monitoring strategy is directly tied to the quantitative performance of the model, and therefore it is generally more precise in detecting drifts or any unexpected “misbehaviour” of a model. However, it might be non-trivial to collect the feedback needed to verify the correctness of the model predictions, which is mandatory to adopt this strategy. Moreover, for certain kinds of applications (e.g. regression tasks), it may be less obvious how to define whether the model is “right” or “wrong” in a dichotomous way.
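For regression tasks, one possible workaround is to call a prediction “wrong” when its absolute error exceeds a tolerance, which turns the residuals into the 0/1 stream that error-based detectors expect. The sketch below only illustrates this idea: the function name and the tolerance are hypothetical, and choosing a sensible tolerance is application-specific.

```python
def to_error_stream(predictions, targets, tolerance):
    """Map regression outputs to 0/1 errors: 1 if |prediction - target| > tolerance."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return [int(abs(p - t) > tolerance) for p, t in zip(predictions, targets)]

# Hypothetical predictions and ground-truth values.
binary_errors = to_error_stream([1.0, 2.0, 3.0], [1.1, 2.6, 3.0], tolerance=0.5)
print(binary_errors)
```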

At Radicalbit, we are upgrading our MLOps platform to incorporate these algorithms, and some variations of them, in order to better adapt to different situations. We are also working to combine them with other techniques, in an effort that mixes R&D and product strategy, to provide a flexible platform that makes model monitoring efficient and effective.