Interventions

So, there may be an issue with your test set, train set, or your model. Now what do we do?

Below are some plans of attack:

1. Do Nothing (but…)

Take the estimate of model performance on the test set as the expected behavior of the model on new data.

But also consider these best practices when presenting a final chosen model:

  • Present a performance distribution/confidence interval on the test set instead of a point estimate (see the sketch after this list).
  • Present an ensemble of final models instead of a single final chosen model fit on the train set, to reduce prediction variance.
  • Present the first, second, and third best models and contrast their performance-complexity trade-offs.
  • Present a feature importance analysis to show which features contribute most/least to the chosen model.
  • Present an ablation analysis to show how performance improves with each added component.
  • Present a scalability analysis to show how performance improves with the size of the training data.
  • Present a development timeline analysis, plotting all models/configurations you tried against their performance over time as a line plot.
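
For the confidence interval in the first item, bootstrapping the test-set predictions is a simple option. Below is a minimal sketch, assuming y_test and the final model’s predictions y_pred are already in memory; the metric, the 1,000 resamples, and the 95% interval are illustrative choices, and bootstrap_ci is just a hypothetical helper name.

```python
# Minimal sketch: bootstrap confidence interval for a test-set metric.
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, metric=accuracy_score, n_boot=1000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample rows with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    lower = np.percentile(scores, 100 * alpha / 2)
    upper = np.percentile(scores, 100 * (1 - alpha / 2))
    return np.median(scores), (lower, upper)

# e.g. report "accuracy 0.87 (95% CI 0.84 to 0.90)" instead of a single number.
```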

2. Fix Your Test Harness

In most cases, the fix involves making the test harness results less biased (avoiding overfitting) and more stable/robust (sampling performance more thoroughly).

  1. Use k-fold cross-validation, if you’re not already.
  2. Use 10 folds instead of 5 or 3, if you can afford the computational cost (consider a sensitivity analysis of k).
  3. Use repeated 10-fold cross-validation, if you can afford the computational cost.
  4. Use a pipeline of data cleaning, transforms, feature engineering, etc. steps that are applied automatically.
    • The pipeline steps must be fit on the train data only (the same data used to fit a candidate/final model) and then applied to all data before it reaches the model.
    • See sklearn.pipeline.Pipeline.
  5. Use nested cross-validation or a nested train/validation split to tune model hyperparameters for a chosen model (see the sketch after this list).
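
As a starting point, here is a minimal sketch that combines several of these points: a preprocessing Pipeline evaluated with repeated stratified 10-fold cross-validation, with one hyperparameter tuned in a nested inner loop. The synthetic dataset, the LogisticRegression model, and the C grid are placeholder assumptions; swap in your own data, pipeline steps, and search space.

```python
# Minimal sketch: Pipeline + repeated stratified 10-fold CV + nested hyperparameter tuning.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=1)  # stand-in data

# Transforms are fit inside each training fold only, so no information leaks
# from the held-out fold into the preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inner loop tunes the hyperparameter; outer loop estimates generalization.
inner = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(inner, X, y, cv=outer, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```
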
⚠️
Warning: After fixing the test harness, if you want to choose a different model/configuration and evaluate it on the test set, you should probably develop a new test set to avoid any test set leakage.

3. Fix Your Overfit Model(s)

Even with best practices, high-capacity models can overfit.

This is a big topic, but to get started, consider:

  1. Simplification: Reducing model complexity by using fewer layers/parameters, removing features, or choosing a simpler model architecture, etc. This limits the model’s capacity to memorize training data and forces it to learn more generalizable patterns instead.
  2. Regularization: Adding penalty terms to the loss function that discourage complex models by penalizing large weights. Common approaches include L1 (Lasso) and L2 (Ridge) regularization, which help prevent the model from relying too heavily on any single feature (see the sketch after this list).
  3. Early Stopping: Monitoring model performance on a validation set during training and stopping when performance begins to degrade. This prevents the model from continuing to fit noise in the training data after it has learned the true underlying patterns.
  4. Data Augmentation: Artificially increasing the size and diversity of the training dataset by applying transformations or adding noise to existing samples. This exposes the model to more variation and helps it learn more robust features rather than overfitting to specific training examples.
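
As a concrete illustration of items 2 and 3, the sketch below uses scikit-learn’s SGDClassifier with an L2 penalty and built-in early stopping against a held-out validation fraction. The synthetic data, the alpha value, and the patience of 5 epochs are assumptions to adapt to your problem.

```python
# Minimal sketch: L2 regularization + early stopping with scikit-learn's SGDClassifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=1)  # stand-in data

model = SGDClassifier(
    penalty="l2",            # L2 (Ridge-style) penalty on the weights
    alpha=0.001,             # regularization strength (assumed value)
    early_stopping=True,     # hold out a validation fraction during fit
    validation_fraction=0.1,
    n_iter_no_change=5,      # stop after 5 epochs without improvement
    random_state=1,
)
model.fit(X, y)
print("epochs run before stopping:", model.n_iter_)
```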

Beyond these simple measures, dive into the literature for your domain and your model and sniff out common/helpful regularization techniques you can test.

⚠️
Warning: After fixing your model, if you want to evaluate it on the test set, you should probably develop a new test set to avoid any test set leakage.

4. Fix Your Test Set (danger!)

Perhaps the test set is large enough and data/performance distributions match well enough, but there are specific examples that are causing problems.

  • Perhaps some examples are outliers (cannot be predicted effectively) and should be removed from the train and test sets?
  • Perhaps a specific subset of the test set can be used (careful of selection bias)?
  • Perhaps the distribution of challenging examples is not balanced across train/test sets and domain-specific stratification of the dataset into train/test sets is required?
  • Perhaps weights can be applied to samples and a sample-weight-aware learning algorithm can be used (see the sketch after this list)?
  • Perhaps you can augment the test set with artificial/updated/contrived examples (danger)?
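
For the stratification and sample-weighting ideas above, here is a minimal sketch with scikit-learn. The synthetic imbalanced data, the choice to stratify on the label, and the 5x weighting of the minority class are all illustrative assumptions; in practice the grouping and the weights would come from your domain.

```python
# Minimal sketch: stratified train/test split + sample-weighted fitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stand-in imbalanced data; the minority class plays the role of the "challenging" examples.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# Stratify on the label (or a domain-specific grouping) so challenging examples
# are balanced across the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Assumed weighting scheme: up-weight the minority class, then pass the weights
# to a sample-weight-aware learning algorithm.
weights = np.where(y_train == 1, 5.0, 1.0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train, sample_weight=weights)
print("test accuracy:", model.score(X_test, y_test))
```
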
⚠️
Warning: There is a slippery slope of removing all the hard-to-predict examples or too many examples. You’re intentionally biasing the data and manipulating the results. Your decisions need to be objectively defensible in front of your harshest critic.

5. Get New Data (do this)

Perhaps the test set is too small, or the data/performance distributions are significantly different, or there is some other key failing.

Here are your best options:

New Test Set

  1. Discard the current test set.
  2. Gather more data.
  3. Use the new data as the new test set.
    • Ideally, use some distribution and/or performance checks to develop confidence in the test set before adopting it (see the sketch after this list).
  4. Develop a new unbiased estimate of the model’s generalization performance.
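
One simple distribution check is to compare each feature of the candidate test data against the existing training data with a two-sample Kolmogorov-Smirnov test. In the sketch below, X_train and X_new_test are assumed to be NumPy arrays already in memory, distribution_report is a hypothetical helper name, and the 0.05 threshold is an arbitrary choice.

```python
# Minimal sketch: per-feature distribution check of a candidate test set against
# the existing training data, using the two-sample Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp

def distribution_report(X_train, X_new_test, alpha=0.05):
    flagged = []
    for j in range(X_train.shape[1]):
        stat, p = ks_2samp(X_train[:, j], X_new_test[:, j])
        if p < alpha:  # distributions differ more than chance alone would suggest
            flagged.append((j, stat, p))
    return flagged

# Any flagged feature columns deserve a closer look before trusting the new test set.
```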

New Train/Test Sets

Alternately, and perhaps more robustly:

  1. Discard all data (train and test sets).
  2. Gather more data.
  3. Split the new data into new train/test sets.
  4. Develop a new unbiased estimate of the model’s generalization performance.
    • Ideally, you would perform model selection again from scratch.

6. Reconstitute Your Test Set

Sometimes acquiring a new test set is not possible, in which case:

  1. Combine the train and test sets into the whole dataset again.
  2. Develop a new split into train/test sets (see the sketch after this list).
  3. Develop a new unbiased estimate of the model’s generalization performance.
    • Ideally, you would perform model selection again from scratch.
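
A minimal sketch of this recombine-and-resplit step is shown below, assuming the original X_train/X_test/y_train/y_test arrays are still in memory; the 70/30 split, the stratification on the label, and the new random seed are arbitrary choices worth recording for reproducibility.

```python
# Minimal sketch: recombine the old train/test sets and develop a new split.
import numpy as np
from sklearn.model_selection import train_test_split

# 1. Combine the old train and test sets back into one dataset.
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, y_test])

# 2. Develop a new split with a different random seed (and stratification).
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=7
)

# 3. Re-run model selection and estimate generalization performance on the new split.
```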

Although risky, this is common because of real-world constraints on time, data, domain experts, etc.

⚠️
Warning: There is a risk of optimistic bias with this approach, as you already know something about what does/doesn’t work on the examples in the original test set.