Test Set Leakage

Leakage occurs when information from outside the training set (typically the test set) inappropriately influences the model training process, leading to unrealistically optimistic performance estimates that won’t hold in production.

Leakage of information about the test set (data or what models work well) to the train set/test harness is called “test set leakage” and may introduce an optimistic bias: better results than we should expect in practice.

This bias is typically not discovered until the model is employed on entirely new data or deployed to production.

Data Preparation Leakage

Any data cleaning (beyond removing duplicates), preparation, or analysis-like task performed on the whole dataset (prior to splitting) may result in “data preparation leakage”.

You must prepare/analyze the train set only (a minimal sketch follows the checklist below).

Is there evidence that knowledge from the test set leaked to the train set during data preparation?

  1. Are there duplicate examples in the test set and the train set (e.g. you forgot to remove duplicates or the train/test sets are not disjoint)?
  2. Did you perform data scaling on the whole dataset prior to the split (e.g. standardization, normalization, etc.)?
  3. Did you perform data transforms on the whole dataset prior to the split (e.g. power transform, log transform, etc.)?
  4. Did you impute missing values on the whole dataset prior to the split (e.g. mean/median impute, knn impute, etc.)?
  5. Did you engineer new features on the whole dataset prior to the split (e.g. one hot encode, integer encode, etc.)?
  6. Did you perform exploratory data analysis (EDA) on the whole dataset prior to the split (e.g. statistical summaries, plotting, etc.)?
  7. Did you engineer features that include information about the target variable?

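A minimal sketch of leak-free preparation, assuming scikit-learn and NumPy are available (the toy data, median imputer, and standard scaler are illustrative choices, not prescriptions): split first, fit every preparation step on the train portion only, then apply the already-fitted transforms to both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Toy data with a few missing values (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.integers(0, 2, size=200)

# 1. Split before any preparation so the test rows never influence fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 2. Fit the imputer and scaler on the train set only...
imputer = SimpleImputer(strategy="median").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

# 3. ...then apply the already-fitted transforms to both splits.
X_train_prep = scaler.transform(imputer.transform(X_train))
X_test_prep = scaler.transform(imputer.transform(X_test))
```

The same pattern applies to any statistic learned from the data (encodings, power transforms, feature selection): fit on the train set, apply to the test set.
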
Test Harness Leakage

On the test harness, we may use a train/validation set split, k-fold cross-validation or similar resampling techniques to evaluate candidate models.

Performing data preparation techniques on the entire train set prior to splitting the data for candidate model evaluation may result in “test harness leakage”.

Similarly, optimizing model hyperparameters on the same test harness used for model selection may result in overfitting and an optimistic bias.

  1. Did you perform data preparation (e.g. scaling, transforms, imputation, feature engineering, etc.) on the entire train set before a train/val split or k-fold cross-validation?
    • Prepare data for the model on the train set/train folds only (see the pipeline sketch after this list).
  2. Did you tune model hyperparameters using the same train/validation split or k-fold cross-validation splits as you did for model selection?

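A minimal sketch of a leak-free test harness, assuming scikit-learn (the logistic regression model, the parameter grid, and the synthetic dataset are illustrative assumptions): wrap preparation and model in a Pipeline so the scaler is re-fit on the training folds of every split, and nest the hyperparameter search inside an outer cross-validation loop so tuning and model evaluation use different splits.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Illustrative synthetic dataset standing in for your train set.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Preparation and model in one pipeline: the scaler is fit on the training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inner loop: hyperparameter tuning.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=inner_cv)

# Outer loop: model evaluation on splits the tuning never saw (nested cross-validation).
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```
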
Model Performance Leakage

Any knowledge of what models work well (or poorly) on the test set that is used during model selection on the train set/test harness may result in “model performance leakage”.

Is there evidence that knowledge about what works well on the test set has leaked to the test harness?

  1. Did you evaluate and review/analyze the results of the chosen model on the test set more than once? (A single, final evaluation is sketched after this list.)
  2. Did you use knowledge about what models or model configurations work well on the test set to make changes to candidate models evaluated on the train set/test harness?
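
A minimal sketch of a clean final evaluation, again assuming scikit-learn (the model and dataset are illustrative): all candidate comparison happens on the train set/test harness, and the held-out test set is touched exactly once to report the performance of the already-chosen model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# All model selection and tuning happens on the train set/test harness...
chosen = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
harness_scores = cross_val_score(chosen, X_train, y_train, cv=5)
print("test harness accuracy: %.3f" % harness_scores.mean())

# ...and the test set is used once, for a single unbiased estimate of the chosen model.
final_model = chosen.fit(X_train, y_train)
print("final test accuracy: %.3f" % final_model.score(X_test, y_test))
```

If the test set result prompts you to go back and change the model, the test set has effectively become part of the training process and a fresh hold-out set is needed for an unbiased estimate.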