Data Distribution

We assume that the train set and the test set are each a representative statistical sample from the domain.

As such, they should have the same data distributions.

Is there statistical evidence that the train and test sets have the same data distributions?

⚠️
Warning: Do not inspect summary statistics, plots, or any exploratory data analysis (EDA) of train vs test sets. Doing so may reveal details of the test set that influence data preparation, model selection, etc., introducing test set leakage and potential bias in the final model.

Checks

  1. Do numerical input/target variables have the same central tendencies (e.g. t-Test or Mann-Whitney U Test)?
  2. Do numerical input/target variables have the same distributions (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test)?
  3. Do categorical input/target variables have the same distributions (e.g. Chi-Square Test)?
  4. Do numerical input/target variables have the same variance (e.g. F-test, Levene’s test)?
  5. Are pair-wise correlations between numerical input variables consistent across the train and test sets (e.g. Pearson’s or Spearman’s, threshold difference, Fisher’s z-test, etc.)?
  6. Are pair-wise correlations between numerical input variables and the target variable consistent across the train and test sets (e.g. Pearson’s or Spearman’s, threshold difference, Fisher’s z-test, etc.)?
  7. Do numerical variables have a low divergence in distributions (e.g. KL-divergence or JS-divergence)?
  8. Is the effect size of the train vs test difference small for each variable (e.g. Cohen’s d or Cliff’s delta)?
  9. Do input/target variables in the train and test sets have a similar distribution of univariate outliers (e.g. 1.5xIQR, 3xstdev, etc.)?
  10. Do the train and test sets have a similar pattern of missing values (e.g. statistical tests applied to number of missing values)?
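
A few of the univariate checks above can be sketched in Python with NumPy and SciPy. This is a minimal illustration rather than a definitive implementation: the function names (`compare_numerical`, `compare_categorical`, `cohens_d`), the 0.05 significance level, and the particular tests chosen are assumptions made for the example. The functions return only p-values and reject/fail-to-reject flags, in keeping with the warning above.

```python
from collections import Counter

import numpy as np
from scipy import stats


def _clean(values):
    """Convert to a float array and drop missing values (NaN)."""
    arr = np.asarray(values, dtype=float)
    return arr[~np.isnan(arr)]


def compare_numerical(train, test, alpha=0.05):
    """Two-sample tests for one numerical variable (checks 1, 2 and 4)."""
    train, test = _clean(train), _clean(test)
    p_values = {
        # Central tendency, non-parametric.
        "mann_whitney_u": stats.mannwhitneyu(
            train, test, alternative="two-sided"
        ).pvalue,
        # Whole distribution.
        "kolmogorov_smirnov": stats.ks_2samp(train, test).pvalue,
        # Variance.
        "levene": stats.levene(train, test).pvalue,
    }
    return {
        name: {"p_value": float(p), "reject_null": bool(p < alpha)}
        for name, p in p_values.items()
    }


def compare_categorical(train, test, alpha=0.05):
    """Chi-square test on the train/test contingency table (check 3)."""
    train_counts, test_counts = Counter(train), Counter(test)
    categories = sorted(set(train_counts) | set(test_counts))
    table = np.array(
        [
            [train_counts[c] for c in categories],
            [test_counts[c] for c in categories],
        ]
    )
    p_value = stats.chi2_contingency(table)[1]
    return {"p_value": float(p_value), "reject_null": bool(p_value < alpha)}


def cohens_d(train, test):
    """Cohen's d with a pooled standard deviation (check 8)."""
    train, test = _clean(train), _clean(test)
    n1, n2 = len(train), len(test)
    pooled_var = (
        (n1 - 1) * train.var(ddof=1) + (n2 - 1) * test.var(ddof=1)
    ) / (n1 + n2 - 2)
    return float((train.mean() - test.mean()) / np.sqrt(pooled_var))


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
train_x = rng.normal(loc=0.0, scale=1.0, size=800)
test_x = rng.normal(loc=0.0, scale=1.0, size=200)
for name, result in compare_numerical(train_x, test_x).items():
    print(name, "reject null:", result["reject_null"])
```

When many variables are tested, some rejections are expected by chance alone, so a correction for multiple comparisons (e.g. Bonferroni) may be worth considering.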

Extensions

The above tests are used to compare variables in the train set with the test set:

  • Train Set vs Test Set

After that, consider repeating all tests for the combinations:

  • Train Set vs Whole Dataset (Train Set + Test Set)
  • Test Set vs Whole Dataset (Train Set + Test Set)
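
As a rough sketch of how one test could be repeated across these pairings, the hypothetical helper below (`ks_across_pairings` is a name invented for this example) applies the two-sample Kolmogorov-Smirnov test from SciPy to a single numerical variable. Note that the whole dataset contains the train and test sets, so the samples in the last two pairings are not independent and their p-values should be read as approximate.

```python
import numpy as np
from scipy import stats


def ks_across_pairings(train, test):
    """KS test p-values for train vs test and each set vs the whole dataset."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    # Whole dataset = train set + test set.
    whole = np.concatenate([train, test])
    pairings = {
        "train_vs_test": (train, test),
        "train_vs_whole": (train, whole),
        "test_vs_whole": (test, whole),
    }
    return {
        name: float(stats.ks_2samp(a, b).pvalue)
        for name, (a, b) in pairings.items()
    }


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
p_values = ks_across_pairings(rng.normal(size=800), rng.normal(size=200))
for name, p in p_values.items():
    print(name, "reject null at 0.05:", p < 0.05)
```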

In some domains you may be able to collect an additional large sample from the domain, against which you can compare the variable distributions of the train and test sets:

  • Train Set vs New Large Sample
  • Test Set vs New Large Sample