Data Distribution

We assume that the train set and the test set are each a representative statistical sample from the domain.

As such, they should have the same data distributions.

Is there statistical evidence that the train and test sets have the same data distributions?

⚠️

Warning: Do not inspect summary statistics, plots, or any other exploratory data analysis (EDA) of the train vs test sets. You may learn details of the test set that influence data preparation, model selection, etc., introducing test set leakage and potential bias in the final model.

Checks

Do numerical input/target variables have the same central tendencies (e.g. t-Test or Mann-Whitney U Test)?
- Compare Numerical Central Tendencies
Do numerical input/target variables have the same distributions (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test)?
- Compare Numerical Distributions
Do categorical input/target variables have the same distributions (e.g. Chi-Square Test)?
- Compare Categorical Distributions
Do numerical input/target variables have the same variance (e.g. F-test or Levene’s test)?
- Compare Numerical Variance
Are pairwise correlations between numerical input variables consistent across the train and test sets (e.g. Pearson’s or Spearman’s, threshold difference, Fisher’s z-test, etc.)?
- Compare Numerical Input-Input Pairwise Correlations
Are pairwise correlations between numerical input variables and the target variable consistent across the train and test sets (e.g. Pearson’s or Spearman’s, threshold difference, Fisher’s z-test, etc.)?
- Compare Numerical Input-Target Pairwise Correlations
Do numerical variables have a low divergence in distributions (e.g. KL-divergence or JS-divergence)?
- Compare Distribution Divergence
Does the difference between the train and test sets have a small effect size for each variable (e.g. Cohen’s d or Cliff’s delta)?
- Calculate Numerical Effect Size
Do input/target variables in the train and test sets have similar distributions of univariate outliers (e.g. 1.5xIQR, 3xstdev, etc.)?
- Compare Distribution of Univariate Outliers
Do the train and test sets have a similar pattern of missing values (e.g. statistical tests applied to number of missing values)?
- Compare Distribution of Missing Values
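
As a minimal sketch of a few of the checks above, assuming each variable is available as a plain Python sequence and that NumPy and SciPy are installed, the helpers below return only p-values, so no summary statistics of the test set are inspected directly. The function names `compare_numeric` and `compare_categorical` are hypothetical, chosen for illustration.

```python
from collections import Counter

import numpy as np
from scipy import stats


def compare_numeric(train, test):
    """Return p-values for location, shape, and spread tests on one variable."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    return {
        # Central tendency: Mann-Whitney U test (non-parametric).
        "mann_whitney_u": stats.mannwhitneyu(
            train, test, alternative="two-sided"
        ).pvalue,
        # Whole distribution: two-sample Kolmogorov-Smirnov test.
        "kolmogorov_smirnov": stats.ks_2samp(train, test).pvalue,
        # Variance: Levene's test.
        "levene": stats.levene(train, test).pvalue,
    }


def compare_categorical(train, test):
    """Return the chi-square p-value for category counts in train vs test."""
    train_counts, test_counts = Counter(train), Counter(test)
    categories = sorted(set(train_counts) | set(test_counts))
    # Contingency table: one row per data set, one column per category.
    table = [
        [train_counts[c] for c in categories],
        [test_counts[c] for c in categories],
    ]
    _, p_value, _, _ = stats.chi2_contingency(table)
    return p_value
```

A p-value below a significance level chosen in advance is evidence that the variable differs between the two sets. When many variables are tested, some small p-values are expected by chance, so a multiple-comparison correction may be worth considering.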

Extensions

The above tests are used to compare variables in the train set with the test set:

Train Set vs Test Set

After that, consider repeating all tests for the combinations:

Train Set vs Whole Dataset (Train Set + Test Set)
Test Set vs Whole Dataset (Train Set + Test Set)
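
The pairings above can reuse any of the checks. As a rough sketch, assuming each variable is a plain list of numbers, the hypothetical helper `effect_sizes` below applies Cohen's d (computed here with the pooled sample standard deviation) to all three pairings using only the standard library.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def effect_sizes(train, test):
    """Cohen's d for train vs test and for each set vs the whole dataset."""
    whole = list(train) + list(test)
    return {
        "train_vs_test": cohens_d(train, test),
        "train_vs_whole": cohens_d(train, whole),
        "test_vs_whole": cohens_d(test, whole),
    }
```

Because the whole dataset contains each subset, the two samples in the last two pairings are not independent, so those values are best read as descriptive rather than as inputs to a formal test.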

In some domains you may be able to collect an additional large sample from the domain, against which you can compare the variable distributions of the train and test sets:

Train Set vs New Large Sample
Test Set vs New Large Sample
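
When such a sample is available, one option from the checks above is a distribution divergence. The following is a minimal sketch, assuming numeric lists and equal-width histogram bins shared by both samples. The bin count of 10 is an arbitrary illustrative choice, and `js_divergence`, `divergence_from_reference`, and `reference` are hypothetical names.

```python
import math


def js_divergence(a, b, bins=10):
    """Jensen-Shannon divergence (log base 2, range 0 to 1) of two samples."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins or 1.0  # guard against all-equal values

    def probs(values):
        # Histogram over bins shared by both samples, as probabilities.
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    def kl(x, y):
        # KL divergence; bins with zero mass in x contribute nothing.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    p, q = probs(a), probs(b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def divergence_from_reference(train, test, reference):
    """Compare train and test separately against a new large sample."""
    return {
        "train_vs_reference": js_divergence(train, reference),
        "test_vs_reference": js_divergence(test, reference),
    }
```

Values near 0 suggest similar binned distributions and values near 1 suggest little overlap. The result depends on the binning, so it is worth checking more than one bin count.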