Split Size Sensitivity

Sensitivity Analysis

Common split percentages are just heuristics; it is better to know how your data and model actually behave under different split scenarios.

If you have the luxury of time and compute, perform a sensitivity analysis of split sizes (e.g. from 50/50 to 99/1 at some linear interval) and optimize for one or more target properties of stability, such as the data distribution and/or model performance. A sketch of such a sweep is given below.
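As a minimal sketch, assuming the NumPy/scikit-learn stack: the synthetic dataset, the 50/50-through-95/5 grid, and the 30 repeats per size are illustrative stand-ins, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in data; swap in your own X and y
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

train_fractions = np.arange(0.50, 1.00, 0.05)  # 50/50 through 95/5
n_repeats = 30                                 # repeated splits per size

splits = {}
for frac in train_fractions:
    frac = round(float(frac), 2)
    # repeat each split size with different seeds to estimate variability
    splits[frac] = [
        train_test_split(X, y, train_size=frac, random_state=seed)
        for seed in range(n_repeats)
    ]
# each entry holds (X_train, X_test, y_train, y_test) tuples, ready to be
# scored against the stability properties described below
```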

The motivating question here is:

What evidence is there that one split size is better than another?

Split vs Data Distribution Sensitivity

Compare data distributions of train/test sets and find the point where they diverge (see Data Distribution Tests); a code sketch of these checks follows the list below.

  1. Compare the correlations of each input/target variable (e.g. Pearson’s or Spearman’s).
  2. Compare the central tendencies for each input/target variable (e.g. t-Test or Mann-Whitney U Test).
  3. Compare the distributions for each input/target variable (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test).
  4. Compare the variance for each input/target variable (e.g. F-test or Levene’s test).
  5. Compare the divergence for the input/target variable distributions (e.g. KL-divergence or JS-divergence).
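A hedged sketch of these checks on a single 80/20 split, using SciPy’s two-sample tests; run it at each split size in the sweep to locate the divergence point. The synthetic data, the histogram bin count, and the small epsilon added before the divergence are illustrative choices.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

for i in range(X.shape[1]):
    a, b = X_tr[:, i], X_te[:, i]
    # 1. correlation of the variable with the target in each set
    r_tr, _ = stats.pearsonr(a, y_tr)
    r_te, _ = stats.pearsonr(b, y_te)
    # 2. central tendency: two-sample t-test (mannwhitneyu for the rank version)
    _, t_p = stats.ttest_ind(a, b)
    # 3. distribution shape: two-sample Kolmogorov-Smirnov test
    _, ks_p = stats.ks_2samp(a, b)
    # 4. variance: Levene's test
    _, lev_p = stats.levene(a, b)
    # 5. divergence: Jensen-Shannon distance over shared histogram bins
    bins = np.histogram_bin_edges(np.concatenate([a, b]), bins=20)
    p_hist, _ = np.histogram(a, bins=bins, density=True)
    q_hist, _ = np.histogram(b, bins=bins, density=True)
    js = jensenshannon(p_hist + 1e-12, q_hist + 1e-12)
    print(f"x{i}: dr={abs(r_tr - r_te):.3f} t_p={t_p:.3f} "
          f"ks_p={ks_p:.3f} lev_p={lev_p:.3f} js={js:.3f}")
```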

Split vs Performance Sensitivity

Compare the distributions of performance scores from a standard untuned machine learning model on train/test sets and find the point where they diverge (see Model Performance Tests); a code sketch follows the list below.

  1. Compare the correlations of model performance scores (e.g. Pearson’s or Spearman’s).
  2. Compare the central tendencies of model performance scores (e.g. t-Test or Mann-Whitney U Test).
  3. Compare the distributions of model performance scores (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test).
  4. Compare the variance of model performance scores (e.g. F-test or Levene’s test).
  5. Compare the divergence of model performance score distributions (e.g. KL-divergence or JS-divergence).
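A minimal sketch of the same idea for performance: an untuned model is refit over repeated splits at each size, and the score distributions of adjacent split sizes are compared with a KS test (step 3 above). The logistic regression baseline, the grid, and the repeat count are illustrative assumptions; the other tests in the list can be swapped in the same way.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

train_fractions = np.arange(0.50, 1.00, 0.05)  # 50/50 through 95/5
n_repeats = 30                                 # repeated splits per size

scores = {}
for frac in train_fractions:
    frac = round(float(frac), 2)
    scores[frac] = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, random_state=seed
        )
        # standard untuned model, refit on every split
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[frac].append(model.score(X_te, y_te))

# compare score distributions of adjacent split sizes; a small p-value
# flags the point where performance behavior starts to diverge
fracs = sorted(scores)
for lo, hi in zip(fracs, fracs[1:]):
    _, ks_p = stats.ks_2samp(scores[lo], scores[hi])
    print(f"{lo:.2f} vs {hi:.2f}: ks_p={ks_p:.3f}")
```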