Split Size Sensitivity
Sensitivity Analysis
Common split percentages are just heuristics; it is better to know how your data and model behave under different split scenarios.
If you have the luxury of time and compute, perform a sensitivity analysis of split sizes (e.g. from 50/50 to 99/1 at some linear interval) and optimize for one or more measures of stability, such as data distribution and/or model performance.
The motivating question here is:
What evidence is there that one split size is better than another?
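As a concrete starting point, the sketch below sweeps candidate split sizes from 50/50 to 99/1. It is a minimal example, not a prescribed procedure: the synthetic dataset, the `splits` dict, and all variable names are illustrative assumptions.

```python
# Sketch of a split-size sweep; the dataset and names are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=42)

# Candidate test-set fractions: 50/50 down to 99/1 at a linear interval.
test_sizes = np.linspace(0.50, 0.01, num=10)

splits = {}
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=round(test_size, 2), random_state=42
    )
    splits[round(test_size, 2)] = (X_train, X_test, y_train, y_test)

# Each candidate split can now be scored for data distribution and/or
# model performance stability, as in the sections below.
```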
Split vs Data Distribution Sensitivity
Compare the data distributions of the train/test sets and find the point where they diverge (see Data Distribution Tests); a sketch follows the list below.
- Compare the correlations of each input/target variable (e.g. Pearson’s or Spearman’s).
- Compare the central tendencies for each input/target variable (e.g. t-Test or Mann-Whitney U Test).
- Compare the distributions for each input/target variable (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test).
- Compare the variance for each input/target variable (e.g. F-test or Levene’s test).
- Compare the divergence for the input/target variable distributions (e.g. KL-divergence or JS-divergence).
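As one example of the tests above, the sketch below applies the two-sample Kolmogorov-Smirnov test to each input variable at each candidate split size. It assumes the `splits` dict built in the sweep above; the 0.05 threshold is an illustrative choice, not a recommendation.

```python
# Minimal sketch: per-feature KS tests across the candidate splits.
from scipy.stats import ks_2samp

for test_size, (X_train, X_test, y_train, y_test) in splits.items():
    # Two-sample KS test on each input variable: a small p-value
    # suggests the train and test distributions have diverged.
    pvalues = [ks_2samp(X_train[:, j], X_test[:, j]).pvalue
               for j in range(X_train.shape[1])]
    n_diverged = sum(p < 0.05 for p in pvalues)
    print(f"test_size={test_size:.2f}: {n_diverged} feature(s) diverged")
```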
Split vs Performance Sensitivity
Compare the distributions of performance scores for standard, un-tuned machine learning models on the train/test sets and find the point where they diverge (see Model Performance Tests); a sketch follows the list below.
- Compare the correlations of model performance scores (e.g. Pearson’s or Spearman’s).
- Compare the central tendencies of model performance scores (e.g. t-Test or Mann-Whitney U Test).
- Compare the distributions of model performance scores (e.g. Kolmogorov-Smirnov Test or Anderson-Darling Test).
- Compare the variance of model performance scores (e.g. F-test or Levene’s test).
- Compare the divergence of model performance score distributions (e.g. KL-divergence or JS-divergence).
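The sketch below shows one way to realize this, again assuming the `splits` dict from the sweep above. The choice of logistic regression as the standard un-tuned model, bootstrap resampling to estimate score distributions, and the Mann-Whitney U test are all illustrative assumptions.

```python
# Minimal sketch: un-tuned model scores on each split, with bootstrap
# score distributions compared via the Mann-Whitney U test.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def bootstrap_scores(model, X, y, n_boot=100):
    """Accuracy of a fitted model on n_boot bootstrap resamples of (X, y)."""
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        scores.append(accuracy_score(y[idx], model.predict(X[idx])))
    return np.array(scores)

for test_size, (X_train, X_test, y_train, y_test) in splits.items():
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    train_scores = bootstrap_scores(model, X_train, y_train)
    test_scores = bootstrap_scores(model, X_test, y_test)
    # Mann-Whitney U: a small p-value suggests the central tendencies
    # of the train/test score distributions have diverged.
    p = mannwhitneyu(train_scores, test_scores).pvalue
    print(f"test_size={test_size:.2f}: p={p:.4f}")
```

The same loop structure accommodates any of the other tests listed above; only the statistic computed on the two score samples changes.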