Split Procedure
Ideally, the train set and test set are each a statistically representative random sample from the domain.
Best practices provide guard rails to ensure that this objective is met in most cases.
The motivating question here is:
Is there evidence that the split of the dataset into train/test subsets followed best practices?
Best Practices:
- Are samples in the dataset independent and identically distributed (i.i.d.)?
- Did you remove duplicates from the dataset before splitting into train and test sets?
- Did you shuffle the dataset when you split into train and test sets?
- Did you stratify the split by class label (only classification tasks)?
- Did you stratify the split by domain entities (only for data with entities that have more than one example, e.g. one customer with many transactions, one user with many recommendations; see the group-split sketch after this list)?
- Are the train/test sets disjoint (i.e. non-overlapping), and have you confirmed this?
- Did you use a typical split ratio (e.g. 70/30, 80/20, 90/10)?
- Did you use a library function to perform the split (see the split sketch after this list)?
- Did you document the split procedure (e.g. number of duplicates removed, split percentage, random number seed, reasons, etc.)?
- Did you save the original dataset, train set, and test set in separate files?
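The items above map onto a short script. Below is a minimal sketch of one such split procedure, assuming a pandas DataFrame loaded from a hypothetical dataset.csv with a "label" column for a classification task; the file names, column name, seed, and 80/20 ratio are placeholders, not prescriptions.

```python
# Minimal split-procedure sketch: dedupe, shuffled stratified split,
# disjointness check, separate output files, and a documented seed/ratio.
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42             # record the random seed used for the split
TEST_FRACTION = 0.2   # record the split ratio (80/20 here)

df = pd.read_csv("dataset.csv")  # hypothetical source file

# Remove exact duplicates before splitting and record how many were dropped.
n_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
n_duplicates = n_before - len(df)

# Shuffled, class-stratified split using a library function.
train_df, test_df = train_test_split(
    df,
    test_size=TEST_FRACTION,
    shuffle=True,
    stratify=df["label"],
    random_state=SEED,
)

# Confirm the train/test sets are disjoint (no shared rows by index).
assert set(train_df.index).isdisjoint(test_df.index)

# Save the original (cleaned) dataset, train set, and test set in separate files.
df.to_csv("dataset_clean.csv", index=False)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

# Document the procedure.
print(f"duplicates removed: {n_duplicates}, test fraction: {TEST_FRACTION}, seed: {SEED}")
```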
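Where rows belong to domain entities with more than one example, a group-aware split keeps every row for an entity in a single subset. Below is a minimal sketch using scikit-learn's GroupShuffleSplit, assuming a hypothetical "customer_id" column; note that this separates by entity rather than stratifying by class label (scikit-learn's StratifiedGroupKFold is one option if both are needed).

```python
# Minimal group-split sketch: all rows for one customer land in the same subset.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

SEED = 42
df = pd.read_csv("dataset.csv").drop_duplicates().reset_index(drop=True)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_idx, test_idx = next(splitter.split(df, groups=df["customer_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Confirm no customer appears in both subsets.
assert set(train_df["customer_id"]).isdisjoint(test_df["customer_id"])
```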
Related Questions
- Did you receive the train/test sets already split (e.g. from separate sources)?
- Did a stakeholder withhold the test/validation set?
- Did you remove unambiguously irrelevant features before splitting (e.g. database id, last updated datetime; see the sketch after this list)?
- According to domain experts, are samples from the dataset i.i.d. (e.g. no known spatial/temporal/group dependence, concept/distribution shift, or sample bias)?
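As a small illustration of dropping unambiguously irrelevant features before the split, the sketch below assumes hypothetical "database_id" and "last_updated" columns; the column names are examples only.

```python
# Drop columns that carry no predictive meaning (identifiers, bookkeeping timestamps).
import pandas as pd

df = pd.read_csv("dataset.csv")
df = df.drop(columns=["database_id", "last_updated"], errors="ignore")
```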