Split Procedure
Ideally, the train set and test set are each a statistically representative random sample from the domain.
Best practices provide guard rails to ensure that this objective is met in most cases.
The motivating question here is:
Is there evidence that the split of the dataset into train/test subsets followed best practices?
Best Practices:
- Are samples in the dataset independent and identically distributed (i.i.d.)?
- Did you remove duplicates from the dataset before splitting into train and test sets?
- Did you shuffle the dataset when you split into train and test sets?
- Did you stratify the split by class label (only classification tasks)?
- Did you stratify the split by domain entities (only for data with entities that have more than one example, e.g. one customer with many transactions, one user with many recommendations; see the group-split sketch after this list)?
- Are the train/test sets disjoint (i.e. non-overlapping), and have you confirmed this?
- Did you use a typical split ratio (e.g. 70/30, 80/20, 90/10)?
- Did you use a library function to perform the split (see the split sketch after this list)?
- Did you document the split procedure (e.g. number of duplicates removed, split percentage, random number seed, reasons, etc.)?
- Did you save the original dataset, train set, and test set in separate files?
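The items above map onto a short script. Below is a minimal sketch of one such split procedure, assuming a pandas DataFrame loaded from a hypothetical dataset.csv with a "label" column for a classification task; the file names, column name, seed, and 80/20 ratio are placeholders, not prescriptions.

```python
# Minimal split-procedure sketch: dedupe, shuffled stratified split,
# disjointness check, separate output files, and a documented seed/ratio.
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42             # record the random seed used for the split
TEST_FRACTION = 0.2   # record the split ratio (80/20 here)

df = pd.read_csv("dataset.csv")  # hypothetical source file

# Remove exact duplicates before splitting and record how many were dropped.
n_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
n_duplicates = n_before - len(df)

# Shuffled, class-stratified split using a library function.
train_df, test_df = train_test_split(
    df,
    test_size=TEST_FRACTION,
    shuffle=True,
    stratify=df["label"],
    random_state=SEED,
)

# Confirm the train/test sets are disjoint (no shared rows by index).
assert set(train_df.index).isdisjoint(test_df.index)

# Save the original (cleaned) dataset, train set, and test set in separate files.
df.to_csv("dataset_clean.csv", index=False)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

# Document the procedure.
print(f"duplicates removed: {n_duplicates}, test fraction: {TEST_FRACTION}, seed: {SEED}")
```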
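Where rows belong to domain entities with more than one example, a group-aware split keeps every row for an entity in a single subset. Below is a minimal sketch using scikit-learn's GroupShuffleSplit, assuming a hypothetical "customer_id" column; note that this separates by entity rather than stratifying by class label (scikit-learn's StratifiedGroupKFold is one option if both are needed).

```python
# Minimal group-split sketch: all rows for one customer land in the same subset.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

SEED = 42
df = pd.read_csv("dataset.csv").drop_duplicates().reset_index(drop=True)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_idx, test_idx = next(splitter.split(df, groups=df["customer_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Confirm no customer appears in both subsets.
assert set(train_df["customer_id"]).isdisjoint(test_df["customer_id"])
```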
Related Questions
- Did you receive the train/test sets already split (e.g. from separate sources)?
- Did a stakeholder withhold the test/validation set?
- Did you remove unambiguously irrelevant features before splitting (e.g. database id, last updated datetime; see the sketch after this list)?
- According to domain experts, are samples from the dataset i.i.d. (e.g. no known spatial/temporal/group dependence, concept/distribution shift, or sample bias)?
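As a small illustration of dropping unambiguously irrelevant features before the split, the sketch below assumes hypothetical "database_id" and "last_updated" columns; the column names are examples only.

```python
# Drop columns that carry no predictive meaning (identifiers, bookkeeping timestamps).
import pandas as pd

df = pd.read_csv("dataset.csv")
df = df.drop(columns=["database_id", "last_updated"], errors="ignore")
```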