Document Split
Why Documenting The Split Is Important
Documentation of train-test splitting procedures is fundamental for several critical reasons:
- Reproducibility: The exact way you split your data directly impacts model performance and evaluation metrics. Without detailed documentation of the splitting procedure, including random seeds, stratification methods, and splitting ratios, other researchers cannot recreate your results. This impedes both verification of your findings and building upon your work.
- Selection Bias Detection: Documenting your splitting approach helps reveal potential selection biases. For instance, if you split time series data randomly rather than chronologically, or if you don’t maintain class distributions during splitting, these decisions can lead to overoptimistic performance estimates. Proper documentation makes such methodological issues visible.
- Data Leakage Prevention: Clear documentation helps identify potential sources of data leakage. For example, if preprocessing steps like normalization or feature selection were performed before splitting, this could introduce leakage from the test set into the training process. Documenting the exact sequence of splitting and preprocessing helps catch these issues (a sketch of the leakage-safe ordering follows this list).
- Model Evolution Management: When models are updated or retrained over time, documented splitting procedures ensure consistency. Without this documentation, different team members might use different splitting approaches, making model performance comparisons meaningless. This is especially important in production environments where models are regularly retrained.
- Cross-Validation Integrity: If you use cross-validation, documenting how folds were created and whether they preserve any data dependencies (such as time series relationships or grouped observations) is crucial for valid performance estimation (see the second sketch after this list).
- Experiment Comparison: When comparing different modeling approaches or hyperparameter settings, consistent dataset splitting is essential. Documentation ensures that performance differences are due to actual model improvements rather than different data splits.
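To make the leakage point concrete, here is a minimal sketch of the leakage-safe ordering: split first, then fit any preprocessing on the training portion only. The file name, target column, and parameter values are illustrative assumptions, not prescriptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder file and target column; substitute your own.
df = pd.read_csv("dataset.csv")
X = df.drop(columns=["target"])
y = df["target"]

# Split BEFORE any preprocessing so test-set statistics never leak into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```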
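And a sketch of dependency-aware fold creation, continuing with the df, X, and y from the sketch above and assuming a hypothetical group_id column: GroupKFold keeps all rows of a group in the same fold, while TimeSeriesSplit preserves chronological order.

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Grouped data: rows sharing a group_id never span train and validation folds.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=df["group_id"]):
    ...  # train and evaluate one fold

# Time series data: each validation fold comes strictly after its training fold.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    ...  # train and evaluate one fold
```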
What to Document?
Below is a template for what you could document:
1. Original Dataset Information
Dataset Source
- File name and location:
dataset.csv
- Receipt date: [DATE]
- Data provider: [NAME/ORGANIZATION]
- Version/snapshot identifier: [VERSION]
- Original row count: [NUMBER]
- Original column count: [NUMBER]
Data Characteristics
- Time period covered: [RANGE]
- Known biases or limitations: [DESCRIPTION]
- Data collection methodology: [DESCRIPTION]
- Data format and schema: [DESCRIPTION]
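A small sketch for filling in the counts and schema above directly from the loaded frame (the file name is the placeholder used throughout this template):

```python
import pandas as pd

df = pd.read_csv("dataset.csv")

# Record the original row/column counts and the schema for the documentation.
print(f"Original row count: {len(df)}")
print(f"Original column count: {df.shape[1]}")
print(df.dtypes)  # data format and schema
```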
2. Data Cleaning Process
Duplicate Handling
- Duplication criteria: [DEFINITION]
- Number of duplicates found: [NUMBER]
- Number of duplicates removed: [NUMBER]
- Row count before deduplication: [NUMBER]
- Row count after deduplication: [NUMBER]
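One way to produce these numbers, sketched with pandas on the df loaded above and assuming exact-row duplication as the criterion (pass subset= to drop_duplicates for a narrower definition):

```python
# Duplication criterion here: all columns identical.
rows_before = len(df)
n_duplicates = int(df.duplicated().sum())  # duplicates beyond the first occurrence

df = df.drop_duplicates().reset_index(drop=True)
rows_after = len(df)

print(f"Row count before deduplication: {rows_before}")
print(f"Duplicates found and removed: {n_duplicates}")
print(f"Row count after deduplication: {rows_after}")
```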
Data Quality Checks
- Invalid data removal criteria: [CRITERIA]
- Examples of removed records:
[EXAMPLE RECORDS]
3. Shuffle Configuration
Shuffle Strategy
- Shuffle performed: [YES/NO]
- Justification for shuffle decision: [EXPLANATION]
- Group handling approach: [APPROACH]
- Random seed value: [NUMBER]
- Shuffle implementation:
# Code used for shuffling
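As a minimal example of what that code might look like with pandas (SEED is a placeholder for the seed value recorded above):

```python
SEED = 42  # the seed value documented above

# Shuffle all rows reproducibly; reset the index so the old order is not retained.
df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
```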
4. Stratification Configuration
Stratification Strategy
- Stratification performed: [YES/NO]
- Stratification variables: [VARIABLES]
- Group stratification approach: [APPROACH]
- Target variable distribution before/after:
[DISTRIBUTION METRICS]
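The distribution metrics can be generated with value_counts, for example (the target column name is a placeholder):

```python
# Relative class frequencies; run this on the full set before splitting and on
# each split afterwards, then record both results in the template above.
print(df["target"].value_counts(normalize=True).round(3))
```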
5. Split Configuration
Split Ratio
- Training set proportion: [%]
- Test set proportion: [%]
- Validation set (if applicable): [%]
- Alternatives considered: [OPTIONS]
- Rationale for chosen split: [EXPLANATION]
Split Implementation
# Code used for performing the split
# Include all relevant parameters and configurations
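A sketch of what such code might look like for a stratified 80/20 split with scikit-learn; the proportions, seed, and column name are illustrative:

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,
    test_size=0.2,          # test set proportion documented above
    random_state=SEED,      # seed from the shuffle configuration
    stratify=df["target"],  # stratification variable from section 4
)
```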
6. Output Verification
Dataset Statistics
- Training set size: [NUMBER]
- Test set size: [NUMBER]
- Feature distribution comparison:
[DISTRIBUTION METRICS]
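One way to produce the comparison, continuing with train_df and test_df from the split sketch above:

```python
# Side-by-side summary statistics as a spot check for drift between the splits.
comparison = train_df.describe().join(
    test_df.describe(), lsuffix="_train", rsuffix="_test"
)
print(comparison)
```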
Output Files
- Training dataset:
train.csv
- Test dataset:
test.csv
- Validation dataset (if applicable):
validation.csv
- File checksums:
[CHECKSUMS]
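The checksums can be computed with the Python standard library, for example:

```python
import hashlib
from pathlib import Path

# SHA-256 checksum for each output file; include validation.csv only if produced.
for path in ["train.csv", "test.csv", "validation.csv"]:
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    print(f"{path}: {digest}")
```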
7. Reproducibility Information
Environment Details
- Python version: [VERSION]
- Key library versions:
- pandas: [VERSION]
- scikit-learn: [VERSION]
- numpy: [VERSION]
Random State Control
- Global random seed: [NUMBER]
- Framework-specific seeds: [NUMBERS]
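A minimal sketch of global seed control for the libraries listed above:

```python
import random
import numpy as np

SEED = 42  # the global seed documented above

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's legacy global RNG
# scikit-learn has no global seed; pass random_state explicitly to each call.
```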
8. Additional Notes
- Known limitations: [DESCRIPTION]
- Special considerations: [DESCRIPTION]
- Future improvements: [DESCRIPTION]