Document Split
Why Documenting The Split Is Important
Documentation of train-test splitting procedures is fundamental for several critical reasons:
- Reproducibility: The exact way you split your data directly impacts model performance and evaluation metrics. Without detailed documentation of the splitting procedure, including random seeds, stratification methods, and splitting ratios, other researchers cannot recreate your results. This impedes both verification of your findings and building upon your work.
- Selection Bias Detection: Documenting your splitting approach helps reveal potential selection biases. For instance, if you split time series data randomly rather than chronologically, or if you don’t maintain class distributions during splitting, these decisions can lead to overoptimistic performance estimates. Proper documentation makes such methodological issues visible.
- Data Leakage Prevention: Clear documentation helps identify potential sources of data leakage. For example, if preprocessing steps like normalization or feature selection were performed before splitting, this could introduce leakage from the test set into the training process. Documenting the exact sequence of splitting and preprocessing helps catch these issues (a sketch of the leakage-safe ordering follows this list).
- Model Evolution Management: When models are updated or retrained over time, documented splitting procedures ensure consistency. Without this documentation, different team members might use different splitting approaches, making model performance comparisons meaningless. This is especially important in production environments where models are regularly retrained.
- Cross-Validation Integrity: If you use cross-validation, documenting how folds were created and whether they preserve any data dependencies (such as time series relationships or grouped observations) is crucial for valid performance estimation (see the second sketch after this list).
- Experiment Comparison: When comparing different modeling approaches or hyperparameter settings, consistent dataset splitting is essential. Documentation ensures that performance differences are due to actual model improvements rather than different data splits.
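To make the leakage point concrete, here is a minimal sketch of the leakage-safe ordering: split first, then fit any preprocessing on the training portion only. The file name, target column, and parameter values are illustrative assumptions, not prescriptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder file and target column; substitute your own.
df = pd.read_csv("dataset.csv")
X = df.drop(columns=["target"])
y = df["target"]

# Split BEFORE any preprocessing so test-set statistics never leak into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```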
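And a sketch of dependency-aware fold creation, continuing with the df, X, and y from the sketch above and assuming a hypothetical group_id column: GroupKFold keeps all rows of a group in the same fold, while TimeSeriesSplit preserves chronological order.

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Grouped data: rows sharing a group_id never span train and validation folds.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=df["group_id"]):
    ...  # train and evaluate one fold

# Time series data: each validation fold comes strictly after its training fold.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    ...  # train and evaluate one fold
```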
What to Document?
Below is a template for what you could document:
1. Original Dataset Information
Dataset Source
- File name and location:
dataset.csv
- Receipt date: [DATE]
- Data provider: [NAME/ORGANIZATION]
- Version/snapshot identifier: [VERSION]
- Original row count: [NUMBER]
- Original column count: [NUMBER]
Data Characteristics
- Time period covered: [RANGE]
- Known biases or limitations: [DESCRIPTION]
- Data collection methodology: [DESCRIPTION]
- Data format and schema: [DESCRIPTION]
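A small sketch for filling in the counts and schema above directly from the loaded frame (the file name is the placeholder used throughout this template):

```python
import pandas as pd

df = pd.read_csv("dataset.csv")

# Record the original row/column counts and the schema for the documentation.
print(f"Original row count: {len(df)}")
print(f"Original column count: {df.shape[1]}")
print(df.dtypes)  # data format and schema
```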
2. Data Cleaning Process
Duplicate Handling
- Duplication criteria: [DEFINITION]
- Number of duplicates found: [NUMBER]
- Number of duplicates removed: [NUMBER]
- Row count before deduplication: [NUMBER]
- Row count after deduplication: [NUMBER]
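One way to produce these numbers, sketched with pandas on the df loaded above and assuming exact-row duplication as the criterion (pass subset= to drop_duplicates for a narrower definition):

```python
# Duplication criterion here: all columns identical.
rows_before = len(df)
n_duplicates = int(df.duplicated().sum())  # duplicates beyond the first occurrence

df = df.drop_duplicates().reset_index(drop=True)
rows_after = len(df)

print(f"Row count before deduplication: {rows_before}")
print(f"Duplicates found and removed: {n_duplicates}")
print(f"Row count after deduplication: {rows_after}")
```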
Data Quality Checks
- Invalid data removal criteria: [CRITERIA]
- Examples of removed records:
[EXAMPLE RECORDS]
3. Shuffle Configuration
Shuffle Strategy
- Shuffle performed: [YES/NO]
- Justification for shuffle decision: [EXPLANATION]
- Group handling approach: [APPROACH]
- Random seed value: [NUMBER]
- Shuffle implementation:
# Code used for shuffling
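As a minimal example of what that code might look like with pandas (SEED is a placeholder for the seed value recorded above):

```python
SEED = 42  # the seed value documented above

# Shuffle all rows reproducibly; reset the index so the old order is not retained.
df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
```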
4. Stratification Configuration
Stratification Strategy
- Stratification performed: [YES/NO]
- Stratification variables: [VARIABLES]
- Group stratification approach: [APPROACH]
- Target variable distribution before/after:
[DISTRIBUTION METRICS]
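The distribution metrics can be generated with value_counts, for example (the target column name is a placeholder):

```python
# Relative class frequencies; run this on the full set before splitting and on
# each split afterwards, then record both results in the template above.
print(df["target"].value_counts(normalize=True).round(3))
```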
5. Split Configuration
Split Ratio
- Training set proportion: [%]
- Test set proportion: [%]
- Validation set (if applicable): [%]
- Alternatives considered: [OPTIONS]
- Rationale for chosen split: [EXPLANATION]
Split Implementation
# Code used for performing the split
# Include all relevant parameters and configurations
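A sketch of what such code might look like for a stratified 80/20 split with scikit-learn; the proportions, seed, and column name are illustrative:

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,
    test_size=0.2,          # test set proportion documented above
    random_state=SEED,      # seed from the shuffle configuration
    stratify=df["target"],  # stratification variable from section 4
)
```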
6. Output Verification
Dataset Statistics
- Training set size: [NUMBER]
- Test set size: [NUMBER]
- Feature distribution comparison:
[DISTRIBUTION METRICS]
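One way to produce the comparison, continuing with train_df and test_df from the split sketch above:

```python
# Side-by-side summary statistics as a spot check for drift between the splits.
comparison = train_df.describe().join(
    test_df.describe(), lsuffix="_train", rsuffix="_test"
)
print(comparison)
```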
Output Files
- Training dataset:
train.csv
- Test dataset:
test.csv
- Validation dataset (if applicable):
validation.csv
- File checksums:
[CHECKSUMS]
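The checksums can be computed with the Python standard library, for example:

```python
import hashlib
from pathlib import Path

# SHA-256 checksum for each output file; include validation.csv only if produced.
for path in ["train.csv", "test.csv", "validation.csv"]:
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    print(f"{path}: {digest}")
```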
7. Reproducibility Information
Environment Details
- Python version: [VERSION]
- Key library versions:
- pandas: [VERSION]
- scikit-learn: [VERSION]
- numpy: [VERSION]
Random State Control
- Global random seed: [NUMBER]
- Framework-specific seeds: [NUMBERS]
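A minimal sketch of global seed control for the libraries listed above:

```python
import random
import numpy as np

SEED = 42  # the global seed documented above

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's legacy global RNG
# scikit-learn has no global seed; pass random_state explicitly to each call.
```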
8. Additional Notes
- Known limitations: [DESCRIPTION]
- Special considerations: [DESCRIPTION]
- Future improvements: [DESCRIPTION]