Stratify By Label
Stratification is the process of ensuring that the train and test sets have the same (or nearly the same) distribution of a key variable, typically the target variable such as the class label in supervised learning tasks.
It is especially important in cases where the target variable is imbalanced or has meaningful groupings that should be preserved.
For example, in a binary classification problem with a 90:10 ratio of classes, stratification ensures that both the train and test sets have roughly the same 90:10 ratio.
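The 90:10 example can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper name `stratified_split` and its parameters are hypothetical, and the labels are synthetic. The idea is to split each class separately so that every class contributes the same fraction of its samples to the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx); each label is split by test_fraction."""
    rng = random.Random(seed)

    # Group sample indices by their label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                          # shuffle within the class
        n_test = round(len(idxs) * test_fraction)  # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = [0] * 900 + [1] * 100   # synthetic 90:10 class ratio
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)  # 720 samples of class 0, 80 of class 1
print(test_counts)   # 180 samples of class 0, 20 of class 1
```

Both subsets keep the 90:10 ratio because the quota is computed per class rather than over the whole dataset.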
Why Is Stratification Important?
- Preserving Representativeness:
  - Stratification ensures that the train and test sets are representative of the entire dataset, particularly regarding the distribution of the target variable or other important groupings.
  - Without stratification, one set (train or test) may have a disproportionately high or low number of samples from certain classes or groups.
- Ensuring Reliable Model Evaluation:
  - If the test set is not representative of the true target distribution, the performance metrics calculated on the test set may not reflect real-world performance.
- Preventing Bias in Model Training:
  - Imbalanced or unrepresentative training sets can lead to biased models that fail to learn meaningful relationships for underrepresented classes.
When Is Stratification Needed?
- Imbalanced Binary Classification Problems:
  - Example: Fraud detection, where fraudulent transactions (positive class) are much rarer than non-fraudulent transactions (negative class).
  - Stratification ensures that both classes are proportionally represented in both the train and test sets.
- Multiclass Classification Problems:
  - Example: Image classification with several categories that occur at uneven frequencies (e.g., 50% “cat,” 30% “dog,” 20% “bird”).
  - Stratification ensures that the proportions of each class are consistent across the subsets.
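For the multiclass case, one common option is scikit-learn's `train_test_split` with its `stratify` argument. The sketch below assumes scikit-learn is installed and uses synthetic labels matching the 50/30/20 mix from the example; the features are placeholders.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the 50/30/20 class mix; X is a placeholder feature list.
y = ["cat"] * 500 + ["dog"] * 300 + ["bird"] * 200
X = [[i] for i in range(len(y))]

# stratify=y asks for the same class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts)  # proportions should stay at about 50/30/20
print(test_counts)   # proportions should stay at about 50/30/20
```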
When Is Stratification Not Needed?
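- Balanced Datasets:
  - If the target variable is already balanced (e.g., 50/50 binary classification or equal class sizes in multiclass problems), a random split is less likely to distort the class ratio, so stratification matters less. It can still help when the dataset is small.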
- Regression Problems:
  - In regression tasks, the target variable is continuous, so there are no classes to stratify on directly. If a representative split is still desired, the target can be binned into intervals and the bins used for stratification.
- Large Datasets:
  - When the dataset is large, random splits often approximate the true distribution of the target variable, making stratification less critical.
- Unsupervised Learning:
  - In clustering or dimensionality reduction tasks, there is no explicit target variable to stratify.
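The binning idea mentioned for regression problems can be sketched as follows. This assumes NumPy and scikit-learn are available; the target and features are synthetic, and quartile bins are just one possible choice of intervals.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed continuous target (synthetic)
X = rng.normal(size=(1000, 3))                     # placeholder features

# The three inner quartile edges split the target into four bins.
inner_edges = np.quantile(y, [0.25, 0.5, 0.75])
y_bins = np.digitize(y, inner_edges)               # integer bin ids 0..3

# Stratify on the bin ids, but keep the original continuous target for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y_bins, random_state=0
)

# Each quartile bin should contribute about a quarter of the test set.
test_bin_counts = np.bincount(np.digitize(y_test, inner_edges), minlength=4)
print(test_bin_counts)
```

The bins are used only to guide the split; the model still sees the continuous values in `y_train` and `y_test`.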
Problems If Stratification Is Needed but Not Performed
- Unrepresentative Test Set:
  - If the test set does not reflect the distribution of the target variable, performance metrics will not accurately represent real-world performance.
  - Example: With a small test set, a random split of a 90:10 dataset can leave the minority class far below its 10% share (for instance 99:1) or absent altogether, which makes accuracy misleading.
- Bias in Model Training:
  - An unbalanced training set may cause the model to overfit to the majority class and ignore the minority class.
  - Example: In fraud detection, the model might predict “no fraud” for all cases if fraudulent examples are excluded from the training set.
- Underperformance on Minority Classes:
  - If minority classes are underrepresented or missing entirely in the training set, the model will fail to learn patterns related to these classes.
- Poor Generalization:
  - A model trained on an unrepresentative training set may generalize poorly to unseen data because it has not learned the true distribution of the data.
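How far an unstratified split can drift is easy to see with a small simulation. The sketch below uses only the standard library; the dataset size (100 samples), the test size (20) and the number of repetitions are arbitrary choices for illustration, and the helper name is hypothetical.

```python
import random

def minority_share_in_test(labels, test_size, rng):
    """Draw a plain random test set and return its minority-class share."""
    test = rng.sample(labels, test_size)
    return sum(test) / test_size   # labels are 0/1, so the sum counts the minority

rng = random.Random(0)
labels = [0] * 90 + [1] * 10       # small 90:10 dataset, 1 = minority class
shares = [minority_share_in_test(labels, test_size=20, rng=rng) for _ in range(1000)]

print("smallest minority share:", min(shares))
print("largest minority share:", max(shares))
print("splits with no minority samples:", sum(s == 0 for s in shares))
```

A stratified split of the same data would place exactly 2 minority samples in every 20-sample test set, so the spread seen here comes entirely from random sampling.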
Best Practices for Stratification
- Use Stratified Splitting Tools:
  - Libraries like scikit-learn provide tools for stratified splitting (StratifiedKFold, StratifiedShuffleSplit) to automate the process.
- Verify Target Distribution Post-Split:
  - After splitting, check the class proportions in the train and test sets to ensure they match the original dataset.
- Stratify on Multiple Factors (if necessary):
  - In complex datasets, consider stratifying on combinations of key factors (e.g., class label and demographic group).
- For Regression, Consider Binning:
  - Convert continuous target variables into categorical bins (e.g., quartiles) and perform stratification on these bins.
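Several of these practices can be combined in a short scikit-learn sketch. It assumes NumPy and scikit-learn are installed; the data is synthetic, and the `group` array is a hypothetical demographic attribute. Joining label and group into a single key is one simple way to stratify on two factors at once, not the only one.

```python
from collections import Counter
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic data: an imbalanced label plus a hypothetical demographic group.
y = np.array([0] * 900 + [1] * 100)
group = np.array(["a", "b"] * 500)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

# 1) Stratified k-fold: check the minority share in each held-out fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_minority_share = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_minority_share)             # each fold should sit near 0.10

# 2) Stratify on label and group together via a combined key such as "1_a".
combined = np.array([f"{label}_{g}" for label, g in zip(y, group)])
train_idx, test_idx = train_test_split(
    np.arange(len(y)), test_size=0.2, stratify=combined, random_state=0
)

# 3) Verify the distribution after the split.
test_key_counts = Counter(combined[test_idx].tolist())
print(test_key_counts)
```

Combined keys multiply the number of strata, so very rare combinations may end up with too few samples to split; in that case coarser groupings are worth considering.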