Split Size
The train/test split size is the proportion of the dataset allocated to the training set versus the test set when splitting data for machine learning tasks.
It is often described as a ratio of the original dataset, e.g. 80/20, where 80% of the examples are allocated to the training set and 20% to the test set.
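As a minimal illustration, here is an 80/20 split using scikit-learn's `train_test_split`; the toy dataset and `random_state` value are arbitrary choices made for reproducibility:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset: 1,000 examples with 20 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80/20 split: test_size=0.2 sends 20% of the examples to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 800 200
```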
Why Do We Choose a Train/Test Split Size?
- Balance Between Training and Testing:
  - Training the model requires a sufficiently large and diverse dataset to learn meaningful patterns.
  - Testing the model requires an independent subset to provide an unbiased evaluation of performance.
- Dataset Size and Task Requirements:
  - Larger datasets generally allow smaller proportions for testing, because even a small percentage provides adequate examples for evaluation.
  - Smaller datasets require a higher percentage of data for testing to ensure reliable evaluation.
Heuristic Percentages
General Guidelines for Split Ratios
- 70/30:
  - Common for datasets of moderate size (e.g., thousands of examples).
  - Allows ample data for both training and testing.
- 80/20:
  - Suitable for larger datasets, where fewer test examples are needed for reliable evaluation.
- 90/10:
  - Used for very large datasets (e.g., tens of thousands or millions of examples).
  - Provides more data for training while still leaving a large enough test set for evaluation.
Recommended Splits by Dataset Size
1. Tens of Examples (Very Small Datasets)
- Split: 60/40 or 50/50.
- Reason:
  - The dataset is so small that both sets need as many examples as possible, so a near-even split is typical.
  - Use techniques like cross-validation to make the most of the limited data (see the sketch after this list).
- Problem with Large Training Splits:
  - Insufficient data in the test set can lead to unreliable performance estimates.
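As a sketch of the cross-validation approach mentioned above, the snippet below runs 5-fold cross-validation on a tiny synthetic dataset; the model, fold count, and dataset are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Tiny toy dataset: 50 examples.
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

# 5-fold CV: every example is used for both training and evaluation
# across the folds, so no data is permanently held out.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print(scores.mean(), scores.std())  # mean accuracy and its spread across folds
```

Reporting the spread across folds gives a sense of how unstable the estimate is at this dataset size.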
2. Hundreds of Examples (Small Datasets)
- Split: 70/30 or 60/40.
- Reason:
  - A moderate split balances training set size with enough test data for meaningful evaluation.
  - Cross-validation (e.g., k-fold) is still recommended to improve reliability.
- Problem with Small Test Sets:
  - Metrics may not generalize if the test set is too small or lacks diversity.
3. Thousands of Examples (Medium Datasets)
- Split: 80/20 or 70/30.
- Reason:
  - There is enough data for both training and testing to provide reliable model evaluation.
  - Cross-validation can still be useful, but the test set is generally adequate on its own.
- Problem with Overly Large Training Sets:
  - An overly small test set may not reveal model weaknesses or provide statistical confidence in performance.
4. Tens of Thousands of Examples (Large Datasets)
- Split: 80/20 or 90/10.
- Reason:
  - At this scale, even a small test proportion (e.g., 10%) yields a statistically reliable test set.
  - A larger training set ensures the model learns effectively.
- Problem with Small Training Sets:
  - If too much data is allocated to testing, the model may not train effectively, especially on complex tasks.
5. Millions of Examples (Very Large Datasets)
- Split: 99/1, 95/5, or 90/10.
- Reason:
  - The sheer size of the dataset means even a small proportion (e.g., 1%) provides tens of thousands of examples for testing.
  - Training on as much data as possible improves model performance.
- Problem with Large Test Sets:
  - Allocating a large percentage to testing unnecessarily reduces the training data.
Summary Table
| Dataset Size | Recommended Split | Reason |
|---|---|---|
| Tens of Examples | 50/50 or 60/40 | Maximize both sets; use cross-validation to make up for limited data. |
| Hundreds of Examples | 70/30 or 60/40 | Ensure enough data in both sets; consider cross-validation. |
| Thousands | 80/20 or 70/30 | Balances training needs with sufficient test data for reliable evaluation. |
| Tens of Thousands | 80/20 or 90/10 | Training data is prioritized, but the test set remains statistically meaningful. |
| Millions | 99/1, 95/5, or 90/10 | Very large datasets allow small test proportions without sacrificing evaluation reliability. |
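To make the table concrete, here is a hypothetical helper, `recommend_split`, that encodes these heuristics; the thresholds are the rough orders of magnitude from the table, not hard cutoffs:

```python
def recommend_split(n_examples: int) -> float:
    """Return a heuristic test-set proportion for a dataset of n_examples.

    Encodes the rough guidelines from the summary table above; treat the
    thresholds as orders of magnitude, not hard rules.
    """
    if n_examples < 100:        # tens of examples
        return 0.5              # 50/50 (or 60/40); prefer cross-validation
    elif n_examples < 1_000:    # hundreds of examples
        return 0.3              # 70/30
    elif n_examples < 100_000:  # thousands to tens of thousands
        return 0.2              # 80/20
    else:                       # hundreds of thousands and up
        return 0.1              # 90/10, shrinking toward 99/1 at very large scale

print(recommend_split(500))     # 0.3
```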
Problems with Choosing a Bad Split Size
- Insufficient Training Data:
  - Too small a training set results in a poorly trained model that fails to learn meaningful patterns, leading to underfitting.
- Insufficient Test Data:
  - Too small a test set results in unreliable performance metrics.
  - Metrics may not generalize to unseen data because the test set lacks diversity or statistical power.
- Overly Large Test Set (Wasted Data):
  - In large datasets, allocating too much data to testing wastes valuable training examples that could improve the model.
- Imbalanced Representations:
  - In imbalanced datasets, an inappropriate split may leave the minority class underrepresented in the test set, leading to misleading evaluation; stratified splitting avoids this (see the sketch after this list).
- Overfitting Evaluation Metrics:
  - With small test sets, repeated evaluation on the same data can cause the model to overfit the evaluation criteria, resulting in inflated performance estimates.
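For the imbalanced case above, scikit-learn's `stratify` argument preserves class proportions across the split; a minimal sketch on a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: roughly 95% class 0, 5% class 1.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# stratify=y keeps the ~95/5 class ratio in both the train and test sets,
# so the minority class is not accidentally underrepresented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())  # both close to 0.05
```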
Best Practices for Choosing Split Sizes
- Consider Dataset Size and Task Complexity:
  - Smaller datasets require higher test proportions for reliable evaluation.
  - Larger datasets can afford smaller test proportions.
- Validate Split Appropriateness:
  - After splitting, verify that the training and test sets are representative of the original dataset (e.g., similar target distributions); a sketch of this check follows the list.
- Use Cross-Validation for Small Datasets:
  - For limited data, cross-validation provides a robust evaluation without permanently holding out any examples.
- Iterate for Optimization:
  - If unsure, experiment with multiple split sizes to find the best tradeoff between training and testing performance.
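As a sketch of the split-validation step above, the check below compares class frequencies between the training and test targets; the 5-point tolerance is an arbitrary illustrative threshold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

def class_proportions(labels):
    """Map each class label to its relative frequency."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / len(labels)))

train_dist = class_proportions(y_train)
test_dist = class_proportions(y_test)
print("train:", train_dist)
print("test: ", test_dist)

# Flag the split if any class frequency differs by more than 5 percentage
# points between the two sets (an arbitrary threshold for illustration).
for cls, p_train in train_dist.items():
    if abs(p_train - test_dist.get(cls, 0.0)) > 0.05:
        print(f"Warning: class {cls} is unevenly represented across the split.")
```

For regression targets, the same idea applies with summary statistics (means, quantiles) in place of class frequencies.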