Split Size

The train/test split size is the proportion of the dataset allocated to the training set versus the test set when splitting data for machine learning tasks.

It is often described as a ratio of the original dataset, e.g., 80/20, where 80% of the dataset is allocated to the training set and 20% to the test set.
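
In scikit-learn, for example, the split size is expressed as a fraction via the test_size parameter of train_test_split. A minimal sketch of an 80/20 split (the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset: 1,000 examples, 5 features.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# test_size=0.2 allocates 20% of the data to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 800 200
```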

Why Do We Choose a Train/Test Split Size?

  1. Balance Between Training and Testing:
    • Training the model requires a sufficiently large and diverse dataset to learn meaningful patterns.
    • Testing the model requires an independent subset to provide an unbiased evaluation of performance.
  2. Dataset Size and Task Requirements:
    • Larger datasets generally allow smaller proportions for testing, because even a small percentage yields an adequate absolute number of examples for evaluation (see the quick computation below).
    • Smaller datasets require a higher percentage for testing to ensure reliable evaluation.
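
A quick back-of-the-envelope computation makes the point: what matters for evaluation is the absolute number of test examples, not the percentage.

```python
# The same test fraction yields very different absolute
# test-set sizes depending on the overall dataset size.
for n in (100, 1_000, 100_000, 1_000_000):
    for frac in (0.5, 0.2, 0.01):
        print(f"n={n:>9,}  test fraction={frac:>4}  test examples={int(n * frac):>7,}")
```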

Heuristic Percentages

General Guidelines for Split Ratios

  • 70/30:
    • Common for datasets of moderate size (e.g., thousands of examples).
    • Allows ample data for both training and testing.
  • 80/20:
    • Suitable for larger datasets where fewer test examples are needed for reliable evaluation.
  • 90/10:
    • Used for very large datasets (e.g., tens of thousands or millions of examples).
    • Provides more data for training while still having a large enough test set for evaluation.

Recommended Splits by Dataset Size

1. Tens of Examples (Very Small Datasets)

  • Split: 60/40 or 50/50.
  • Reason:
    • The dataset is so small that both sets need as many examples as possible, so a near-even split is used.
    • Use techniques like cross-validation to make the most of the limited data (see the leave-one-out sketch below).
  • Problem with Large Training Splits:
    • Leaving too little data in the test set yields unreliable performance estimates.
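
With only tens of examples, leave-one-out cross-validation lets every example serve for both training and evaluation. A sketch with scikit-learn and illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Tiny illustrative dataset: 30 examples ("tens").
X = np.random.rand(30, 4)
y = np.random.randint(0, 2, size=30)

# Each of the 30 folds holds out exactly one example for testing.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.2f}")
```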

2. Hundreds of Examples (Small Datasets)

  • Split: 70/30 or 60/40.
  • Reason:
    • A moderate split balances training set size with enough test data for meaningful evaluation.
    • Cross-validation (e.g., k-fold) is still recommended to improve reliability (see the sketch after this section).
  • Problem with Small Test Sets:
    • Metrics may not generalize if the test set is too small or lacks diversity.
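
A sketch of 5-fold cross-validation on a few hundred examples; each fold serves as the test set exactly once, and the spread of fold scores hints at how reliable the estimate is. The synthetic data is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Illustrative dataset: 300 examples ("hundreds").
X = np.random.rand(300, 4)
y = np.random.randint(0, 2, size=300)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)
print(f"fold accuracies: {scores.round(2)}  mean: {scores.mean():.2f}")
```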

3. Thousands of Examples (Medium Datasets)

  • Split: 80/20 or 70/30.
  • Reason:
    • There’s enough data for both training and testing to provide reliable model evaluation.
    • Cross-validation can still be useful, but test size is generally adequate.
  • Problem with Overly Large Training Sets:
    • An overly small test set may not reveal model weaknesses or provide statistical confidence in performance (see the margin-of-error sketch below).
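
One way to quantify "statistical confidence" is the margin of error of an accuracy estimate. Under a simple binomial assumption (each test prediction is an independent trial), the approximate 95% margin is 1.96 · sqrt(p(1 − p)/n), which shrinks only with the absolute test-set size n:

```python
import math

p = 0.85  # assumed observed test accuracy (illustrative)
for n in (100, 400, 2_000):
    # Approximate 95% margin of error for a binomial proportion.
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"n={n:>5}: accuracy {p:.2f} ± {margin:.3f}")
```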

4. Tens of Thousands of Examples (Large Datasets)

  • Split: 80/20 or 90/10.
  • Reason:
    • Large datasets allow smaller test proportions (e.g., 10%) to provide a statistically reliable test set.
    • A larger training set ensures the model learns effectively.
  • Problem with Small Training Sets:
    • If too much data is allocated to testing, the model may not train effectively, especially for complex tasks.

5. Millions of Examples (Very Large Datasets)

  • Split: 99/1, 95/5, or 90/10.
  • Reason:
    • The sheer size of the dataset means even a small proportion (e.g., 1%) provides tens of thousands of examples for testing.
    • Training on as much data as possible improves model performance.
  • Problem with Large Test Sets:
    • Allocating a large percentage to testing unnecessarily reduces the training data.

Summary Table

| Dataset Size | Recommended Split | Reason |
| --- | --- | --- |
| Tens of examples | 50/50 or 60/40 | Maximize both sets; use cross-validation to make up for limited data. |
| Hundreds of examples | 70/30 or 60/40 | Ensure enough data in both sets; consider cross-validation. |
| Thousands | 80/20 or 70/30 | Balances training needs with sufficient test data for reliable evaluation. |
| Tens of thousands | 80/20 or 90/10 | Training data is prioritized, but the test set remains statistically meaningful. |
| Millions | 99/1, 95/5, or 90/10 | Large datasets allow smaller test proportions without sacrificing evaluation reliability. |
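
As a rough, hypothetical codification of the table above (the thresholds are illustrative heuristics, not canonical values):

```python
def suggested_test_fraction(n_examples: int) -> float:
    """Heuristic test-set fraction by dataset size (illustrative thresholds)."""
    if n_examples < 100:        # tens of examples
        return 0.5
    if n_examples < 1_000:      # hundreds
        return 0.3
    if n_examples < 10_000:     # thousands
        return 0.2
    if n_examples < 1_000_000:  # tens/hundreds of thousands
        return 0.1
    return 0.01                 # millions

print(suggested_test_fraction(500))        # 0.3
print(suggested_test_fraction(5_000_000))  # 0.01
```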

Problems with Choosing a Bad Split Size

  1. Insufficient Training Data:
    • Too small a training set results in a poorly trained model that fails to learn meaningful patterns, leading to underfitting.
  2. Insufficient Test Data:
    • Too small a test set results in unreliable performance metrics.
    • Metrics may not generalize to unseen data because the test set lacks diversity or statistical power.
  3. Overly Large Test Set (Wasted Data):
    • In large datasets, allocating too much data to testing wastes training examples that could improve the model.
  4. Imbalanced Representations:
    • In imbalanced datasets, a careless split may leave the minority class underrepresented in the test set, leading to misleading evaluation; stratified splitting avoids this (see the sketch after this list).
  5. Overfitting Evaluation Metrics:
    • With a small test set, repeatedly evaluating and tuning against the same data effectively fits model choices to that test set, resulting in inflated performance estimates.
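
A minimal sketch of a stratified split using scikit-learn's stratify parameter; the imbalanced synthetic labels are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
# Imbalanced labels: roughly 5% positives.
y = (np.random.rand(1000) < 0.05).astype(int)

# stratify=y preserves the class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"positive rate in train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```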

Best Practices for Choosing Split Sizes

  1. Consider Dataset Size and Task Complexity:
    • Smaller datasets require higher test proportions for reliable evaluation.
    • Larger datasets can afford smaller test proportions.
  2. Validate Split Appropriateness:
    • After splitting, verify that the training and test sets are representative of the original dataset, e.g., that they have similar target distributions (see the sketch after this list).
  3. Use Cross-Validation for Small Datasets:
    • For limited data, cross-validation ensures robust evaluation without setting aside a fixed test split.
  4. Iterate for Optimization:
    • If unsure, experiment with multiple split sizes to find a good tradeoff between training-set size and evaluation reliability.
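
A minimal representativeness check, comparing the target distribution in the train and test sets (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 5)
y = np.random.randint(0, 3, size=2000)  # three classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Compare per-class proportions; large gaps suggest a skewed split.
for name, labels in (("train", y_train), ("test", y_test)):
    values, counts = np.unique(labels, return_counts=True)
    proportions = (counts / len(labels)).round(3)
    print(name, dict(zip(values.tolist(), proportions.tolist())))
```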