What is Overfitting
TLDR: Overfitting is when a model performs well on training data but fails to generalize to new, unseen data.
Overfitting in machine learning occurs when a model learns the training data too thoroughly, capturing not only the underlying patterns but also the noise and outliers.
This excessive fit to the training data causes the model to perform exceptionally well on that data but poorly on new, unseen data. Essentially, the model fails to generalize beyond the specific examples it was trained on.
Overfitting typically arises when a model is overly complex relative to the amount and diversity of the training data.
Factors contributing to overfitting include having too many parameters, training the model for too many epochs, or using a model that is more intricate than necessary for the problem at hand. As a result, the model becomes sensitive to minor fluctuations in the training data, leading to high variance and unreliable predictions on new inputs.
To mitigate overfitting, techniques such as cross-validation, regularization, early stopping, and simplifying the model architecture are often employed. These methods aim to ensure that the model captures the true underlying patterns without being swayed by irrelevant noise, thereby improving its ability to generalize to new data.
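To make this concrete, here is a minimal sketch (using scikit-learn, with an arbitrary synthetic dataset) that fits an unconstrained decision tree and compares training accuracy with held-out test accuracy; a large gap between the two numbers is the classic symptom of overfitting.

```python
# Minimal sketch of detecting overfitting by comparing train vs. test accuracy.
# The dataset and hyperparameters are illustrative choices, not a prescription.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small, noisy synthetic dataset: easy for a flexible model to memorize.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# An unconstrained tree can grow until it fits every training example.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # typically near 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```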
Let’s take a moment to consider the notion of overfitting from a few different helpful perspectives:
- Bias-Variance Perspective: a model is overfit if it has low bias and high variance.
- Model Complexity Perspective: a model is overfit if it is too complex or has more capacity than is required for the problem.
- Optimization Perspective: a model is overfit if it is overly optimized for the training data rather than optimized for the problem the training data represents.
- Statistical Noise Perspective: a model is overfit if it captures noise in the training data rather than signal from the problem.
Bias-Variance Perspective
Understanding overfitting through the lens of the bias-variance trade-off helps us manage the balance between model complexity and generalization.
Bias-Variance Trade-Off: An Overview
In machine learning, the bias-variance trade-off is a framework for understanding the sources of error in predictive models:
- Bias refers to the error introduced by approximating a real-world problem (which may be complex) by a simpler model. High bias can cause the model to miss relevant relationships in the data, leading to underfitting.
- Variance refers to the error introduced by the model’s sensitivity to small fluctuations in the training data. High variance can cause the model to learn noise and overly complex relationships, leading to overfitting.
The overall prediction error can be broken down as:
$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
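Written out in full for squared-error loss, and assuming the data follows $y = f(x) + \varepsilon$ with zero-mean noise $\varepsilon$ of variance $\sigma^2$, the decomposition of the expected test error at a point $x$ is:

$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big] + \sigma^2$

where the expectation is taken over training sets drawn from the same distribution: the first term is the squared bias, the second is the variance, and $\sigma^2$ is the irreducible error.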
How Overfitting Fits In
Overfitting occurs when a model has low bias but high variance.
This typically happens when a model is overly complex, such as a deep neural network with many layers or a decision tree with too many branches.
Such models can fit the training data perfectly but fail to generalize to new, unseen data.
Characteristics of Overfitting:
- High Training Accuracy: The model performs exceptionally well on the training set.
- Poor Test Performance: The model’s performance drops significantly when applied to validation or test data.
- Complex Models: Overfitting often happens with models that have too many parameters relative to the amount of training data.
The Balance: Bias vs. Variance
Managing overfitting requires finding the right balance between bias and variance:
- High Bias, Low Variance: Simpler models (e.g., linear regression) are fast to train and easy to interpret, but they may underfit the data. These models may miss relevant patterns and perform poorly on both the training and test sets.
- Low Bias, High Variance: Complex models (e.g., decision trees with many nodes or deep neural networks) may fit the training data very well but perform poorly on new data due to overfitting.
- Optimal Model: A model with an optimal level of complexity has balanced bias and variance, minimizing the overall error on both the training and test sets.
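One way to see this balance in practice is to sweep a single complexity knob and watch the training and test scores move apart. The sketch below (a minimal example assuming scikit-learn, a synthetic dataset, and an arbitrary set of k values) uses k-nearest neighbors, where a small k means low bias and high variance and a large k means the reverse.

```python
# Sketch: vary k in k-nearest neighbors to trade off bias against variance.
# Small k -> low bias, high variance (overfit); large k -> high bias (underfit).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for k in [1, 5, 15, 51, 151]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}: train={knn.score(X_train, y_train):.2f}, "
          f"test={knn.score(X_test, y_test):.2f}")
```

With k=1 the training accuracy is perfect by construction (every point is its own nearest neighbor), which is exactly the low-bias, high-variance corner of the trade-off.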
Model Complexity Perspective
We can also understand overfitting from the perspective of model complexity.
Understanding Model Complexity
Model complexity refers to the capacity of a model to learn intricate patterns in the training data.
- Simple models, such as linear regression with a limited number of parameters, have low complexity and may only capture basic trends.
- Complex models, such as deep neural networks with many layers or decision trees with many splits, have high complexity and are capable of learning very detailed patterns.
When a model is overly complex, it may perform exceptionally well on the training data but fail to generalize to new, unseen data. This is because it has learned not only the true underlying patterns but also the random noise or peculiarities specific to the training set.
Causes of Overfitting Due to Complexity
- Too Many Parameters: Models with an excessive number of parameters relative to the amount of training data can learn complex relationships, including noise.
- Insufficient Regularization: Regularization techniques like L1 (Lasso) or L2 (Ridge) penalties are used to constrain model complexity. When regularization is insufficient, the model can become too flexible.
- Deep and Complex Architectures: In neural networks, models with too many layers or neurons may learn to memorize training data rather than generalizing from it.
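As a hedged sketch of the regularization point above (the dataset, polynomial degree, and alpha value are illustrative assumptions), the snippet below compares an unregularized linear model with Ridge (L2) regression on the same noisy polynomial features; the penalty typically narrows the gap between training and test error.

```python
# Sketch: L2 (Ridge) regularization constraining an over-parameterized model.
# The data generation, degree, and alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # signal + noise
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for name, reg in [("no regularization", LinearRegression()),
                  ("ridge (alpha=1.0)", Ridge(alpha=1.0))]:
    # Degree-12 polynomial features give the model far more capacity than needed.
    model = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), reg)
    model.fit(X_train, y_train)
    print(f"{name}: "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
```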
Mitigating Overfitting
- Simplify the Model: Choose a simpler algorithm or reduce the number of features or parameters.
- Use Regularization: Apply regularization methods that penalize overly complex models and encourage simpler solutions.
- Cross-Validation: Use techniques like k-fold cross-validation to assess model performance across multiple training and validation sets.
- Pruning: In decision trees, reduce the number of branches to prevent the model from fitting noise.
- Early Stopping: For iterative algorithms like gradient descent in neural networks, stop training once performance on a validation set starts to degrade.
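For example, a minimal early-stopping sketch using scikit-learn's MLPClassifier might look like the following; the architecture, validation fraction, and patience values are illustrative assumptions rather than recommendations.

```python
# Sketch: early stopping halts training when the validation score stops improving.
# Hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           flip_y=0.05, random_state=0)

# Hold out 10% of the training data internally as a validation set and stop
# once the validation score has not improved for 10 consecutive epochs.
mlp = MLPClassifier(hidden_layer_sizes=(64, 64),
                    early_stopping=True,
                    validation_fraction=0.1,
                    n_iter_no_change=10,
                    max_iter=1000,
                    random_state=0)
mlp.fit(X, y)
print("stopped after", mlp.n_iter_, "epochs")
```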
Optimization Perspective
Overfitting in machine learning, when viewed through the lens of optimization, is the result of a model that has been trained too well on known data (the training set) such that its performance on unseen data (the test or validation set) suffers.
Objective of Model Training
The goal of training a machine learning model is to find a set of parameters that minimizes an objective function (typically a loss function) on the training data.
However, the true objective is not just to minimize the loss on the training data, but to find a model that also performs well on unseen data. This performance on unseen data is captured by the concept of generalization.
Training as Optimization
From an optimization perspective, training a model means iteratively adjusting its parameters to reduce the training error.
During this process, we are optimizing the model to fit the data it has access to. However, if this optimization is too aggressive or unconstrained, the model starts capturing not just the underlying patterns but also the noise and specific idiosyncrasies of the training data.
When the optimization focuses solely on minimizing the training loss without considering the complexity of the model or the variance of the data, the model may become highly tuned to the training set, resulting in overfitting. This can be seen as an optimization problem that finds a solution that is too specific to the training data distribution, failing to generalize to new data.
Generalization Error
The generalization error is the discrepancy between the performance of the model on the training data and its performance on unseen data. Ideally, the optimization should minimize the generalization error, but direct access to this error is unavailable during training. Instead, we rely on minimizing the training loss as a proxy, with the hope that this aligns well with the generalization performance.
Overfitting occurs when this alignment fails — the training loss is minimized to such an extent that the model starts learning noise or irrelevant features. While this may decrease the training error further, it leads to an increase in the variance of the model’s predictions, making it sensitive to variations in new data and thereby increasing the generalization error.
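One practical way to observe this from the optimization side is to track training and validation loss as the parameters are updated. The sketch below (assuming scikit-learn's SGDClassifier trained with partial_fit and log loss as the metric; all hyperparameters are illustrative) prints both losses as training proceeds. The training loss keeps shrinking while the validation loss stalls or rises, and the gap between them is the generalization gap described above.

```python
# Sketch: track training vs. validation loss per epoch to watch the gap open up.
# Model, dataset, and epoch count are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# loss="log_loss" in recent scikit-learn versions (it was "log" in older releases).
clf = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01,
                    random_state=0)
for epoch in range(1, 201):
    clf.partial_fit(X_train, y_train, classes=[0, 1])
    if epoch % 50 == 0:
        train_loss = log_loss(y_train, clf.predict_proba(X_train))
        val_loss = log_loss(y_val, clf.predict_proba(X_val))
        print(f"epoch {epoch:3d}: train loss={train_loss:.3f}, "
              f"val loss={val_loss:.3f}")
```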
Statistical Noise Perspective
From the perspective of statistical noise, overfitting can be understood as a model’s inability to distinguish between the true underlying signal in the data and random fluctuations or noise. Here’s how this perspective helps explain overfitting:
Nature of Data and Noise
In any real-world dataset, there are two primary components:
- Signal: The true patterns or relationships that exist in the data, which we aim for our model to learn.
- Noise: Random, irrelevant variations that do not represent any meaningful or generalizable pattern. Noise can be due to measurement errors, outliers, or inherent randomness in the data.
A robust model should learn the signal and generalize well to new data by focusing on these true underlying patterns. At the same time, it should ignore the noise, which contributes nothing meaningful to insights or predictions.
Model Training and Noise
During training, an overly complex model with excessive capacity (e.g., too many parameters or a very flexible structure) can start to fit the noise in the training data.
This occurs because the optimization process seeks to minimize the training loss as much as possible. If the model is not constrained, it will attempt to capture every minute detail in the training set, including those caused by noise.
As a result:
- High Training Accuracy: The model achieves excellent performance on the training set because it fits not only the signal but also the noise.
- Poor Generalization: When the model is applied to new data, it fails to perform well because the noise-specific patterns it learned do not appear in the unseen data. This manifests as high generalization error.
Statistical Perspective of Overfitting
From a statistical standpoint, overfitting can be seen as a violation of the principle of parsimony or Occam’s Razor—the idea that simpler models are preferable because they are less likely to capture noise.
When a model captures noise:
- Variance Increases: The model’s output becomes highly sensitive to small changes in the input data. This is because it is modeling the noise, which varies randomly, rather than the stable signal.
- Bias Decreases: The training error is minimized, indicating that the model’s predictions are closely aligned with the training data. However, this low bias comes at the cost of high variance, making the model less reliable on new data.
Examples and Illustrations
Imagine fitting a curve to a set of data points:
- Underfitting: A simple model (e.g., a straight line) may fail to capture the true relationship, resulting in high bias and poor training performance.
- Proper Fit: A balanced model (e.g., a second- or third-degree polynomial) captures the main trends in the data without bending excessively to follow every point.
- Overfitting: A highly complex model (e.g., a high-degree polynomial) snakes through all the data points, including outliers and random fluctuations, achieving perfect training accuracy but failing to generalize.
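A minimal sketch of that curve-fitting picture (assuming NumPy, a sine-wave signal with added Gaussian noise, and arbitrary polynomial degrees) is shown below: degree 1 underfits, degree 3 captures the trend, and degree 15 chases the noise.

```python
# Sketch: underfit / proper fit / overfit with polynomial degree.
# The sine signal, noise level, and degrees are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 30))
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=30)  # signal + noise
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test)  # the noise-free signal we hope to recover

for degree in [1, 3, 15]:  # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The high-degree fit drives the training error toward zero while the test error against the true signal grows, which is the noise-fitting behavior described in this section.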
Implications of Noise Fitting
When a model overfits by learning noise:
- Spurious Patterns: The model may detect relationships that do not exist. For instance, it might create complex decision boundaries in classification tasks that only make sense for the training set.
- Reduced Predictive Power: The model’s predictions for new data points become less accurate because the noise-induced patterns do not repeat in future observations.
Book Quotes
Below are a few cherry-picked quotes from machine learning textbooks to help clarify what we mean by overfitting:
When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don’t exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
– Page 32, An Introduction to Statistical Learning with Applications in R (2014)
We have repeatedly cautioned about the problem of overfitting, where a learned model is too closely tied to the particular training data from which it was built. It is incorrect to assume that performance on the training data faithfully represents the level of performance that can be expected on the fresh data to which the learned model will be applied in practice.
– Page 286, Data Mining: Practical Machine Learning Tools and Techniques [4th edition] (2016)
We will say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
– Page 67, Machine Learning (1997)
Learning algorithms are particularly prone to overfitting, though, because they have an almost unlimited capacity to find patterns in data.
– Page 72, The Master Algorithm (2015)
In statistics, the name given to the act of mistaking noise for a signal is overfitting.
– Page 163, The Signal and the Noise (2012)
Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task they may overfit.
– Page 112, Deep Learning (2016)