Ensemble Chosen Model
Reduce the variance in predictions (prediction errors) of the final chosen model.
High Variance in the Final Chosen Model
Modern machine learning models (e.g. XGBoost, Random Forest, Neural Nets) are inherently unstable given slightly different initial conditions.
This results in variance in the predictions made by the final model.
Variance refers to the model’s sensitivity to fluctuations in the training data or random initialization (e.g., random number seeds). High variance models tend to perform well on the specific training data they were trained on but can yield significantly different predictions and errors when retrained on slightly different data or with different random seeds. This instability can lead to less reliable predictions in real-world scenarios.
When a model exhibits high variance, its performance estimates on a test set may not fully capture its predictive behavior in deployment. While the test set provides a snapshot of performance, each re-fit of the model introduces variability that can lead to a range of outcomes in terms of predictions and prediction errors. This means that the final chosen model might perform differently in practice depending on the specific realization of the training process. For data scientists, this inconsistency can present challenges in deploying the model confidently and communicating its reliability to stakeholders.
The root cause of this issue often lies in the complexity of the model or the underlying randomness in training processes, such as random initialization of weights in neural networks or sampling noise in ensemble methods like random forests or gradient boosting. While methods such as cross-validation can help provide robust performance estimates, they don’t eliminate the inherent variability of high-variance models.
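To make this instability concrete, the sketch below fits the same model twice, changing only the random seed, and prints how far apart the two fits' predictions land. The synthetic dataset and hyperparameters are illustrative, not from any particular pipeline.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Same data, same architecture; only the seed differs between the two fits.
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

preds = []
for seed in (0, 1):
    model = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=seed)
    model.fit(X, y)
    preds.append(model.predict(X[:5]))

# Nonzero gaps show that a single re-fit can move the predictions.
print(np.abs(preds[0] - preds[1]))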
Ensemble of Final Chosen Models
One effective approach to addressing high variance in a final chosen model is to create an ensemble of models.
In this technique, multiple instances of the final chosen model are trained on the same training dataset but with different random number seeds (or other sources of randomness, such as data shuffling or weight initialization).
Each model in the ensemble independently learns the patterns in the data, and their predictions are combined—often through averaging (for regression) or majority voting (for classification)—to produce the final output.
By aggregating the predictions of multiple models, the ensemble approach reduces the impact of any one model’s high variance, leading to more stable and reliable predictions.
This is because the errors made by individual models are less likely to align perfectly, and their aggregation tends to smooth out fluctuations, yielding lower variance in the overall prediction.
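As a rough illustration of why averaging helps: if the errors of N models were independent with standard deviation sigma, the standard deviation of their average would shrink to sigma divided by the square root of N. The short simulation below (pure NumPy, illustrative numbers) shows this shrinkage; in practice model errors are correlated, so the real reduction is smaller.

import numpy as np

rng = np.random.default_rng(0)
# 10 simulated "models", each predicting a true value of 100 plus
# independent noise with standard deviation 5.
preds = 100 + rng.normal(scale=5.0, size=(10, 100000))

print(preds.std())               # spread of individual models: ~5.0
print(preds.mean(axis=0).std())  # spread of the 10-model average: ~5 / sqrt(10), i.e. ~1.58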
This technique is particularly useful when the chosen model is inherently high-variance but has strong individual performance.
While it requires additional computational resources to train and store multiple models, the benefits often outweigh the costs, especially in scenarios where prediction stability and reliability are critical.
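The full code example below covers the regression case (averaging). For completeness, here is a minimal sketch of the classification counterpart using majority voting; the synthetic dataset and hyperparameters are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Train the same classifier with five different seeds.
models = []
for seed in range(5):
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X, y)
    models.append(clf)

# Each row holds one model's predicted labels; the majority vote is the
# most frequent label in each column.
votes = np.array([m.predict(X[:5]) for m in models])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(majority)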
Procedure
1. Set Up the Ensemble Framework:
- Define the number of models N to include in the ensemble. This depends on available computational resources and the desired reduction in variance.
2. Train Multiple Models:
- For each model i in the ensemble (i = 1, 2, ..., N):
  - Set a unique random seed for the model (e.g., using a random number generator or manually specifying seeds).
  - Fit the chosen model architecture on the full training dataset using the set random seed.
  - Save the trained model for future use (e.g., serialize and store it; see the serialization sketch after this list).
3. Aggregate the Ensemble:
- Collect all N trained models into a structured format (e.g., a list or dictionary for easy access).
4. Make Predictions with the Ensemble:
- For a new dataset:
  - Use each trained model in the ensemble to generate predictions for the input data.
  - Combine the predictions from all models. Typically, take the average of the predictions for regression tasks or use majority voting for classification tasks.
5. Output the Final Prediction:
- Return the aggregated prediction as the ensemble’s final output (e.g., the mean).
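The "save the trained model" step is commonly handled with joblib for scikit-learn estimators. A minimal sketch follows; the file-name pattern is illustrative, and models is assumed to be the list returned by the training loop.

import joblib

# Write each trained model to its own file, then reload the whole ensemble.
for i, model in enumerate(models):
    joblib.dump(model, f"ensemble_model_{i}.joblib")

loaded_models = [joblib.load(f"ensemble_model_{i}.joblib") for i in range(len(models))]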
Code Example
Below is Python code to demonstrate creating an ensemble of models trained with different random seeds and making predictions using their averaged outputs.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Function to create an ensemble of models
def create_model_ensemble(model_class, model_params, X_train, y_train, num_models):
    """
    Trains an ensemble of models with different random seeds.

    Args:
    - model_class: The model class to be instantiated (e.g., RandomForestRegressor).
    - model_params: Dictionary of parameters to initialize the model.
    - X_train: Feature matrix for training.
    - y_train: Target values for training.
    - num_models: Number of models in the ensemble.

    Returns:
    - models: List of trained models.
    """
    models = []
    for seed in range(num_models):
        # Copy the parameters so the caller's dictionary is not mutated.
        params = {**model_params, "random_state": seed}
        model = model_class(**params)
        model.fit(X_train, y_train)
        models.append(model)
    return models

# Function to make predictions using an ensemble
def predict_with_ensemble(models, X_new):
    """
    Makes predictions using an ensemble of models by averaging their outputs.

    Args:
    - models: List of trained models.
    - X_new: Feature matrix for which predictions are to be made.

    Returns:
    - ensemble_predictions: Averaged predictions from the ensemble.
    """
    predictions = np.array([model.predict(X_new) for model in models])
    ensemble_predictions = predictions.mean(axis=0)
    return ensemble_predictions

# Demonstration with synthetic data
if __name__ == "__main__":
    # Create a synthetic regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

    # Define the base model and its parameters
    model_class = RandomForestRegressor
    model_params = {"n_estimators": 100, "max_depth": 3}

    # Create an ensemble of models
    num_models = 10
    ensemble = create_model_ensemble(model_class, model_params, X, y, num_models)

    # Make predictions with the ensemble
    y_pred_ensemble = predict_with_ensemble(ensemble, X[:5])
    print(y_pred_ensemble)
Example Output
[ 23.61090893 -46.93659589 93.71510746 -90.87090628 -23.67096723]
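To quantify how much the individual fits disagree before averaging, the per-model predictions can be inspected directly. A small follow-up, assuming the ensemble and X from the example above are still in scope:

predictions = np.array([model.predict(X[:5]) for model in ensemble])
# Standard deviation across the ensemble at each input: an estimate of
# how far a single re-fit could stray from the averaged prediction.
print(predictions.std(axis=0))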