Bootstrap Chosen Model
Avoid presenting a point estimate of the final chosen model's performance; instead, bootstrap the test set with a model fit on the training set and report the resulting performance distribution.
Bootstrap Method
The bootstrap method is a statistical resampling technique used to estimate the distribution of a statistic (e.g., mean, variance, or regression coefficient) by repeatedly sampling, with replacement, from the observed dataset.
It is particularly useful when the theoretical distribution of the statistic is unknown or difficult to derive.
Key steps:
- Resampling: Randomly draw samples with replacement from the original dataset to create multiple “bootstrap samples,” each the same size as the original dataset.
- Computation: Calculate the statistic of interest for each bootstrap sample.
- Estimation: Use the collection of these statistics to estimate properties such as confidence intervals, standard errors, or the overall distribution.
The bootstrap is widely used in machine learning, model validation, and inference because it is non-parametric (makes no strong assumptions about the data’s distribution) and works well with small or complex datasets.
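To make these steps concrete, here is a minimal sketch that bootstraps a simple statistic (the mean) of a synthetic sample; the sample itself is a hypothetical stand-in for real observations, and the seed, sample size, and number of resamples are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed sample of 50 values (stand-in for real data)
data = rng.normal(loc=10.0, scale=2.0, size=50)

n_bootstrap = 1000
boot_means = []
for _ in range(n_bootstrap):
    # Resampling: draw a sample of the same size, with replacement
    resample = rng.choice(data, size=len(data), replace=True)
    # Computation: the statistic of interest on this bootstrap sample
    boot_means.append(resample.mean())

# Estimation: summarize the collection of bootstrap statistics
boot_means = np.array(boot_means)
print("Bootstrap standard error of the mean:", boot_means.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_means, [2.5, 97.5]))

The same three steps (resample, compute, summarize) carry over unchanged when the statistic of interest is a model's test-set performance, as described below.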
We can use the bootstrap method to estimate distribution parameters (e.g., the median and confidence intervals) for the performance of our final chosen model.
Procedure
- Assumed Data Splits: Start with your fixed training set (used to fit the model) and a hold-out test set (used for evaluation).
- Bootstrap Resampling:
  - Generate multiple bootstrap samples (with replacement) from the test set. Each sample has the same size as the test set.
- Model Evaluation:
  - For each bootstrap sample, use the final model (already trained on the full training set) to make predictions on the resampled test data.
  - Calculate the performance metric (e.g., accuracy, F1-score, RMSE) on each bootstrap sample.
  - If you can spare the compute, retrain the model on the training set for each bootstrap sample using a different random seed, so that variance from the learning algorithm is reflected in the distribution (a sketch of this variant follows the procedure).
- Aggregate Results:
  - Collect the performance metrics from all bootstrap samples to form a performance distribution.
  - Estimate the median of the performance distribution.
  - Compute confidence intervals (e.g., 95%) by taking the appropriate percentiles of the performance metric values (e.g., the 2.5th and 97.5th percentiles).
This approach quantifies the uncertainty in the model’s performance estimate by leveraging variability in the test set, without retraining the model (unless the optional per-resample retraining is used).
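For the optional retraining variant mentioned in the procedure, here is a minimal sketch, assuming scikit-learn-style estimators: it clones and refits the model with a different seed on each bootstrap iteration so that variance from the learning algorithm is folded into the performance distribution. The function name bootstrap_performance_with_refit is illustrative, and its inputs mirror those of the bootstrap_performance function in the code example below.

from sklearn.base import clone
import numpy as np

def bootstrap_performance_with_refit(train_data, test_data, model, metric,
                                     n_bootstrap=200, random_state=None):
    """Bootstrap the test set and also refit the model on each iteration."""
    rng = np.random.default_rng(random_state)
    X_train, y_train = train_data
    X_test, y_test = test_data

    scores = []
    for i in range(n_bootstrap):
        # Fresh, unfitted copy of the model, seeded differently each iteration
        m = clone(model)
        if "random_state" in m.get_params():
            m.set_params(random_state=i)
        m.fit(X_train, y_train)

        # Resample the test set with replacement (assumes numpy arrays,
        # as in the main example) and score the refitted model
        idx = rng.integers(0, len(X_test), size=len(X_test))
        scores.append(metric(y_test[idx], m.predict(X_test[idx])))

    scores = np.array(scores)
    return {"median": np.median(scores),
            "interval": (np.percentile(scores, 2.5), np.percentile(scores, 97.5))}

Because each iteration includes a full model fit, a smaller n_bootstrap (e.g., a few hundred) is usually a practical compromise for this variant.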
Limitations
While the described bootstrap method is robust and widely used, it has several limitations:
- Dependence on Test Set Size: If the hold-out test set is small, bootstrap resampling may not adequately capture the variability in performance. Resampling the same small dataset repeatedly can produce misleadingly narrow (overly optimistic) intervals.
- Potential Overlap in Resampled Data: Because the bootstrap draws with replacement, some test samples appear multiple times in a single bootstrap sample while others are omitted. With small datasets this can distort the variability of the performance estimate.
- No Account for Model Variability: The model is trained only once on the full training data, so variability due to model training (e.g., random initialization or hyperparameter tuning) is not captured. This assumes the model itself is stable, which may not hold for some algorithms.
- Computational Cost: Although the model is trained once, predictions are made on many resampled test sets. For large datasets or complex models this can become expensive, and the optional per-resample retraining multiplies the cost further.
- Bias in Estimation: Bootstrap confidence intervals can be biased if the test set is not representative of the underlying population. The performance estimates depend heavily on how well the test set reflects the real-world data distribution.
- Correlation Among Bootstrap Samples: Resampled test sets are not independent; they are all drawn from the same original test set, which can lead to underestimating the variability in performance.
Code Example
Below is an implementation of the bootstrap procedure for estimating the performance distribution of a model on a test set, using a synthetic dataset as an example:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Function for bootstrap performance estimation
def bootstrap_performance(train_data, test_data, model, metric, n_bootstrap=1000, random_state=None):
    """
    Estimate the performance of a model on a test set using bootstrap resampling.

    Parameters:
        train_data: tuple (X_train, y_train) - Training data and labels.
        test_data: tuple (X_test, y_test) - Test data and labels.
        model: Scikit-learn model - Predefined model to be trained.
        metric: Callable - Performance metric function (e.g., accuracy_score).
        n_bootstrap: int - Number of bootstrap resamples to perform.
        random_state: int or None - Seed for reproducibility.

    Returns:
        dict - Contains the median and 95% confidence interval for the performance metric.
    """
    if random_state is not None:
        np.random.seed(random_state)

    X_train, y_train = train_data
    X_test, y_test = test_data

    # Fit the model on the full training data
    model.fit(X_train, y_train)

    # Collect bootstrap performance estimates
    performance_metrics = []
    for _ in range(n_bootstrap):
        # Resample the test set with replacement
        indices = np.random.choice(len(X_test), size=len(X_test), replace=True)
        X_resampled, y_resampled = X_test[indices], y_test[indices]

        # Predict and calculate performance on the resampled test set
        y_pred = model.predict(X_resampled)
        performance_metrics.append(metric(y_resampled, y_pred))

    # Calculate statistics
    performance_metrics = np.array(performance_metrics)
    median = np.median(performance_metrics)
    lower_bound = np.percentile(performance_metrics, 2.5)
    upper_bound = np.percentile(performance_metrics, 97.5)

    return {
        "median": median,
        "interval": (lower_bound, upper_bound)
    }
# Example: Using a synthetic dataset and RandomForestClassifier
# Create synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define model and metric
model = RandomForestClassifier(random_state=42)
metric = accuracy_score
# Use bootstrap function to estimate performance
results = bootstrap_performance(
    train_data=(X_train, y_train),
    test_data=(X_test, y_test),
    model=model,
    metric=metric,
    n_bootstrap=1000,
    random_state=42
)
# Report results
print("Expected model performance:")
print(f"Median accuracy: {results['median']:.4f}")
print(f"95% Confidence Interval: {results['interval'][0]} to {results['interval'][1]}")
Example Output
After running the code, the expected output might look like this:
Expected model performance:
Median accuracy: 0.8867
95% Confidence Interval: 0.8500 to 0.9200
This reports the model’s performance on the test set together with a clear estimate of its uncertainty, based on the bootstrap procedure.