Performance Gap Reference Range

Is the performance gap within the range of expected performance gaps of standard untuned machine learning models?

Procedure

  1. Prepare Data and Models

    • Split the dataset into train and test sets using a fixed ratio (e.g., 80/20).
    • Select at least 10 untuned standard machine learning models (e.g., logistic regression, decision trees, SVM, k-NN).
    • Ensure the models cover a diverse range of algorithms and assumptions.
  2. Train and Evaluate Models on the Train Set

    • Train each of the selected models on the train set.
    • Estimate each model's performance with the test harness (e.g., k-fold cross-validation on the train set).
    • Record the cross-validated performance metrics for each model.
  3. Evaluate Models on the Hold-Out Test Set

    • Evaluate the same trained models on the hold-out test set.
    • Record their performance metrics on this unseen data.
  4. Calculate Generalization Gaps

    • For each model, compute the generalization gap as the absolute difference between the test harness performance and the test set performance.
    • Aggregate these gaps to create a reference distribution.
  5. Assess the Chosen Model’s Generalization Gap

    • Calculate the generalization gap for the chosen model.
    • Compare this gap to the reference distribution of standard model gaps.
  6. Evaluate Confidence Interval and Standard Error

    • Determine the confidence interval and standard error of the reference distribution.
    • Check if the chosen model’s gap falls within:
      • The confidence interval of the reference distribution.
      • The standard error of the reference distribution (see the sketch following this procedure).
  7. Summarize Results and Insights

    • Report whether the chosen model’s generalization gap is consistent with the standard model gaps.
    • Highlight any deviations that warrant further investigation.
    • Provide recommendations if the gap suggests potential overfitting or underfitting.
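
The comparison in steps 5 and 6 can be implemented as a small helper. The sketch below is a minimal illustration, not part of the procedure itself: it assumes the reference gaps from step 4 are available as a list of floats and that the chosen model's gap was computed with the same protocol; the function name assess_gap and the numbers in the usage line are hypothetical.

import numpy as np
from scipy.stats import norm

def assess_gap(chosen_gap, reference_gaps, confidence=0.95):
    """Compare a chosen model's generalization gap to a reference distribution of gaps."""
    gaps = np.asarray(reference_gaps, dtype=float)
    mean_gap = gaps.mean()
    # Standard error of the mean gap across the reference models
    std_err = gaps.std(ddof=1) / np.sqrt(len(gaps))
    # Normal-approximation confidence interval for the mean reference gap
    ci_low, ci_high = norm.interval(confidence, loc=mean_gap, scale=std_err)
    return {
        "within_min_max": gaps.min() <= chosen_gap <= gaps.max(),
        "within_ci": ci_low <= chosen_gap <= ci_high,
        "within_standard_error": abs(chosen_gap - mean_gap) <= std_err,
    }

# Hypothetical reference gaps and chosen-model gap, for illustration only
print(assess_gap(0.04, [0.02, 0.03, 0.03, 0.05, 0.06, 0.02, 0.09]))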

Limitations

  • Diversity of Models

    • Results depend on the diversity of standard models chosen; using similar models may reduce the utility of the reference distribution.
  • Dataset Size

    • Small datasets may lead to unstable generalization gap estimates due to high variance in splits.
  • Untuned Models

    • Untuned models may not exhibit the same generalization gaps as well-optimized models, potentially skewing the reference distribution.
  • Test Set Leakage

    • Repeated evaluation on the same test set can inadvertently lead to overfitting to the test set.
  • Context Sensitivity

    • The diagnostic is specific to the chosen dataset and may not generalize across datasets or tasks without adjustments.

Code Example

Below is a Python function that calculates the reference distribution of the generalization gap across a suite of untuned standard machine learning algorithms, reporting the min/max gaps, central tendency, and a 95% confidence interval for the mean gap.

import numpy as np
from scipy.stats import norm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor)
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier

def generalization_gap_analysis(train, test, problem_type, metric, k=10):
    """
    Calculate the reference distribution of generalization gaps across standard untuned ML algorithms.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        problem_type (str): "classification" or "regression".
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.

    Returns:
        dict: Dictionary containing min/max gaps, mean/median, and confidence interval for generalization gaps.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Define standard untuned models based on the problem type
    if problem_type == "classification":
        models = [
            LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier(),
            GradientBoostingClassifier(), SVC(), KNeighborsClassifier(),
            DummyClassifier(strategy="most_frequent")
        ]
    elif problem_type == "regression":
        models = [
            LinearRegression(), Ridge(), RandomForestRegressor(),
            GradientBoostingRegressor(), SVR(), KNeighborsRegressor(),
            DummyRegressor(strategy="mean")
        ]
    else:
        raise ValueError("Invalid problem type. Choose 'classification' or 'regression'.")

    # Calculate gaps for each model
    gaps = []
    for model in models:
        # Test-harness performance: k-fold cross-validation on the train set
        cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))

        # Hold-out performance: fit on the full train set, then score on the test set
        model.fit(X_train, y_train)
        test_score = metric(y_test, model.predict(X_test))

        # Generalization gap
        gap = np.abs(np.mean(cv_scores) - test_score)
        gaps.append(gap)

    # Summarize the reference distribution of gaps
    mean_gap = np.mean(gaps)
    median_gap = np.median(gaps)
    std_gap = np.std(gaps)
    # Normal-approximation 95% confidence interval for the mean gap
    ci_low, ci_high = norm.interval(0.95, loc=mean_gap, scale=std_gap / np.sqrt(len(gaps)))

    return {
        "Min Gap": np.min(gaps),
        "Max Gap": np.max(gaps),
        "Mean Gap": mean_gap,
        "Median Gap": median_gap,
        "95% CI": (ci_low, ci_high)
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define problem type and metric
    problem_type = "classification"
    metric = accuracy_score

    # Run the generalization gap analysis
    results = generalization_gap_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        problem_type=problem_type,
        metric=metric,
        k=10
    )

    # Print Results
    print("Generalization Gap Analysis Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Generalization Gap Analysis Results:
  - Min Gap: 0.01750000000000007
  - Max Gap: 0.08750000000000008
  - Mean Gap: 0.03732142857142861
  - Median Gap: 0.03249999999999997
  - 95% CI: (np.float64(0.02111231557636773), np.float64(0.05353054156648949))
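
As a follow-up, the returned statistics can be used to answer the diagnostic question for a chosen model. The continuation below is a sketch, not part of the original example: it reuses X_train, y_train, X_test, y_test, metric, and results from the demo above, and the RandomForestClassifier hyperparameters standing in for the "chosen" model are hypothetical.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
import numpy as np

# Hypothetical "chosen" model; the hyperparameters are illustrative only
chosen = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)

# Same protocol as the reference models: k-fold CV on the train set, then the hold-out test set
cv_scores = cross_val_score(chosen, X_train, y_train, cv=10, scoring=make_scorer(metric))
chosen.fit(X_train, y_train)
test_score = metric(y_test, chosen.predict(X_test))
chosen_gap = abs(np.mean(cv_scores) - test_score)

ci_low, ci_high = results["95% CI"]
print(f"Chosen model gap: {chosen_gap:.4f}")
print(f"Within reference min/max range: {results['Min Gap'] <= chosen_gap <= results['Max Gap']}")
print(f"Within 95% CI of the mean reference gap: {ci_low <= chosen_gap <= ci_high}")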