Performance Gap Reference Range

Is the performance gap within the range of expected performance gaps of standard untuned machine learning models?

Procedure

  1. Prepare Data and Models

    • Split the dataset into train and test sets using a fixed ratio (e.g., 80/20).
    • Select at least 10 untuned standard machine learning models (e.g., logistic regression, decision trees, SVM, k-NN).
    • Ensure the models cover a diverse range of algorithms and assumptions.
  2. Train and Evaluate Models on the Train Set

    • Train each of the selected models on the train set.
    • Estimate each model's performance with the test harness (e.g., k-fold cross-validation on the train set).
    • Record the cross-validated performance metrics for each model.
  3. Evaluate Models on the Hold-Out Test Set

    • Evaluate the same trained models on the hold-out test set.
    • Record their performance metrics on this unseen data.
  4. Calculate Generalization Gaps

    • For each model, compute the generalization gap as the absolute difference between the test harness performance and the test set performance.
    • Aggregate these gaps to create a reference distribution.
  5. Assess the Chosen Model’s Generalization Gap

    • Calculate the generalization gap for the chosen model.
    • Compare this gap to the reference distribution of standard model gaps.
  6. Evaluate Confidence Interval and Standard Error

    • Determine the confidence interval and standard error of the reference distribution.
    • Check if the chosen model’s gap falls within:
      • The confidence interval of the reference distribution.
      • The standard error of the reference distribution (see the sketch following this procedure).
  7. Summarize Results and Insights

    • Report whether the chosen model’s generalization gap is consistent with the standard model gaps.
    • Highlight any deviations that warrant further investigation.
    • Provide recommendations if the gap suggests potential overfitting or underfitting.
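
The comparison in steps 5 and 6 can be implemented as a small helper. The sketch below is a minimal illustration, not part of the procedure itself: it assumes the reference gaps from step 4 are available as a list of floats and that the chosen model's gap was computed with the same protocol; the function name assess_gap and the numbers in the usage line are hypothetical.

import numpy as np
from scipy.stats import norm

def assess_gap(chosen_gap, reference_gaps, confidence=0.95):
    """Compare a chosen model's generalization gap to a reference distribution of gaps."""
    gaps = np.asarray(reference_gaps, dtype=float)
    mean_gap = gaps.mean()
    # Standard error of the mean gap across the reference models
    std_err = gaps.std(ddof=1) / np.sqrt(len(gaps))
    # Normal-approximation confidence interval for the mean reference gap
    ci_low, ci_high = norm.interval(confidence, loc=mean_gap, scale=std_err)
    return {
        "within_min_max": gaps.min() <= chosen_gap <= gaps.max(),
        "within_ci": ci_low <= chosen_gap <= ci_high,
        "within_standard_error": abs(chosen_gap - mean_gap) <= std_err,
    }

# Hypothetical reference gaps and chosen-model gap, for illustration only
print(assess_gap(0.04, [0.02, 0.03, 0.03, 0.05, 0.06, 0.02, 0.09]))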

Limitations

  • Diversity of Models

    • Results depend on the diversity of standard models chosen; using similar models may reduce the utility of the reference distribution.
  • Dataset Size

    • Small datasets may lead to unstable generalization gap estimates due to high variance in splits.
  • Untuned Models

    • Untuned models may not exhibit the same generalization gaps as well-optimized models, potentially skewing the reference distribution.
  • Test Set Leakage

    • Repeated evaluation on the same test set can inadvertently lead to overfitting to the test set.
  • Context Sensitivity

    • The diagnostic is specific to the chosen dataset and may not generalize across datasets or tasks without adjustments.

Code Example

Below is a Python function that calculates the reference distribution of the generalization gap across a suite of untuned standard machine learning algorithms, reporting the min/max gaps, central tendency, and a 95% confidence interval for the mean gap.

import numpy as np
from scipy.stats import norm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor)
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier

def generalization_gap_analysis(train, test, problem_type, metric, k=10):
    """
    Calculate the reference distribution of generalization gaps across standard untuned ML algorithms.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        problem_type (str): "classification" or "regression".
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.

    Returns:
        dict: Dictionary containing min/max gaps, mean/median, and confidence interval for generalization gaps.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Define standard untuned models based on the problem type
    if problem_type == "classification":
        models = [
            LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier(),
            GradientBoostingClassifier(), SVC(), KNeighborsClassifier(),
            DummyClassifier(strategy="most_frequent")
        ]
    elif problem_type == "regression":
        models = [
            LinearRegression(), Ridge(), RandomForestRegressor(),
            GradientBoostingRegressor(), SVR(), KNeighborsRegressor(),
            DummyRegressor(strategy="mean")
        ]
    else:
        raise ValueError("Invalid problem type. Choose 'classification' or 'regression'.")

    # Calculate gaps for each model
    gaps = []
    for model in models:
        # Test-harness performance: k-fold cross-validation on the train set
        cv_scores = cross_val_score(model, X_train, y_train, cv=k, scoring=make_scorer(metric))

        # Hold-out performance: fit on the full train set, then score on the test set
        model.fit(X_train, y_train)
        test_score = metric(y_test, model.predict(X_test))

        # Generalization gap
        gap = np.abs(np.mean(cv_scores) - test_score)
        gaps.append(gap)

    # Summarize the reference distribution of gaps
    mean_gap = np.mean(gaps)
    median_gap = np.median(gaps)
    std_gap = np.std(gaps)
    # Normal-approximation 95% confidence interval for the mean gap
    ci_low, ci_high = norm.interval(0.95, loc=mean_gap, scale=std_gap / np.sqrt(len(gaps)))

    return {
        "Min Gap": np.min(gaps),
        "Max Gap": np.max(gaps),
        "Mean Gap": mean_gap,
        "Median Gap": median_gap,
        "95% CI": (ci_low, ci_high)
    }

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic dataset
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

    # Define problem type and metric
    problem_type = "classification"
    metric = accuracy_score

    # Run the generalization gap analysis
    results = generalization_gap_analysis(
        train=(X_train, y_train),
        test=(X_test, y_test),
        problem_type=problem_type,
        metric=metric,
        k=10
    )

    # Print Results
    print("Generalization Gap Analysis Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Generalization Gap Analysis Results:
  - Min Gap: 0.01750000000000007
  - Max Gap: 0.08750000000000008
  - Mean Gap: 0.03732142857142861
  - Median Gap: 0.03249999999999997
  - 95% CI: (np.float64(0.02111231557636773), np.float64(0.05353054156648949))
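
As a follow-up, the returned statistics can be used to answer the diagnostic question for a chosen model. The continuation below is a sketch, not part of the original example: it reuses X_train, y_train, X_test, y_test, metric, and results from the demo above, and the RandomForestClassifier hyperparameters standing in for the "chosen" model are hypothetical.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
import numpy as np

# Hypothetical "chosen" model; the hyperparameters are illustrative only
chosen = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)

# Same protocol as the reference models: k-fold CV on the train set, then the hold-out test set
cv_scores = cross_val_score(chosen, X_train, y_train, cv=10, scoring=make_scorer(metric))
chosen.fit(X_train, y_train)
test_score = metric(y_test, chosen.predict(X_test))
chosen_gap = abs(np.mean(cv_scores) - test_score)

ci_low, ci_high = results["95% CI"]
print(f"Chosen model gap: {chosen_gap:.4f}")
print(f"Within reference min/max range: {results['Min Gap'] <= chosen_gap <= results['Max Gap']}")
print(f"Within 95% CI of the mean reference gap: {ci_low <= chosen_gap <= ci_high}")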