Model Has Skill

Is the difference in train and test set scores caused by the chosen model not having skill on the problem?

Procedure

  1. Establish a Baseline Model

    • Create a simple baseline model (e.g., predict mean, median, or mode of the target variable).
    • Evaluate the baseline model on the same test harness and test set used for the chosen model, using relevant metrics (e.g., RMSE for regression, accuracy for classification).
  2. Train and Evaluate the Chosen Model

    • Train the chosen model using the training set.
    • Evaluate the model on the test harness (e.g., k-fold cross-validation or validation set).
    • Record the evaluation metrics such as accuracy, F1 score, or R^2.
  3. Evaluate on the Test Set

    • Use the test set to assess the chosen model’s performance on unseen data.
    • Record the same metrics used during validation.
  4. Compare Against Baseline

    • Compare the chosen model’s performance to the baseline model:
      • Does the chosen model outperform the baseline model on both the test harness and the test set?
      • Calculate the percentage improvement of the chosen model over the baseline (a sketch follows this procedure).
  5. Quantify the Performance Gap

    • Calculate the performance gap between the score on the test harness (e.g., cross-validation or validation set) and the score on the test set:
      • For the chosen model.
      • For the baseline model, if relevant.
    • Highlight significant gaps that suggest overfitting or underfitting.
  6. Summarize Findings

    • Report whether there is empirical evidence that the chosen model outperforms the baseline model.
    • Highlight if the performance gap warrants further investigation or if the model’s performance is stable across evaluation methods.
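
Steps 1 and 4 can be implemented directly with scikit-learn's dummy estimators. The sketch below is a minimal illustration, assuming a synthetic classification dataset, accuracy as the metric, and 10-fold cross-validation as the test harness; these are illustrative choices rather than requirements.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Baseline: always predict the most frequent class
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=10)
# Chosen model evaluated on the same test harness
model_scores = cross_val_score(LogisticRegression(), X, y, cv=10)

baseline_mean = np.mean(baseline_scores)
model_mean = np.mean(model_scores)
improvement = 100.0 * (model_mean - baseline_mean) / baseline_mean

print(f"Baseline accuracy: {baseline_mean:.3f}")
print(f"Chosen model accuracy: {model_mean:.3f}")
print(f"Percentage improvement: {improvement:.1f}%")

Evaluating the baseline on the same test harness as the chosen model keeps the two scores, and the percentage improvement computed from them, directly comparable.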

Limitations

  • Baseline Model Assumptions

    • The choice of the baseline model impacts the comparison; a poor baseline may overstate the chosen model’s skill.
  • Dataset Size and Quality

    • Small or low-quality datasets can lead to unstable performance estimates and misleading comparisons.
  • Metric Sensitivity

    • Results depend on the selected evaluation metric. Ensure the metric aligns with the business or research goal.
  • Test Set Coverage

    • If the test set does not adequately represent the real-world scenario, the observed performance gap may not generalize.
  • Overfitting to Validation

    • Excessive tuning on the validation set can inflate performance metrics, leading to an inaccurate gap analysis (see the sketch after this list).
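
As noted in the last limitation, repeated tuning against the same validation data can inflate the test harness score and distort the gap analysis. One common mitigation is nested cross-validation, where hyperparameter search runs inside each outer fold so the outer estimate is not biased by the tuning. The following is a minimal sketch only; the model, parameter grid, and fold counts are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Inner loop: hyperparameter tuning; outer loop: performance estimate untouched by that tuning
inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
outer_scores = cross_val_score(inner_search, X, y, cv=5)

print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")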

Code Example

The following Python function implements a diagnostic test to evaluate the generalization gap for a chosen model and a dummy model, assessing whether the chosen model has sufficient skill compared to the baseline.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier, DummyRegressor

def diagnostic_test(train, test, problem_type, metric, k, chosen_model):
    """
    Perform a diagnostic test to evaluate the generalization gap and model skill.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        problem_type (str): "classification" or "regression".
        metric: Performance metric function compatible with scikit-learn; higher values should indicate better performance (e.g., accuracy or R^2) so the skill comparison is valid.
        k (int): Number of folds for cross-validation.
        chosen_model: The machine learning model to evaluate.

    Returns:
        dict: A dictionary containing performance gaps and skill assessment.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Select appropriate dummy model
    if problem_type == "classification":
        dummy_model = DummyClassifier(strategy="most_frequent")
    elif problem_type == "regression":
        dummy_model = DummyRegressor(strategy="mean")
    else:
        raise ValueError("Invalid problem type. Choose 'classification' or 'regression'.")

    # Evaluate dummy model
    dummy_cv_scores = cross_val_score(dummy_model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    dummy_model.fit(X_train, y_train)
    dummy_test_score = metric(y_test, dummy_model.predict(X_test))
    dummy_gap = np.abs(np.mean(dummy_cv_scores) - dummy_test_score)

    # Evaluate chosen model
    chosen_cv_scores = cross_val_score(chosen_model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    chosen_model.fit(X_train, y_train)
    chosen_test_score = metric(y_test, chosen_model.predict(X_test))
    chosen_gap = np.abs(np.mean(chosen_cv_scores) - chosen_test_score)

    # Compare skill (assumes higher metric values indicate better performance)
    skill_gap = chosen_test_score - dummy_test_score
    has_skill = skill_gap > 0

    return {
        "Dummy Model Gap": dummy_gap,
        "Chosen Model Gap": chosen_gap,
        "Skill Gap": skill_gap,
        "Has Skill": has_skill
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define problem type, metric, and chosen model
    problem_type = "classification"
    metric = accuracy_score
    chosen_model = LogisticRegression()

    # Run the diagnostic test
    results = diagnostic_test(
        train=(X_train, y_train),
        test=(X_test, y_test),
        problem_type=problem_type,
        metric=metric,
        k=10,
        chosen_model=chosen_model
    )

    # Print Results
    print("Diagnostic Test Results:")
    for key, value in results.items():
        print(f"  - {key}: {value}")

Example Output

Diagnostic Test Results:
  - Dummy Model Gap: 0.025000000000000022
  - Chosen Model Gap: 0.01875000000000001
  - Skill Gap: 0.15000000000000002
  - Has Skill: True
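
The same diagnostic_test function defined above can be reused for a regression problem by passing problem_type="regression" and a higher-is-better metric such as R^2, which keeps the skill comparison valid. The usage sketch below is illustrative; the synthetic dataset and LinearRegression model are assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = diagnostic_test(
    train=(X_train, y_train),
    test=(X_test, y_test),
    problem_type="regression",
    metric=r2_score,  # higher-is-better metric keeps the skill comparison valid
    k=10,
    chosen_model=LinearRegression(),
)

print("Regression Diagnostic Test Results:")
for key, value in results.items():
    print(f"  - {key}: {value}")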