Model Has Skill
Is the difference between the train and test set scores caused by the chosen model lacking skill on the problem?
Procedure
- Establish a Baseline Model
  - Create a simple baseline model (e.g., predict the mean, median, or mode of the target variable).
  - Evaluate the baseline model’s performance on the test set using relevant metrics (e.g., RMSE for regression, accuracy for classification); a minimal sketch follows this step.
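A minimal sketch of this step, using scikit-learn's DummyClassifier on a small synthetic classification problem (the dataset and variable names are illustrative only, not part of the procedure itself):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset and train/test split
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: always predict the most frequent class (the mode of y_train)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_test_score = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline test accuracy: {baseline_test_score:.3f}")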
- Train and Evaluate the Chosen Model
  - Train the chosen model using the training set.
  - Evaluate the model on the test harness (e.g., k-fold cross-validation or a validation set), as sketched below.
  - Record the evaluation metrics such as accuracy, F1 score, or R^2.
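Continuing with the variables from the previous sketch, a test-harness estimate via 10-fold cross-validation might look like this (logistic regression is used purely as an example of a chosen model):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Test-harness estimate: 10-fold cross-validation on the training set only
chosen_model = LogisticRegression()
cv_scores = cross_val_score(chosen_model, X_train, y_train, cv=10, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")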
- Evaluate on the Test Set
  - Use the test set to assess the chosen model’s performance on unseen data (see the sketch below).
  - Record the same metrics used during validation.
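Reusing the same illustrative variables, the held-out test-set score is recorded with the same metric used during validation:

from sklearn.metrics import accuracy_score

# Fit on the full training set, then score once on the untouched test set
chosen_model.fit(X_train, y_train)
chosen_test_score = accuracy_score(y_test, chosen_model.predict(X_test))
print(f"Test accuracy: {chosen_test_score:.3f}")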
- Compare Against Baseline
  - Compare the chosen model’s performance to the baseline model:
    - Does the chosen model outperform the baseline model on both the test harness and the test set?
    - Calculate the percentage improvement of the chosen model over the baseline (see the sketch after this step).
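Using the scores recorded in the sketches above, the comparison and percentage improvement can be computed directly (this assumes a metric where higher is better, such as accuracy; for error metrics like RMSE the comparison would be reversed):

# Does the chosen model beat the baseline on the test set?
outperforms = chosen_test_score > baseline_test_score

# Percentage improvement over the baseline (higher-is-better metric assumed)
improvement_pct = 100.0 * (chosen_test_score - baseline_test_score) / baseline_test_score
print(f"Outperforms baseline: {outperforms}, improvement: {improvement_pct:.1f}%")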
- Quantify the Performance Gap
  - Calculate the performance gap between:
    - The test harness (e.g., cross-validation or validation set) and the test set for the chosen model.
    - The same quantities for the baseline model, if relevant.
  - Highlight significant gaps that suggest overfitting or underfitting. A sketch of the gap calculation follows this step.
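A sketch of the gap calculation, again reusing the variables from the earlier sketches; the baseline gap is included for reference:

import numpy as np
from sklearn.model_selection import cross_val_score

# Gap between the test-harness estimate (mean CV score) and the test-set score
chosen_gap = abs(np.mean(cv_scores) - chosen_test_score)

# Same quantity for the baseline model
baseline_cv_scores = cross_val_score(baseline, X_train, y_train, cv=10, scoring="accuracy")
baseline_gap = abs(baseline_cv_scores.mean() - baseline_test_score)
print(f"Chosen model gap: {chosen_gap:.3f}, baseline gap: {baseline_gap:.3f}")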
- Summarize Findings
  - Report whether the chosen model has empirical evidence of outperforming the baseline model.
  - Highlight whether the performance gap warrants further investigation or whether the model’s performance is stable across evaluation methods.
Limitations
- Baseline Model Assumptions
  - The choice of the baseline model impacts the comparison; a poor baseline may overstate the chosen model’s skill.
- Dataset Size and Quality
  - Small or low-quality datasets can lead to unstable performance estimates and misleading comparisons.
- Metric Sensitivity
  - Results depend on the selected evaluation metric. Ensure the metric aligns with the business or research goal (see the illustration below).
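A contrived illustration of this point: on an imbalanced problem, accuracy and F1 score can tell very different stories about the same predictions.

from sklearn.metrics import accuracy_score, f1_score

# A model that always predicts the majority class on a 95/5 imbalanced problem
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(accuracy_score(y_true, y_pred))             # 0.95 -- looks strong
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- no skill on the minority class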
- Test Set Coverage
  - If the test set does not adequately represent the real-world scenario, the observed performance gap may not generalize.
- Overfitting to Validation
  - Excessive tuning on the validation set can inflate performance metrics, leading to an inaccurate gap analysis.
Code Example
The following Python function implements a diagnostic test that measures the generalization gap for both a chosen model and a dummy baseline, and assesses whether the chosen model outperforms that baseline.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier, DummyRegressor


def diagnostic_test(train, test, problem_type, metric, k, chosen_model):
    """
    Perform a diagnostic test to evaluate the generalization gap and model skill.

    Parameters:
        train (tuple): Tuple containing (X_train, y_train).
        test (tuple): Tuple containing (X_test, y_test).
        problem_type (str): "classification" or "regression".
        metric: Performance metric function compatible with scikit-learn.
        k (int): Number of folds for cross-validation.
        chosen_model: The machine learning model to evaluate.

    Returns:
        dict: A dictionary containing performance gaps and skill assessment.
    """
    X_train, y_train = train
    X_test, y_test = test

    # Select appropriate dummy model
    if problem_type == "classification":
        dummy_model = DummyClassifier(strategy="most_frequent")
    elif problem_type == "regression":
        dummy_model = DummyRegressor(strategy="mean")
    else:
        raise ValueError("Invalid problem type. Choose 'classification' or 'regression'.")

    # Evaluate dummy model
    dummy_cv_scores = cross_val_score(dummy_model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    dummy_model.fit(X_train, y_train)
    dummy_test_score = metric(y_test, dummy_model.predict(X_test))
    dummy_gap = np.abs(np.mean(dummy_cv_scores) - dummy_test_score)

    # Evaluate chosen model
    chosen_cv_scores = cross_val_score(chosen_model, X_train, y_train, cv=k, scoring=make_scorer(metric))
    chosen_model.fit(X_train, y_train)
    chosen_test_score = metric(y_test, chosen_model.predict(X_test))
    chosen_gap = np.abs(np.mean(chosen_cv_scores) - chosen_test_score)

    # Compare skill (assumes higher metric values are better, e.g., accuracy or R^2;
    # for error metrics such as RMSE, the sign of this comparison must be flipped)
    skill_gap = chosen_test_score - dummy_test_score
    has_skill = skill_gap > 0

    return {
        "Dummy Model Gap": dummy_gap,
        "Chosen Model Gap": chosen_gap,
        "Skill Gap": skill_gap,
        "Has Skill": has_skill,
    }


# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define problem type, metric, and chosen model
    problem_type = "classification"
    metric = accuracy_score
    chosen_model = LogisticRegression()

    # Run the diagnostic test
    results = diagnostic_test(
        train=(X_train, y_train),
        test=(X_test, y_test),
        problem_type=problem_type,
        metric=metric,
        k=10,
        chosen_model=chosen_model
    )

    # Print Results
    print("Diagnostic Test Results:")
    for key, value in results.items():
        print(f" - {key}: {value}")
Example Output
Diagnostic Test Results:
- Dummy Model Gap: 0.025000000000000022
- Chosen Model Gap: 0.01875000000000001
- Skill Gap: 0.15000000000000002
- Has Skill: True