Compare Score Variances
Procedure
This procedure evaluates whether model performance is consistent between the train and test sets by checking that the two score distributions have comparable variance. It helps verify that the train/test split supports generalization to new data.
Evaluate Performance on Train Set
- What to do: Apply the suite of algorithms to the train set using the chosen test harness.
- Use methods like 10-fold cross-validation for consistent evaluation.
- Record the performance metric (e.g., accuracy, RMSE) for each algorithm (a minimal sketch follows below).
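The snippet below is a minimal sketch of this step, assuming accuracy as the metric and a small two-algorithm suite; the synthetic dataset and variable names are placeholders for your own data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; substitute your own train set
X, y = make_classification(n_samples=300, random_state=0)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

train_scores = []
for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier()]:
    # 10-fold cross-validation on the train set only
    cv_scores = cross_val_score(model, X_train, y_train, scoring="accuracy", cv=10)
    train_scores.append(cv_scores.mean())  # record one summary score per algorithm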
Evaluate Performance on Test Set
- What to do: Apply the same suite of algorithms to the test set using the same test harness.
- Ensure no data leakage between the train and test datasets (a quick duplicate-row check is sketched below).
- Record the performance metric for each algorithm.
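One hedged way to spot leakage is to confirm that no identical rows appear in both splits; this catches exact duplicates only, assumes array-like inputs, and the helper name has_row_overlap is illustrative.
import numpy as np

def has_row_overlap(X_train, X_test):
    # Collect each training row as a tuple and look for exact matches in the test set
    train_rows = {tuple(row) for row in np.asarray(X_train)}
    return any(tuple(row) in train_rows for row in np.asarray(X_test))

# Toy example guard before evaluation
X_train = np.array([[1.0, 2.0], [3.0, 4.0]])
X_test = np.array([[5.0, 6.0]])
assert not has_row_overlap(X_train, X_test), "possible leakage: shared rows"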
Gather Performance Scores
- What to do: Collect the performance scores from train and test evaluations for all algorithms.
- Organize the results in a structured format (e.g., a table with rows for algorithms and columns for train and test scores), as sketched below.
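A structured layout is straightforward with pandas (assumed available); the algorithm names and score values below are placeholders standing in for the results of the previous two steps.
import pandas as pd

scores = pd.DataFrame(
    {
        "train_score": [0.91, 0.88, 0.86],  # mean CV score per algorithm (placeholder)
        "test_score": [0.89, 0.84, 0.85],
    },
    index=["LogisticRegression", "DecisionTree", "KNeighbors"],
)
print(scores)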
Check Variance Assumptions
- What to do: Test whether the performance scores on the train and test sets have the same variance.
- If scores are normally distributed, use a parametric test such as the F-test.
- If scores are not normal, use a non-parametric test such as Levene’s test. The decision rule is sketched below.
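A sketch of the decision rule using scipy.stats: shapiro screens both samples for normality, Levene’s test is the robust fallback, and the two-sample F-test is computed directly from the sample variances because SciPy provides no ready-made helper for it. The score lists are placeholders.
import numpy as np
from scipy.stats import f, levene, shapiro

train_scores = [0.91, 0.88, 0.85, 0.93, 0.79]  # placeholder per-algorithm scores
test_scores = [0.89, 0.84, 0.83, 0.90, 0.76]

# Proceed parametrically only if neither Shapiro-Wilk test rejects normality
normal = (shapiro(train_scores).pvalue > 0.05) and (shapiro(test_scores).pvalue > 0.05)
if normal:
    # Two-sided F-test: ratio of sample variances against an F distribution
    f_stat = np.var(train_scores, ddof=1) / np.var(test_scores, ddof=1)
    dfn, dfd = len(train_scores) - 1, len(test_scores) - 1
    p_value = 2 * min(f.cdf(f_stat, dfn, dfd), f.sf(f_stat, dfn, dfd))
else:
    # Non-parametric alternative when normality is doubtful
    _, p_value = levene(train_scores, test_scores)
print(f"p-value: {p_value:.4f}")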
Interpret Variance Test Results
- What to do: Analyze the variance test outcome to determine consistency between the train and test distributions.
- If variances are similar, the train and test score distributions are consistent; a read-out sketch follows below.
- Significant variance differences may indicate issues with data preprocessing or model overfitting.
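As a hedged illustration of the read-out, the raw variances can be reported alongside the test decision; alpha, the p-value, and the score lists are placeholders carried over from the sketch above.
import numpy as np

alpha = 0.05
p_value = 0.80  # e.g., from Levene's test in the previous step (placeholder)
train_var = np.var([0.91, 0.88, 0.85, 0.93, 0.79], ddof=1)
test_var = np.var([0.89, 0.84, 0.83, 0.90, 0.76], ddof=1)
print(f"train variance: {train_var:.4f}, test variance: {test_var:.4f}")
if p_value <= alpha:
    print("Variances differ significantly: check preprocessing and the split.")
else:
    print("Variances are consistent: the split looks sound for this suite.")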
Interpret and Comment on Results
- What to do: Evaluate the alignment of train and test set distributions.
- If distributions align, conclude that the train/test split supports generalization.
- If distributions differ significantly, suggest potential issues (e.g., data leakage, insufficient data, or poor model selection).
Report Findings
- What to do: Summarize the diagnostic test results and implications.
- Provide variance test results and distribution comparisons.
- Highlight any discrepancies and recommend corrective actions, such as improving data preprocessing or revisiting the train/test split.
Code Example
This Python function evaluates the quality of a train/test split by analyzing the variances of model performance metrics across a suite of machine learning algorithms using k-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import levene
def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing the variances
    of model performance metrics.

    Parameters:
        X_train, y_train: Training features and target.
        X_test, y_test: Testing features and target.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Results including variance comparison and interpretation.
    """
    # Define a suite of algorithms to evaluate
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified"),
    ]
    train_scores = []
    test_scores = []
    # Evaluate each algorithm
    for model in algorithms:
        # Cross-validate on train set
        train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_scores))
        # Cross-validate on test set
        test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_scores))
    # Compare variances using Levene's test
    stat, p_value = levene(train_scores, test_scores)
    # Interpret the results
    alpha = 0.05
    significance = "significant" if p_value <= alpha else "not significant"
    if significance == "significant":
        interpretation = "Poor train/test split: Significant difference in variances indicates potential issues."
    else:
        interpretation = "Good train/test split: Variances are consistent between train and test sets."
    return {
        "N": len(algorithms),
        "Variance Test Statistic": stat,
        "p-value": p_value,
        "Significance": significance,
        "Interpretation": interpretation,
    }
# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Perform the diagnostic test
    results = evaluate_train_test_split(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )
    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")
Example Output
Train/Test Split Diagnostic Results:
N: 8
Variance Test Statistic: 0.06434601354872317
p-value: 0.8034415845751293
Significance: not significant
Interpretation: Good train/test split: Variances are consistent between train and test sets.