Check Performance Central Tendencies
Procedure
This procedure evaluates whether model performance generalizes from the train set to the test set by checking that the distributions of performance metrics across the two sets are statistically consistent.
Evaluate Performance on Train Set
- What to do: Apply the chosen suite of machine learning algorithms to the train set using the selected test harness (see the sketch after this step).
- Use methods like 10-fold cross-validation for robust performance evaluation.
- Record performance metrics (e.g., accuracy, F1-score, RMSE) for each algorithm.
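A minimal sketch of this step, using a synthetic dataset, a single illustrative algorithm, and accuracy as the metric; the dataset, split parameters, and model choice are assumptions for demonstration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 10-fold cross-validation on the train set for a single algorithm
train_cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_train, y_train,
                                  scoring="accuracy", cv=10)
print(f"Train CV accuracy: {train_cv_scores.mean():.3f} +/- {train_cv_scores.std():.3f}")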
Evaluate Performance on Test Set
- What to do: Apply the same algorithms to the test set (see the sketch after this step).
- Use the same performance metrics as in the train set evaluation.
- Verify that no data leakage has occurred between the train and test sets.
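A minimal sketch of this step, reusing the synthetic split from the previous sketch; the row-comparison leakage check is one simple illustration, not the only way to verify a split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same algorithm, metric, and harness as on the train set, applied to the test set
test_cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_test, y_test,
                                 scoring="accuracy", cv=10)
print(f"Test CV accuracy: {test_cv_scores.mean():.3f}")

# Simple leakage check: no feature row should appear in both splits
train_rows = {tuple(row) for row in np.round(X_train, 8)}
test_rows = {tuple(row) for row in np.round(X_test, 8)}
print(f"Rows shared between train and test: {len(train_rows & test_rows)}")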
Gather Performance Metrics
- What to do: Collect the performance scores for each algorithm from the train and test sets (see the sketch after this step).
- Organize the metrics in a clear format, such as a table with rows for algorithms and columns for train and test scores.
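A minimal sketch of one way to organize the gathered scores, assuming pandas is available; the algorithm names and score values are hypothetical placeholders, not real results.
import pandas as pd

# Hypothetical scores collected in the previous steps (placeholders for illustration)
results = pd.DataFrame({
    "algorithm": ["LogisticRegression", "DecisionTree", "KNeighbors"],
    "train_score": [0.86, 0.81, 0.84],
    "test_score": [0.84, 0.78, 0.82],
})
print(results.to_string(index=False))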
Assess Distribution of Performance Scores
- What to do: Analyze the distribution of performance scores for the train and test sets (see the sketch after this step).
- Note: We could assess the distributions of scores per algorithm and/or across algorithms.
- Check if the distributions are normal using statistical tests like Shapiro-Wilk or visual tools like histograms.
- If normally distributed, consider parametric tests (e.g., t-test). If non-normal, use non-parametric tests (e.g., Mann-Whitney U Test).
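A minimal sketch of the normality check, using hypothetical per-algorithm mean scores; a histogram drawn with matplotlib could complement or replace the Shapiro-Wilk test.
from scipy.stats import shapiro

# Hypothetical per-algorithm mean scores (placeholders for illustration)
train_scores = [0.86, 0.81, 0.84, 0.88, 0.90, 0.89, 0.55, 0.52]
test_scores = [0.84, 0.78, 0.82, 0.85, 0.88, 0.87, 0.55, 0.50]

_, p_train = shapiro(train_scores)
_, p_test = shapiro(test_scores)
parametric = p_train > 0.05 and p_test > 0.05
print("Use a t-test" if parametric else "Use the Mann-Whitney U Test")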
Perform Statistical Test on Distributions
- What to do: Conduct a statistical test to compare the central tendencies of the train and test performance scores (see the sketch after this step).
- If using a t-test, ensure the equal-variance assumption is met; otherwise, use Welch's t-test.
- For a non-parametric comparison, use the Mann-Whitney U Test, which compares the ranks of the scores to detect significant differences.
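A minimal sketch of both test options, reusing the hypothetical score lists from the previous sketch; in practice only the test selected by the normality check would be reported.
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical per-algorithm mean scores (placeholders for illustration)
train_scores = [0.86, 0.81, 0.84, 0.88, 0.90, 0.89, 0.55, 0.52]
test_scores = [0.84, 0.78, 0.82, 0.85, 0.88, 0.87, 0.55, 0.50]

# Welch's t-test does not assume equal variances
t_stat, t_p = ttest_ind(train_scores, test_scores, equal_var=False)
# Mann-Whitney U compares the ranks of the two samples
u_stat, u_p = mannwhitneyu(train_scores, test_scores)
print(f"Welch's t-test p-value: {t_p:.3f}")
print(f"Mann-Whitney U p-value: {u_p:.3f}")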
Interpret Test Results
- What to do: Assess the p-value from the statistical test (see the sketch after this step).
- A high p-value (e.g., > 0.05) suggests that train and test distributions are statistically similar, indicating effective generalization.
- A low p-value (e.g., < 0.05) may indicate issues with the train/test split or overfitting, requiring further investigation.
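A minimal sketch of the decision rule, with a hypothetical p-value and the conventional 0.05 threshold.
alpha = 0.05       # conventional significance threshold
p_value = 0.42     # hypothetical result from the previous step

if p_value > alpha:
    print("Distributions look statistically similar: the split appears to generalize.")
else:
    print("Central tendencies differ significantly: investigate the split or possible overfitting.")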
Report Findings
- What to do: Summarize the outcomes and implications of the statistical test (see the sketch after this step).
- Present the performance metrics and statistical test results in a concise format.
- Highlight any discrepancies and suggest next steps, such as re-examining data preprocessing, revisiting model selection, or adjusting the train/test split.
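A minimal sketch of a report summary; all values are hypothetical and simply mirror the dictionary returned by the full code example below.
# Hypothetical summary values for illustration only
report = {
    "Test Method": "Mann-Whitney U Test",
    "p-value": 0.42,
    "Conclusion": "No significant difference between train and test score distributions.",
    "Next Steps": "None required; proceed with model selection.",
}
for key, value in report.items():
    print(f"{key}: {value}")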
Code Example
This Python function evaluates the quality of a train/test split by comparing the central tendencies of model performance metrics across a suite of machine learning algorithms.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import ttest_ind, mannwhitneyu, shapiro
def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing the central tendencies
    of model performance scores using statistical tests.

    Parameters:
        X_train, y_train: Training features and target.
        X_test, y_test: Testing features and target.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Summary of test results, including statistical comparison and interpretation.
    """
    # Define a list of machine learning algorithms
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified"),
    ]

    train_scores = []
    test_scores = []

    # Evaluate each algorithm on train and test sets
    for model in algorithms:
        # Cross-validation on train set
        train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_scores))
        # Cross-validation on test set
        test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_scores))

    # Check normality of scores
    _, p_train_normal = shapiro(train_scores)
    _, p_test_normal = shapiro(test_scores)

    # Choose appropriate statistical test based on normality
    alpha = 0.05
    if p_train_normal > alpha and p_test_normal > alpha:
        # Welch's variant is used since equal variances are not assumed
        test_method = "Welch's t-test"
        stat, p_value = ttest_ind(train_scores, test_scores, equal_var=False)
    else:
        test_method = "Mann-Whitney U Test"
        stat, p_value = mannwhitneyu(train_scores, test_scores)

    # Interpret the results
    significance = "significant" if p_value <= alpha else "not significant"
    if significance == "significant":
        interpretation = (
            "The central tendencies of train and test scores are significantly different, indicating potential issues with the train/test split."
        )
    else:
        interpretation = (
            "The central tendencies of train and test scores are not significantly different, suggesting the train/test split is likely appropriate."
        )

    return {
        "Test Method": test_method,
        "Statistic Value": stat,
        "p-value": p_value,
        "Significance": significance,
        "Interpretation": interpretation,
        "N Models Evaluated": len(algorithms),
    }
# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform diagnostic evaluation
    results = evaluate_train_test_split(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")
Example Output
Train/Test Split Diagnostic Results:
Test Method: Mann-Whitney U Test
Statistic Value: 31.0
p-value: 0.9579672518055619
Significance: not significant
Interpretation: The central tendencies of train and test scores are not significantly different, suggesting the train/test split is likely appropriate.
N Models Evaluated: 8