Compare Score Distributions
Procedure
This procedure assesses whether model performance scores on the train and test sets follow the same distribution, which is evidence that the model can be expected to generalize to new data. It helps verify the validity of the train/test split and surface potential issues such as data leakage or overfitting.
Evaluate Performance on Train Set
- What to do: Apply each algorithm in the suite to the train set using the chosen test harness.
- Use consistent evaluation methods, such as 10-fold cross-validation.
- Record performance metrics (e.g., accuracy, RMSE, F1-score) for each algorithm (a minimal sketch follows this list).
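As a minimal sketch of this step, assuming a classification task, accuracy as the metric, and a hypothetical small suite of scikit-learn estimators:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

# Hypothetical suite; use whichever algorithms your test harness already includes.
suite = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(),
    DummyClassifier(strategy="most_frequent"),
]

def score_on_train(X_train, y_train, k=10):
    # Mean k-fold cross-validation accuracy of each algorithm on the train set.
    return [
        np.mean(cross_val_score(model, X_train, y_train, scoring="accuracy", cv=k))
        for model in suite
    ]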
Evaluate Performance on Test Set
- What to do: Apply the same algorithms to the test set using the chosen evaluation metric.
- Ensure strict separation of train and test data to avoid data leakage.
- Record the performance scores for comparison with the train set (see the sketch below).
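Mirroring the full example at the end of this section, which cross-validates the held-out test partition rather than refitting on the train set, a matching sketch (same assumptions as above) might look like:

import numpy as np
from sklearn.model_selection import cross_val_score

def score_on_test(suite, X_test, y_test, k=10):
    # Cross-validate each algorithm on the test partition only; the test data
    # is never combined with the train data, preserving strict separation.
    return [
        np.mean(cross_val_score(model, X_test, y_test, scoring="accuracy", cv=k))
        for model in suite
    ]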
Gather Performance Scores
- What to do: Collect performance metrics from both train and test evaluations.
- Organize the results in a structured table with rows for each algorithm and columns for train and test scores, as shown below.
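One way to organize the gathered scores, assuming pandas is available; the algorithm names and score values below are placeholders for illustration, not measured results:

import pandas as pd

# Placeholder values purely for illustration.
scores = pd.DataFrame({
    "algorithm": ["LogisticRegression", "DecisionTree", "DummyMostFrequent"],
    "train_score": [0.86, 0.81, 0.52],
    "test_score": [0.84, 0.77, 0.50],
})
print(scores)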
Test Distribution Similarity
- What to do: Use statistical tests to compare the distributions of performance scores from the train and test sets (sketched below).
- Suitable choices include the two-sample Kolmogorov-Smirnov and Anderson-Darling tests; both are non-parametric, so they do not require the scores to be normally distributed.
- Null Hypothesis: Train and test scores come from the same distribution.
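A minimal sketch of the comparison, using SciPy's two-sample Kolmogorov-Smirnov test on the per-algorithm score lists gathered in the previous step:

from scipy.stats import ks_2samp

def compare_score_distributions(train_scores, test_scores):
    # H0: the train and test scores are drawn from the same distribution.
    statistic, p_value = ks_2samp(train_scores, test_scores)
    return statistic, p_value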
Interpret the Results
- What to do: Analyze the statistical test results (a small helper is sketched below).
- If the p-value is high (e.g., > 0.05), fail to reject the null hypothesis, suggesting similar distributions.
- If the p-value is low (e.g., ≤ 0.05), reject the null hypothesis, indicating possible issues with the train/test split.
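The interpretation then reduces to a threshold check against the chosen significance level, for example:

def interpret_p_value(p_value, alpha=0.05):
    # Translate the test outcome into a plain-language verdict.
    if p_value > alpha:
        return "Fail to reject H0: train and test score distributions look similar."
    return "Reject H0: the distributions differ; revisit the split or preprocessing."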
Investigate Discrepancies
- What to do: If the distributions differ, explore potential causes.
- Check for data leakage, inadequate train set size, or overfitting to specific features (one concrete leakage check is sketched below).
- Reassess preprocessing steps and splitting methods.
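As one deliberately narrow leakage check (this sketch assumes leaked rows are exact duplicates of train rows), the train and test feature matrices can be scanned for overlapping rows:

import numpy as np

def count_overlapping_rows(X_train, X_test):
    # Count test rows that also appear verbatim in the train set (possible leakage).
    train_rows = {row.tobytes() for row in np.ascontiguousarray(X_train)}
    return sum(row.tobytes() in train_rows for row in np.ascontiguousarray(X_test))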
Report the Findings
- What to do: Summarize the test outcomes.
- Include the statistical test results, p-values, and performance metrics for both sets.
- Provide interpretations and actionable recommendations based on the findings, such as refining the split strategy or improving data preprocessing.
Code Example
This Python function implements a diagnostic test to evaluate the quality of the train/test split by comparing the distributions of model performance metrics across a suite of machine learning algorithms.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import ks_2samp
def evaluate_split_distribution(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing performance metric distributions.

    Parameters:
        X_train, y_train: Training features and labels.
        X_test, y_test: Testing features and labels.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Results summarizing distribution comparison and split quality interpretation.
    """
    # Define a suite of machine learning algorithms
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified")
    ]

    train_scores = []
    test_scores = []

    # Evaluate each algorithm on train and test sets
    for model in algorithms:
        # Cross-validate on train set
        train_cv_score = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_score))

        # Cross-validate on test set
        test_cv_score = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_score))

    # Perform Kolmogorov-Smirnov test to compare distributions
    ks_stat, p_value = ks_2samp(train_scores, test_scores)

    # Interpret results
    alpha = 0.05
    if p_value > alpha:
        interpretation = "Train and test distributions are statistically similar. The split is likely adequate."
    else:
        interpretation = "Train and test distributions differ significantly. Consider revisiting the split or preprocessing."

    return {
        "KS Statistic": ks_stat,
        "p-value": p_value,
        "Significance Level": alpha,
        "Interpretation": interpretation
    }


# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform the diagnostic test
    results = evaluate_split_distribution(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")
Example Output
Train/Test Split Diagnostic Results:
KS Statistic: 0.375
p-value: 0.6601398601398599
Significance Level: 0.05
Interpretation: Train and test distributions are statistically similar. The split is likely adequate.