Compare Score Effect Size
Procedure
This procedure checks whether performance on the train set is consistent with performance on the test set, which indicates that the two splits follow similar distributions. The diagnostic helps confirm that the train/test split is appropriate for the predictive modeling task.
Evaluate Performance on Train Set
- What to do: Apply the suite of algorithms to the train set using the chosen test harness.
- Use methods like 10-fold cross-validation to evaluate performance.
- Record the performance metrics (e.g., accuracy, RMSE) for each algorithm; see the sketch after this list.
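A minimal sketch of this step, assuming scikit-learn classifiers, 10-fold cross-validation, and accuracy as the metric; the model suite and the helper name score_on_train are illustrative assumptions, not prescribed by the procedure.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Illustrative suite of algorithms; extend with any estimators of interest
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "knn": KNeighborsClassifier(),
}

def score_on_train(models, X_train, y_train, k=10):
    """Return the mean k-fold cross-validation score for each model on the train set only."""
    return {
        name: np.mean(cross_val_score(model, X_train, y_train, scoring="accuracy", cv=k))
        for name, model in models.items()
    }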
Evaluate Performance on Test Set
- What to do: Test the same suite of algorithms on the test set using the same performance metric.
- Ensure strict separation between train and test data to avoid data leakage.
- Record the performance metrics for each algorithm on the test set; see the sketch after this list.
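One common way to realize this step while keeping the two sets strictly separate is to fit each model on the full train set and score it once on the held-out test set, as sketched below; the helper name, the accuracy metric, and the dict of unfitted estimators (like the models dict in the previous sketch) are assumptions of this sketch. The full example at the end of this section instead applies the same k-fold harness to the test set.
from sklearn.metrics import accuracy_score

def score_on_test(models, X_train, y_train, X_test, y_test):
    """Fit each model on the train set, then score it once on the untouched test set."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores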
Gather Performance Scores
- What to do: Compile the performance metrics for both the train and test sets.
- Organize the data into a structured format, such as a table with algorithms as rows and train/test scores as columns; see the sketch after this list.
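A minimal sketch of this step, assuming the two per-algorithm score dictionaries produced above and using pandas purely for tabulation; the function name is an assumption of this sketch.
import pandas as pd

def gather_scores(train_scores, test_scores):
    """Combine per-algorithm train and test scores into one table (algorithms as rows)."""
    return pd.DataFrame({"train_score": train_scores, "test_score": test_scores})

# e.g., gather_scores(score_on_train(models, X_train, y_train),
#                     score_on_test(models, X_train, y_train, X_test, y_test))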
Measure Effect Size Between Train and Test Scores
- What to do: Quantify the difference between train and test performance metrics.
- Use effect size measures like Cohen’s d (for parametric distributions) or Cliff’s delta (for non-parametric distributions).
- Calculate the effect size for each algorithm; see the sketch after this list.
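Both measures named above can be computed with a few lines of NumPy, as sketched below; the function names and the equal-group-size pooled-standard-deviation form of Cohen's d are assumptions of this sketch.
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized difference of means using a pooled standard deviation.

    Assumes equal-sized score samples (e.g., the same number of CV folds) and
    non-identical scores, so the pooled standard deviation is greater than zero.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_std = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    return (np.mean(a) - np.mean(b)) / pooled_std

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs; non-parametric, in [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    greater = sum(np.sum(x > b) for x in a)
    less = sum(np.sum(x < b) for x in a)
    return (greater - less) / (len(a) * len(b))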
Check for Consistency in Distributions
- What to do: Analyze the effect size results across algorithms.
- A small effect size suggests consistent performance and similar distributions between train and test sets.
- A large effect size may indicate issues such as overfitting, a poor data split, or bias in the data; a simple consistency check is sketched after this list.
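A minimal sketch of such a check, assuming a dict of per-algorithm effect sizes (for example, computed with the cohens_d helper above); the 0.2 default follows the conventional cutoff for a small effect.
def distributions_consistent(effect_sizes, threshold=0.2):
    """Return True when every per-algorithm effect size falls below the chosen threshold."""
    return all(abs(d) < threshold for d in effect_sizes.values())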
Interpret the Results
- What to do: Assess the implications of the effect size measurements.
- If effect sizes are consistently small, conclude that train and test distributions align well.
- If large effect sizes are observed, identify potential causes such as inadequate preprocessing or imbalanced datasets.
Report the Findings
- What to do: Summarize the results and their implications for the train/test split.
- Present a table of train and test performance scores, alongside the effect size for each algorithm.
- Highlight any algorithms with significant discrepancies and suggest next steps, such as refining the dataset or the splitting strategy; see the sketch after this list.
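A minimal sketch of the reporting step, assuming per-algorithm mean scores and effect sizes computed as above; the function name, the pandas table, and the 0.8 flagging threshold are assumptions of this sketch.
import pandas as pd

def report_findings(train_means, test_means, effect_sizes, flag_threshold=0.8):
    """Tabulate train/test scores and effect size per algorithm, flagging large discrepancies."""
    table = pd.DataFrame({
        "train_score": train_means,
        "test_score": test_means,
        "effect_size": effect_sizes,
    })
    table["large_discrepancy"] = table["effect_size"].abs() >= flag_threshold
    return table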
Code Example
This Python function evaluates the quality of the train/test split by calculating the effect size between train and test performance scores across a suite of machine learning algorithms.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
"""
Evaluate the quality of the train/test split by calculating effect size
between train and test performance metrics.
Parameters:
X_train, y_train: Features and target for the training set.
X_test, y_test: Features and target for the testing set.
metric (callable): A scoring function (e.g., accuracy_score).
k (int): Number of folds for k-fold cross-validation.
Returns:
dict: Results containing the effect size and an interpretation of the train/test split quality.
"""
# Define a suite of algorithms for evaluation
algorithms = [
LogisticRegression(max_iter=1000),
DecisionTreeClassifier(),
KNeighborsClassifier(),
SVC(probability=True),
RandomForestClassifier(n_estimators=50),
GradientBoostingClassifier(),
DummyClassifier(strategy="most_frequent"),
DummyClassifier(strategy="stratified"),
]
train_scores = []
test_scores = []
# Evaluate each algorithm
for model in algorithms:
# Cross-validate on the train set
train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
train_scores.append(np.mean(train_cv_scores))
# Cross-validate on the test set
test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
test_scores.append(np.mean(test_cv_scores))
# Calculate effect size (Cohen's d)
train_mean = np.mean(train_scores)
test_mean = np.mean(test_scores)
pooled_std = np.sqrt((np.var(train_scores) + np.var(test_scores)) / 2)
effect_size = (train_mean - test_mean) / pooled_std
# Interpret the results
if abs(effect_size) < 0.2:
interpretation = "Negligible effect size: train/test distributions are similar."
elif abs(effect_size) < 0.5:
interpretation = "Small effect size: minor differences in train/test distributions."
elif abs(effect_size) < 0.8:
interpretation = "Medium effect size: moderate differences in train/test distributions."
else:
interpretation = "Large effect size: significant differences in train/test distributions."
return {
"Effect Size": effect_size,
"Interpretation": interpretation
}
# Demo Usage
if __name__ == "__main__":
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Generate synthetic dataset
X, y = make_classification(
n_samples=500, n_features=10, n_informative=5, random_state=42
)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform the diagnostic test
results = evaluate_train_test_split(
X_train, y_train,
X_test, y_test,
metric=accuracy_score,
k=10
)
# Print results
print("Train/Test Split Diagnostic Results:")
for key, value in results.items():
print(f"{key}: {value}")
Example Output
Train/Test Split Diagnostic Results:
Effect Size: 0.02057691144367342
Interpretation: Negligible effect size: train/test distributions are similar.