Compare Score Effect Size
Procedure
This procedure checks whether performance on the train set is consistent with performance on the test set, which indicates that the two splits follow similar distributions. The diagnostic helps confirm that the train/test split is appropriate for the predictive modeling task.
Evaluate Performance on Train Set
- What to do: Apply the suite of algorithms to the train set using the chosen test harness.
- Use methods like 10-fold cross-validation to evaluate performance.
- Record the performance metrics (e.g., accuracy, RMSE) for each algorithm; see the sketch after this list.
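A minimal sketch of this step, assuming scikit-learn classifiers, 10-fold cross-validation, and accuracy as the metric; the model suite and the helper name score_on_train are illustrative assumptions, not prescribed by the procedure.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Illustrative suite of algorithms; extend with any estimators of interest
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "knn": KNeighborsClassifier(),
}

def score_on_train(models, X_train, y_train, k=10):
    """Return the mean k-fold cross-validation score for each model on the train set only."""
    return {
        name: np.mean(cross_val_score(model, X_train, y_train, scoring="accuracy", cv=k))
        for name, model in models.items()
    }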
Evaluate Performance on Test Set
- What to do: Test the same suite of algorithms on the test set using the same performance metric.
- Ensure strict separation between train and test data to avoid data leakage.
- Record the performance metrics for each algorithm on the test set; see the sketch after this list.
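One common way to realize this step while keeping the two sets strictly separate is to fit each model on the full train set and score it once on the held-out test set, as sketched below; the helper name, the accuracy metric, and the dict of unfitted estimators (like the models dict in the previous sketch) are assumptions of this sketch. The full example at the end of this section instead applies the same k-fold harness to the test set.
from sklearn.metrics import accuracy_score

def score_on_test(models, X_train, y_train, X_test, y_test):
    """Fit each model on the train set, then score it once on the untouched test set."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores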
Gather Performance Scores
- What to do: Compile the performance metrics for both the train and test sets.
- Organize the data into a structured format, such as a table with algorithms as rows and train/test scores as columns; see the sketch after this list.
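A minimal sketch of this step, assuming the two per-algorithm score dictionaries produced above and using pandas purely for tabulation; the function name is an assumption of this sketch.
import pandas as pd

def gather_scores(train_scores, test_scores):
    """Combine per-algorithm train and test scores into one table (algorithms as rows)."""
    return pd.DataFrame({"train_score": train_scores, "test_score": test_scores})

# e.g., gather_scores(score_on_train(models, X_train, y_train),
#                     score_on_test(models, X_train, y_train, X_test, y_test))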
Measure Effect Size Between Train and Test Scores
- What to do: Quantify the difference between train and test performance metrics.
- Use effect size measures like Cohen’s d (for parametric distributions) or Cliff’s delta (for non-parametric distributions).
- Calculate the effect size for each algorithm; see the sketch after this list.
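Both measures named above can be computed with a few lines of NumPy, as sketched below; the function names and the equal-group-size pooled-standard-deviation form of Cohen's d are assumptions of this sketch.
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized difference of means using a pooled standard deviation.

    Assumes equal-sized score samples (e.g., the same number of CV folds) and
    non-identical scores, so the pooled standard deviation is greater than zero.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_std = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    return (np.mean(a) - np.mean(b)) / pooled_std

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs; non-parametric, in [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    greater = sum(np.sum(x > b) for x in a)
    less = sum(np.sum(x < b) for x in a)
    return (greater - less) / (len(a) * len(b))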
Check for Consistency in Distributions
- What to do: Analyze the effect size results across algorithms.
- A small effect size suggests consistent performance and similar distributions between train and test sets.
- A large effect size may indicate issues such as overfitting, a poor data split, or bias in the data; a simple consistency check is sketched after this list.
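A minimal sketch of such a check, assuming a dict of per-algorithm effect sizes (for example, computed with the cohens_d helper above); the 0.2 default follows the conventional cutoff for a small effect.
def distributions_consistent(effect_sizes, threshold=0.2):
    """Return True when every per-algorithm effect size falls below the chosen threshold."""
    return all(abs(d) < threshold for d in effect_sizes.values())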
Interpret the Results
- What to do: Assess the implications of the effect size measurements.
- If effect sizes are consistently small, conclude that train and test distributions align well.
- If large effect sizes are observed, identify potential causes such as inadequate preprocessing or imbalanced datasets.
Report the Findings
- What to do: Summarize the results and their implications for the train/test split.
- Present a table of train and test performance scores, alongside the effect size for each algorithm.
- Highlight any algorithms with significant discrepancies and suggest next steps, such as refining the dataset or the splitting strategy; see the sketch after this list.
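A minimal sketch of the reporting step, assuming per-algorithm mean scores and effect sizes computed as above; the function name, the pandas table, and the 0.8 flagging threshold are assumptions of this sketch.
import pandas as pd

def report_findings(train_means, test_means, effect_sizes, flag_threshold=0.8):
    """Tabulate train/test scores and effect size per algorithm, flagging large discrepancies."""
    table = pd.DataFrame({
        "train_score": train_means,
        "test_score": test_means,
        "effect_size": effect_sizes,
    })
    table["large_discrepancy"] = table["effect_size"].abs() >= flag_threshold
    return table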
Code Example
This Python function evaluates the quality of the train/test split by calculating the effect size between train and test performance scores across a suite of machine learning algorithms.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
"""
Evaluate the quality of the train/test split by calculating effect size
between train and test performance metrics.
Parameters:
X_train, y_train: Features and target for the training set.
X_test, y_test: Features and target for the testing set.
metric (callable): A scoring function (e.g., accuracy_score).
k (int): Number of folds for k-fold cross-validation.
Returns:
dict: Results containing the effect size and an interpretation of the train/test split quality.
"""
# Define a suite of algorithms for evaluation
algorithms = [
LogisticRegression(max_iter=1000),
DecisionTreeClassifier(),
KNeighborsClassifier(),
SVC(probability=True),
RandomForestClassifier(n_estimators=50),
GradientBoostingClassifier(),
DummyClassifier(strategy="most_frequent"),
DummyClassifier(strategy="stratified"),
]
train_scores = []
test_scores = []
# Evaluate each algorithm
for model in algorithms:
# Cross-validate on the train set
train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
train_scores.append(np.mean(train_cv_scores))
# Cross-validate on the test set
test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
test_scores.append(np.mean(test_cv_scores))
# Calculate effect size (Cohen's d)
train_mean = np.mean(train_scores)
test_mean = np.mean(test_scores)
pooled_std = np.sqrt((np.var(train_scores) + np.var(test_scores)) / 2)
effect_size = (train_mean - test_mean) / pooled_std
# Interpret the results
if abs(effect_size) < 0.2:
interpretation = "Negligible effect size: train/test distributions are similar."
elif abs(effect_size) < 0.5:
interpretation = "Small effect size: minor differences in train/test distributions."
elif abs(effect_size) < 0.8:
interpretation = "Medium effect size: moderate differences in train/test distributions."
else:
interpretation = "Large effect size: significant differences in train/test distributions."
return {
"Effect Size": effect_size,
"Interpretation": interpretation
}
# Demo Usage
if __name__ == "__main__":
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Generate synthetic dataset
X, y = make_classification(
n_samples=500, n_features=10, n_informative=5, random_state=42
)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform the diagnostic test
results = evaluate_train_test_split(
X_train, y_train,
X_test, y_test,
metric=accuracy_score,
k=10
)
# Print results
print("Train/Test Split Diagnostic Results:")
for key, value in results.items():
print(f"{key}: {value}")
Example Output
Train/Test Split Diagnostic Results:
Effect Size: 0.02057691144367342
Interpretation: Negligible effect size: train/test distributions are similar.