Compare Score Variances

Procedure

This procedure checks whether model performance scores on the train and test sets have consistent spread, i.e., whether the two score distributions align. It helps verify that the train/test split supports generalization to new data.

  1. Evaluate Performance on Train Set

    • What to do: Apply the suite of algorithms to the train set using the chosen test harness.
      • Use methods like 10-fold cross-validation for consistent evaluation.
      • Record the performance metric (e.g., accuracy, RMSE) for each algorithm.
  2. Evaluate Performance on Test Set

    • What to do: Apply the same suite of algorithms to the test set using the same test harness.
      • Ensure no data leakage between train and test datasets.
      • Record the performance metric for each algorithm.
  3. Gather Performance Scores

    • What to do: Collect the performance scores from train and test evaluations for all algorithms.
      • Organize the results in a structured format (e.g., a table with rows for algorithms and columns for train and test scores); a minimal table sketch appears immediately after this procedure.
  4. Check Variance Assumptions

    • What to do: Test if the performance scores on train and test sets have the same variance.
      • If the scores are approximately normally distributed, use a parametric test such as the F-test.
      • If the scores depart from normality, use Levene’s test, which is robust to non-normal data (see the sketch after the example output below).
  5. Interpret Variance Test Results

    • What to do: Analyze the variance test outcome to determine consistency between train and test distributions.
      • If variances are similar, it suggests the train and test score distributions are consistent with each other.
      • Significant variance differences may indicate issues with data preprocessing or model overfitting.
  6. Interpret and Comment on Results

    • What to do: Evaluate the alignment of train and test set distributions.
      • If distributions align, conclude that the train/test split supports generalization.
      • If distributions differ significantly, suggest potential issues (e.g., data leakage, insufficient data, or poor model selection).
  7. Report Findings

    • What to do: Summarize the diagnostic test results and implications.
      • Provide variance test results and distribution comparisons.
      • Highlight any discrepancies and recommend corrective actions, such as improving data preprocessing or revisiting the train/test split.
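
A minimal sketch of one way to organize the gathered scores from Step 3, assuming pandas is available; the algorithm names and score values below are placeholders, not results from any specific run:

import pandas as pd

# Hypothetical mean CV scores, one row per algorithm (placeholder values)
scores_table = pd.DataFrame({
    "Algorithm": ["LogisticRegression", "DecisionTree", "KNeighbors"],
    "Train Score": [0.87, 0.82, 0.80],
    "Test Score": [0.85, 0.78, 0.79],
})
print(scores_table.to_string(index=False))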

Code Example

This Python function evaluates the quality of a train/test split by comparing the variance of mean k-fold cross-validation scores across a suite of machine learning algorithms on the train set against the corresponding variance on the test set.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import levene

def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing the variances of model performance metrics.

    Parameters:
        X_train, y_train: Training features and target.
        X_test, y_test: Testing features and target.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Results including variance comparison and interpretation.
    """
    # Define a suite of algorithms to evaluate
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified"),
    ]

    train_scores = []
    test_scores = []

    # Evaluate each algorithm
    for model in algorithms:
        # Cross-validate on train set
        train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_scores))

        # Cross-validate on test set
        test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_scores))

    # Compare variances using Levene's test
    stat, p_value = levene(train_scores, test_scores)

    # Interpret the results
    alpha = 0.05
    significance = "significant" if p_value <= alpha else "not significant"
    if significance == "significant":
        interpretation = (
            "Poor train/test split: Significant difference in variances indicates potential issues."
        )
    else:
        interpretation = "Good train/test split: Variances are consistent between train and test sets."

    return {
        "N": len(algorithms),
        "Variance Test Statistic": stat,
        "p-value": p_value,
        "Significance": significance,
        "Interpretation": interpretation,
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform the diagnostic test
    results = evaluate_train_test_split(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")

Example Output

Train/Test Split Diagnostic Results:
N: 8
Variance Test Statistic: 0.06434601354872317
p-value: 0.8034415845751293
Significance: not significant
Interpretation: Good train/test split: Variances are consistent between train and test sets.
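
As an extension of Step 4, the sketch below illustrates one way to choose between the F-test and Levene’s test based on a Shapiro–Wilk normality check. The helper compare_variances and the score lists are hypothetical and not part of the function above; the two-sided F-test p-value is computed from the variance ratio under the assumption of normal scores.

import numpy as np
from scipy.stats import shapiro, levene, f

def compare_variances(train_scores, test_scores, alpha=0.05):
    """Illustrative sketch: pick a variance test based on a normality check."""
    train_scores = np.asarray(train_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)

    # Shapiro-Wilk normality check on each group of scores
    normal = (shapiro(train_scores).pvalue > alpha
              and shapiro(test_scores).pvalue > alpha)

    if normal:
        # Parametric F-test: ratio of sample variances with a two-sided p-value
        stat = np.var(train_scores, ddof=1) / np.var(test_scores, ddof=1)
        df1, df2 = len(train_scores) - 1, len(test_scores) - 1
        p_value = 2 * min(f.cdf(stat, df1, df2), f.sf(stat, df1, df2))
        test_name = "F-test"
    else:
        # Levene's test is robust to departures from normality
        stat, p_value = levene(train_scores, test_scores)
        test_name = "Levene"

    return test_name, stat, p_value

# Hypothetical score lists (placeholders)
print(compare_variances([0.87, 0.82, 0.80, 0.76], [0.85, 0.78, 0.79, 0.74]))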