Check Performance Central Tendencies

Procedure

This procedure checks whether model performance generalizes from the train set to the test set by testing whether the central tendencies of the performance score distributions on the two sets are statistically consistent.

  1. Evaluate Performance on Train Set

    • What to do: Apply the chosen suite of machine learning algorithms to the train set using the selected test harness.
      • Use methods like 10-fold cross-validation for robust performance evaluation.
      • Record performance metrics (e.g., accuracy, F1-score, RMSE) for each algorithm.
  2. Evaluate Performance on Test Set

    • What to do: Apply the same algorithms to the test set.
      • Use identical performance metrics as those used for the train set evaluation.
      • Verify that no data leakage has occurred between train and test sets.
  3. Gather Performance Metrics

    • What to do: Collect the performance scores for each algorithm from the train and test sets.
      • Organize the metrics in a clear format, such as a table with rows for algorithms and columns for train and test scores (as sketched after this list).
  4. Assess Distribution of Performance Scores

    • What to do: Analyze the distribution of performance scores for the train and test sets.
      • Note: We could assess the distributions of scores per algorithm and/or across algorithms.
      • Check if the distributions are normal using statistical tests like Shapiro-Wilk or visual tools like histograms.
      • If normally distributed, consider parametric tests (e.g., t-test). If non-normal, use non-parametric tests (e.g., Mann-Whitney U Test).
  5. Perform Statistical Test on Distributions

    • What to do: Conduct a statistical test to compare the central tendencies of train and test performance scores.
      • If using a t-test, ensure equal variance assumptions are met; otherwise, use Welch’s t-test.
      • For non-parametric methods, compare the ranks of the scores with the Mann-Whitney U Test to detect significant differences (see the sketches following this list).
  6. Interpret Test Results

    • What to do: Assess the p-value from the statistical test.
      • A high p-value (e.g., > 0.05) indicates no statistically significant difference between the train and test score distributions, consistent with effective generalization.
      • A low p-value (e.g., < 0.05) suggests the score distributions differ, which may point to a problematic train/test split or overfitting and warrants further investigation.
  7. Report Findings

    • What to do: Summarize the outcomes and implications of the statistical test.
      • Present the performance metrics and statistical test results in a concise format.
      • Highlight any discrepancies and suggest next steps, such as re-examining data preprocessing, revisiting model selection, or adjusting the train/test split.
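
This sketch makes steps 1 through 3 concrete: it runs k-fold cross-validation for each algorithm on the train and test sets and gathers the mean scores into a table. It is a minimal sketch assuming scikit-learn estimators and accuracy as the metric; the model list, the pandas table layout, and the synthetic demo data are illustrative choices, not part of the procedure.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

def gather_scores(X_train, y_train, X_test, y_test, k=10):
    """Run k-fold cross-validation per algorithm on each split and tabulate mean scores."""
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "DecisionTree": DecisionTreeClassifier(),
        "MostFrequentBaseline": DummyClassifier(strategy="most_frequent"),
    }
    rows = []
    for name, model in models.items():
        # Step 1: cross-validated scores on the train set
        train_mean = np.mean(cross_val_score(model, X_train, y_train, scoring="accuracy", cv=k))
        # Step 2: the same metric and test harness on the test set
        test_mean = np.mean(cross_val_score(model, X_test, y_test, scoring="accuracy", cv=k))
        # Step 3: one row per algorithm, one column per split
        rows.append({"algorithm": name, "train_score": train_mean, "test_score": test_mean})
    return pd.DataFrame(rows)

# Illustrative usage on a synthetic dataset (sizes are assumptions)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(gather_scores(X_train, y_train, X_test, y_test))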
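
This sketch covers steps 4 through 6 on two lists of per-algorithm mean scores: a Shapiro-Wilk normality check, the choice between Welch's t-test and the Mann-Whitney U Test, and interpretation of the p-value. It is a minimal sketch; the 0.05 threshold and the hard-coded score lists are illustrative assumptions, and the full diagnostic function in the code example below puts all the pieces together.

from scipy.stats import mannwhitneyu, shapiro, ttest_ind

def compare_central_tendencies(train_scores, test_scores, alpha=0.05):
    """Steps 4-6: pick a statistical test based on normality and interpret the p-value."""
    # Step 4: Shapiro-Wilk normality check on each set of scores
    _, p_train = shapiro(train_scores)
    _, p_test = shapiro(test_scores)
    if p_train > alpha and p_test > alpha:
        # Step 5 (parametric): Welch's t-test, which does not assume equal variances
        method = "Welch's t-test"
        stat, p_value = ttest_ind(train_scores, test_scores, equal_var=False)
    else:
        # Step 5 (non-parametric): rank-based Mann-Whitney U Test
        method = "Mann-Whitney U Test"
        stat, p_value = mannwhitneyu(train_scores, test_scores, alternative="two-sided")
    # Step 6: interpret the p-value at the chosen significance level
    verdict = "no significant difference" if p_value > alpha else "significant difference"
    return {"method": method, "statistic": stat, "p-value": p_value, "verdict": verdict}

# Illustrative per-algorithm mean scores (assumed values, not real results)
print(compare_central_tendencies([0.91, 0.88, 0.85, 0.64], [0.89, 0.86, 0.83, 0.62]))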

Code Example

This Python function evaluates the quality of a train/test split by comparing the central tendencies of model performance metrics across a suite of machine learning algorithms.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import ttest_ind, mannwhitneyu, shapiro

def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing the central tendencies
    of model performance scores using statistical tests.

    Parameters:
        X_train, y_train: Training features and target.
        X_test, y_test: Testing features and target.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Summary of test results, including statistical comparison and interpretation.
    """

    # Define a list of machine learning algorithms
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified"),
    ]

    train_scores = []
    test_scores = []

    # Evaluate each algorithm on train and test sets
    for model in algorithms:
        # Cross-validation on train set
        train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_scores))

        # Cross-validation on test set
        test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_scores))

    # Check normality of scores
    _, p_train_normal = shapiro(train_scores)
    _, p_test_normal = shapiro(test_scores)

    # Choose appropriate statistical test based on normality
    alpha = 0.05
    if p_train_normal > alpha and p_test_normal > alpha:
        test_method = "t-test"
        stat, p_value = ttest_ind(train_scores, test_scores, equal_var=False)
    else:
        test_method = "Mann-Whitney U Test"
        stat, p_value = mannwhitneyu(train_scores, test_scores)

    # Interpret the results
    significance = "significant" if p_value <= alpha else "not significant"
    if significance == "significant":
        interpretation = (
            "The central tendencies of train and test scores are significantly different, indicating potential issues with the train/test split."
        )
    else:
        interpretation = (
            "The central tendencies of train and test scores are not significantly different, suggesting the train/test split is likely appropriate."
        )

    return {
        "Test Method": test_method,
        "Statistic Value": stat,
        "p-value": p_value,
        "Significance": significance,
        "Interpretation": interpretation,
        "N Models Evaluated": len(algorithms),
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform diagnostic evaluation
    results = evaluate_train_test_split(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")

Example Output

Train/Test Split Diagnostic Results:
Test Method: Mann-Whitney U Test
Statistic Value: 31.0
p-value: 0.9579672518055619
Significance: not significant
Interpretation: The central tendencies of train and test scores are not significantly different, suggesting the train/test split is likely appropriate.
N Models Evaluated: 8