Check Performance Correlation

Procedure

This procedure evaluates whether model performance on the train and test sets is consistent, ensuring that the distributions of performance scores across algorithms are comparable. This helps verify that the train/test split provides a reliable basis for estimating how a model will generalize to new data.

  1. Evaluate Performance on Train Set

    • What to do: Apply each algorithm to the train set using the chosen test harness.
      • Use methods like 10-fold cross-validation to evaluate performance consistently.
      • Record the performance metric (e.g., accuracy, RMSE) for each algorithm.
  2. Evaluate Performance on Test Set

    • What to do: Apply the same algorithms to the test set.
      • Use the same performance metric as in the train set evaluation.
      • Ensure no data leakage by strictly separating train and test data.
  3. Gather Performance Scores

    • What to do: Collect the performance metrics from both train and test evaluations.
      • Organize the scores in a structured format, such as a table with rows for algorithms and columns for train and test scores.
  4. Check Distribution Assumptions

    • What to do: Analyze the distributions of performance scores for the train and test sets.
      • If the scores are normally distributed, use Pearson’s correlation coefficient.
      • If non-normal, use Spearman’s rank correlation.
  5. Calculate Correlation Between Train and Test Scores

    • What to do: Compute the correlation coefficient between the train and test scores across algorithms (a minimal sketch of steps 3 through 5 follows this procedure).
      • High correlation indicates that performance on the train set tracks performance on the test set, i.e. consistent generalization across the split.
      • Low correlation may suggest issues with the train/test split or overfitting.
  6. Interpret the Results

    • What to do: Assess the correlation result in the context of generalization.
      • Comment on whether the train and test performance metrics align as expected.
      • Identify potential reasons for discrepancies, such as data leakage, insufficient train data, or algorithm bias.
  7. Report the Findings

    • What to do: Summarize the test outcomes and implications.
      • Present the correlation value alongside the individual train and test scores for each algorithm.
      • Highlight any algorithms with significant divergence and suggest next steps, such as revisiting the train/test split or improving data preprocessing (a plotting sketch at the end of this section shows one way to visualize such divergence).
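
As a minimal sketch of steps 3 through 5, assuming the per-algorithm mean scores from steps 1 and 2 are already available (the values below are hypothetical placeholders) and that pandas and SciPy are installed, the scores can be tabulated and the correlation method chosen from a Shapiro-Wilk normality check:

import pandas as pd
from scipy.stats import shapiro, pearsonr, spearmanr

# Hypothetical per-algorithm mean scores gathered in steps 1 and 2
scores = pd.DataFrame({
    "algorithm": ["LogisticRegression", "DecisionTree", "KNN", "SVC", "RandomForest"],
    "train": [0.86, 0.81, 0.79, 0.85, 0.88],
    "test": [0.84, 0.77, 0.78, 0.83, 0.85],
})

# Step 4: check whether both score distributions look approximately normal
_, p_train = shapiro(scores["train"])
_, p_test = shapiro(scores["test"])

# Step 5: Pearson for normal-looking scores, Spearman ranks otherwise
if p_train > 0.05 and p_test > 0.05:
    correlation, p_value = pearsonr(scores["train"], scores["test"])
else:
    correlation, p_value = spearmanr(scores["train"], scores["test"])

print(scores.to_string(index=False))
print(f"correlation={correlation:.3f}, p-value={p_value:.3f}")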

Code Example

This Python function evaluates the quality of the train/test split by correlating train-set and test-set performance across a suite of machine learning algorithms.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from scipy.stats import spearmanr, pearsonr, shapiro

def evaluate_train_test_split(X_train, y_train, X_test, y_test, metric, k):
    """
    Evaluate the quality of the train/test split by comparing model performance distributions.

    Parameters:
        X_train, y_train: training features and target.
        X_test, y_test: testing features and target.
        metric (callable): Performance metric function.
        k (int): Number of folds for k-fold cross-validation.

    Returns:
        dict: Results including correlation between train and test performance and interpretation.
    """

    # Define a suite of algorithms to evaluate
    algorithms = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
        SVC(probability=True),
        RandomForestClassifier(n_estimators=50),
        GradientBoostingClassifier(),
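        # Dummy baselines provide low-end reference points for the score spread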
        DummyClassifier(strategy="most_frequent"),
        DummyClassifier(strategy="stratified"),
    ]

    train_scores = []
    test_scores = []

    # Evaluate each algorithm
    for model in algorithms:
        # Cross-validate on train set
        train_cv_scores = cross_val_score(model, X_train, y_train, scoring=make_scorer(metric), cv=k)
        train_scores.append(np.mean(train_cv_scores))

        # Cross-validate on test set
        test_cv_scores = cross_val_score(model, X_test, y_test, scoring=make_scorer(metric), cv=k)
        test_scores.append(np.mean(test_cv_scores))

    # Check normality of the train and test score distributions
    _, p_train_normal = shapiro(train_scores)
    _, p_test_normal = shapiro(test_scores)

    # Choose appropriate correlation method
    alpha = 0.05
    if p_train_normal > alpha and p_test_normal > alpha:
        method = "Pearson"
        correlation, p_value = pearsonr(train_scores, test_scores)
    else:
        method = "Spearman"
        correlation, p_value = spearmanr(train_scores, test_scores)

    # Interpret the results
    significance = "significant" if p_value <= alpha else "not significant"
    if correlation > 0.8:
        interpretation = "Good train/test split: high correlation between train and test performance."
    elif 0.5 < correlation <= 0.8:
        interpretation = "Moderate train/test split: some divergence in train and test performance."
    else:
        interpretation = "Poor train/test split: low correlation indicates issues in generalization."

    return {
        "N": len(algorithms),
        "Method": method,
        "Significance": significance,
        "p-value": p_value,
        "Correlation": correlation,
        "Interpretation": interpretation
    }

# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Perform the diagnostic test
    results = evaluate_train_test_split(
        X_train, y_train,
        X_test, y_test,
        metric=accuracy_score,
        k=10
    )

    # Print results
    print("Train/Test Split Diagnostic Results:")
    for key, value in results.items():
        print(f"{key}: {value}")

Example Output

Train/Test Split Diagnostic Results:
N: 8
Method: Spearman
Significance: significant
p-value: 5.296153515645123e-07
Correlation: 0.9940297973880048
Interpretation: Good train/test split: high correlation between train and test performance.
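
To support step 7, the per-algorithm train and test scores can also be reported visually. The sketch below assumes matplotlib is installed and uses hypothetical score lists in place of the values computed inside evaluate_train_test_split; points far from the diagonal mark algorithms whose train and test performance diverge.

import matplotlib.pyplot as plt

# Hypothetical per-algorithm mean scores from the train and test evaluations
names = ["LogReg", "Tree", "KNN", "SVC", "RF", "GBM", "Dummy-mf", "Dummy-strat"]
train_scores = [0.86, 0.81, 0.79, 0.85, 0.88, 0.87, 0.52, 0.50]
test_scores = [0.84, 0.77, 0.78, 0.83, 0.85, 0.86, 0.51, 0.49]

fig, ax = plt.subplots()
ax.scatter(train_scores, test_scores)
for name, x, y in zip(names, train_scores, test_scores):
    ax.annotate(name, (x, y), fontsize=8)

# Diagonal reference line: equal train and test performance
lims = [min(train_scores + test_scores), max(train_scores + test_scores)]
ax.plot(lims, lims, linestyle="--", color="gray")
ax.set_xlabel("Train score (cross-validated)")
ax.set_ylabel("Test score (cross-validated)")
ax.set_title("Per-algorithm train vs. test performance")
plt.show()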