Compare Numerical Input-Target Pairwise Correlations

Procedure

This procedure evaluates whether the pairwise correlations between numerical input variables and the numerical target variable are consistent across the train and test sets.

  1. Test for Normality

    • What to do: Determine whether the numerical variables in both train and test sets follow a normal distribution.
      • Use tests such as Shapiro-Wilk or Anderson-Darling for small to moderate samples.
      • For larger datasets, consider the Kolmogorov-Smirnov test, bearing in mind that as sample size grows these tests flag even trivial deviations from normality.
      • Supplement with visualizations such as histograms or Q-Q plots for a clearer interpretation.
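
This step can be sketched as follows; the sample sizes, variable names, and the 0.05 cutoff are illustrative assumptions, not fixed conventions:

```python
import numpy as np
from scipy.stats import shapiro, kstest

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

# Shapiro-Wilk works well for small-to-moderate samples.
_, p_normal = shapiro(normal_sample)
_, p_skewed = shapiro(skewed_sample)

# Kolmogorov-Smirnov against a standard normal; standardize first so the
# reference distribution matches the data's location and scale.
standardized = (normal_sample - normal_sample.mean()) / normal_sample.std()
_, p_ks = kstest(standardized, 'norm')

looks_normal = p_normal > 0.05         # fail to reject normality
looks_skewed_normal = p_skewed > 0.05  # exponential data should be rejected
```

In practice, pair these p-values with histograms or Q-Q plots, since the tests alone can be either underpowered (small samples) or oversensitive (large samples).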
  2. Select Correlation Measure

    • What to do: Choose an appropriate correlation metric based on the data distributions.
      • If variables are normally distributed, use Pearson’s correlation.
      • If variables are not normally distributed or contain outliers, use Spearman’s rank correlation.
      • Ensure the same metric is applied across both train and test sets for consistency.
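
One way to sketch this selection logic (the `choose_correlation` helper and the 0.05 cutoff are illustrative assumptions):

```python
import numpy as np
from scipy.stats import shapiro, pearsonr, spearmanr

def choose_correlation(x, y, alpha=0.05):
    """Use Pearson when both variables look normal, otherwise Spearman."""
    _, p_x = shapiro(x)
    _, p_y = shapiro(y)
    if p_x > alpha and p_y > alpha:
        return 'pearson', pearsonr(x, y)[0]
    return 'spearman', spearmanr(x, y)[0]

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=150)        # heavily skewed input
y = 2.0 * x + rng.normal(scale=0.5, size=150)   # strong monotone relationship
method, corr = choose_correlation(x, y)
```

Whichever metric the rule selects on the training data should then also be applied to the test data, so that the two sets of coefficients are directly comparable.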
  3. Compute Correlations

    • What to do: Calculate pairwise correlations between numerical input variables and the numerical target variable for both train and test sets.
      • Perform this computation independently for each dataset.
      • Record the correlation coefficients for each variable.
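
A minimal sketch of this computation, assuming features are stored as NumPy arrays (`feature_target_correlations` is a hypothetical helper name):

```python
import numpy as np
from scipy.stats import pearsonr

def feature_target_correlations(X, y):
    """Correlation of each column of X with the target y."""
    return np.array([pearsonr(X[:, j], y)[0] for j in range(X.shape[1])])

rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=200)  # target tracks feature 0
X_test = rng.normal(size=(100, 3))
y_test = X_test[:, 0] + rng.normal(scale=0.1, size=100)

# Compute each dataset's coefficients independently.
train_corrs = feature_target_correlations(X_train, y_train)
test_corrs = feature_target_correlations(X_test, y_test)
```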
  4. Compare Correlations

    • What to do: Evaluate the differences in correlations between the train and test sets.
      • Use a threshold on the absolute difference in coefficients (e.g., 0.1) to flag practically large deviations.
      • Alternatively, apply Fisher’s z-test to assess whether correlation differences are statistically significant.
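
The threshold-based variant can be sketched as follows (the 0.1 cutoff mirrors the example above, and `flag_correlation_drift` is an illustrative name); the Fisher z-test alternative appears in the full code example later in this section:

```python
import numpy as np

def flag_correlation_drift(train_corrs, test_corrs, threshold=0.1):
    """Return indices of features whose train/test correlation gap exceeds threshold."""
    gaps = np.abs(np.asarray(train_corrs) - np.asarray(test_corrs))
    return np.where(gaps > threshold)[0], gaps

train_corrs = [0.95, -0.12, 0.04]
test_corrs = [0.93, 0.15, 0.02]
flagged, gaps = flag_correlation_drift(train_corrs, test_corrs)
# Only the middle feature moves by more than 0.1 between the two sets.
```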
  5. Interpret the Results

    • What to do: Analyze the findings to determine if correlation structures are consistent across the datasets.
      • Large differences or statistically significant deviations indicate potential data drift or mismatches in relationships.
      • Consistent correlations suggest the relationships between inputs and the target variable are stable.
  6. Report the Findings

    • What to do: Summarize the results of the correlation analysis in a clear format.
      • Highlight input variables with significant correlation differences.
      • Include any statistical test results, thresholds, and implications for model training and evaluation.
      • Provide recommendations for addressing inconsistencies, such as feature engineering or re-splitting the datasets.
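
A lightweight reporting sketch along these lines (the output format and the `REVIEW`/`OK` labels are arbitrary choices):

```python
def summarize_correlation_report(results, threshold=0.1):
    """Format per-feature train/test correlations, flagging large gaps."""
    lines = []
    for name, (r_train, r_test) in results.items():
        gap = abs(r_train - r_test)
        status = 'REVIEW' if gap > threshold else 'OK'
        lines.append(f"{name}: train={r_train:+.2f}  test={r_test:+.2f}  gap={gap:.2f}  [{status}]")
    return '\n'.join(lines)

report = summarize_correlation_report({'age': (0.42, 0.40), 'income': (0.31, 0.05)})
print(report)
```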

Code Example

This Python function checks whether the pairwise correlations between numerical input variables and the numerical target variable are consistent across train and test sets, using Fisher's z-test and providing an interpretation of the results.

import numpy as np
from scipy.stats import norm, pearsonr, spearmanr

def correlation_consistency_test(X_train, y_train, X_test, y_test, alpha=0.05, correlation_type='pearson'):
    """
    Test whether the pairwise correlations between features and target variable
    are consistent across train and test sets.

    Parameters:
        X_train (np.ndarray): Training set features.
        y_train (np.ndarray): Training set target variable.
        X_test (np.ndarray): Test set features.
        y_test (np.ndarray): Test set target variable.
        alpha (float): P-value threshold for statistical significance (default=0.05).
        correlation_type (str): Type of correlation ('pearson' or 'spearman').

    Returns:
        dict: Dictionary with correlation test results and interpretation for each feature.
    """
    results = {}

    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Compute correlations
        if correlation_type == 'pearson':
            corr_train, _ = pearsonr(train_feature, y_train)
            corr_test, _ = pearsonr(test_feature, y_test)
        elif correlation_type == 'spearman':
            corr_train, _ = spearmanr(train_feature, y_train)
            corr_test, _ = spearmanr(test_feature, y_test)
        else:
            raise ValueError("correlation_type must be 'pearson' or 'spearman'.")

        # Fisher's z-test for the difference between two correlations
        n_train, n_test = len(train_feature), len(test_feature)
        z_score = (np.arctanh(corr_train) - np.arctanh(corr_test)) / np.sqrt(1 / (n_train - 3) + 1 / (n_test - 3))
        p_value = 2 * (1 - norm.cdf(abs(z_score)))  # Two-tailed p-value from the standard normal CDF

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Correlations differ significantly between train and test sets."
            if p_value <= alpha
            else "Correlations are consistent between train and test sets."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Train Correlation": corr_train,
            "Test Correlation": corr_test,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    y_train = X_train[:, 0] + np.random.normal(scale=0.1, size=100)  # Linear relation with first feature
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))
    y_test = X_test[:, 0] + np.random.normal(scale=0.1, size=100)  # Slightly shifted relation

    # Perform the diagnostic test
    results = correlation_consistency_test(X_train, y_train, X_test, y_test, alpha=0.05, correlation_type='pearson')

    # Print results
    print("Correlation Consistency Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Correlation Consistency Test Results:
Feature 0:
  - Train Correlation: 0.9948
  - Test Correlation: 0.9953
  - P-Value: 0.7053
  - Significance: not significant
  - Interpretation: Correlations are consistent between train and test sets.
Feature 1:
  - Train Correlation: -0.1482
  - Test Correlation: -0.0089
  - P-Value: 0.3283
  - Significance: not significant
  - Interpretation: Correlations are consistent between train and test sets.
Feature 2:
  - Train Correlation: 0.0444
  - Test Correlation: 0.0368
  - P-Value: 0.9577
  - Significance: not significant
  - Interpretation: Correlations are consistent between train and test sets.
Feature 3:
  - Train Correlation: -0.0842
  - Test Correlation: -0.0065
  - P-Value: 0.5877
  - Significance: not significant
  - Interpretation: Correlations are consistent between train and test sets.
Feature 4:
  - Train Correlation: -0.1258
  - Test Correlation: 0.1099
  - P-Value: 0.0991
  - Significance: not significant
  - Interpretation: Correlations are consistent between train and test sets.