Compare Numerical Input-Input Pairwise Correlations

Diagnostics

Do numerical input variables have the same pairwise correlations in the train and test sets?

Procedure

This procedure evaluates whether the pair-wise correlations between numerical input variables are consistent between the train and test sets.

Calculate Pair-Wise Correlations
- What to do: Compute the correlation coefficients for numerical variables in both the train and test sets.
  - Use Pearson’s correlation for linear relationships.
  - Use Spearman’s rank correlation when relationships are expected to be monotonic but not necessarily linear.
  - Create correlation matrices for both train and test datasets for clarity.
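As a minimal sketch of this step, assuming the train and test sets are available as pandas DataFrames, the correlation matrices can be built with DataFrame.corr. The helper name correlation_matrices and the column names a, b, c are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def correlation_matrices(train_df, test_df, method="pearson"):
    """Return (train_corr, test_corr) for numerical columns present in both frames."""
    # Keep only numerical columns that exist in both train and test.
    num_cols = train_df.select_dtypes(include="number").columns.intersection(
        test_df.select_dtypes(include="number").columns
    )
    # method may be "pearson" (linear) or "spearman" (rank-based, monotonic).
    train_corr = train_df[num_cols].corr(method=method)
    test_corr = test_df[num_cols].corr(method=method)
    return train_corr, test_corr

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
test_df = pd.DataFrame(rng.normal(size=(80, 3)), columns=["a", "b", "c"])

train_corr, test_corr = correlation_matrices(train_df, test_df, method="spearman")
print(train_corr.round(2))
print(test_corr.round(2))
```

Both matrices share the same row and column order, which makes the later element-wise comparison straightforward.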
Define a Threshold for Consistency
- What to do: Determine the acceptable threshold for correlation differences.
  - A typical threshold might be an absolute difference of 0.1 to 0.2, depending on the specific use case and dataset.
  - Ensure the threshold is set before performing further analysis to avoid bias.
Compare Correlation Matrices
- What to do: Evaluate pair-wise correlations across the train and test sets for each variable pair.
  - Subtract the correlation values of the test set from those of the train set for each pair of variables.
  - Identify the variable pairs where the absolute difference exceeds the predefined threshold.
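The comparison above can be sketched as follows, assuming the two correlation matrices are pandas DataFrames with the same variable names. The function name compare_correlation_matrices is hypothetical, and the matrix values are hand-written for illustration, not taken from real data.

```python
import numpy as np
import pandas as pd

def compare_correlation_matrices(train_corr, test_corr, threshold=0.1):
    """List each variable pair once with its train/test correlation and difference."""
    cols = train_corr.columns
    # Align the test matrix to the train matrix's row and column order.
    test_corr = test_corr.loc[cols, cols]
    # Strict upper triangle: each unordered pair once, diagonal excluded.
    i_idx, j_idx = np.triu_indices(len(cols), k=1)
    train_vals = train_corr.to_numpy()[i_idx, j_idx]
    test_vals = test_corr.to_numpy()[i_idx, j_idx]
    pairs = pd.DataFrame({
        "var_1": cols.to_numpy()[i_idx],
        "var_2": cols.to_numpy()[j_idx],
        "train_corr": train_vals,
        "test_corr": test_vals,
        "diff": train_vals - test_vals,  # train minus test
    })
    pairs["exceeds_threshold"] = pairs["diff"].abs() > threshold
    return pairs

# Illustrative (made-up) correlation matrices.
names = ["a", "b", "c"]
train_corr = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, -0.3], [0.1, -0.3, 1.0]], index=names, columns=names
)
test_corr = pd.DataFrame(
    [[1.0, 0.5, 0.15], [0.5, 1.0, -0.25], [0.15, -0.25, 1.0]], index=names, columns=names
)

pairs = compare_correlation_matrices(train_corr, test_corr, threshold=0.1)
flagged = pairs[pairs["exceeds_threshold"]]
print(flagged)
```

In this made-up example only the pair (a, b) is flagged, because its correlation drops from 0.8 to 0.5 while the other pairs differ by about 0.05.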
Perform Statistical Test (Optional)
- What to do: Apply a statistical test to confirm differences in correlation coefficients if required.
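  - Use Fisher’s z-test (based on the Fisher z-transformation) to assess the significance of the difference between the train and test correlation coefficients for a pair. The test assumes Pearson correlations computed on independent samples.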
  - Record the p-values to determine if observed differences are statistically significant.
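One common form of Fisher’s z-test for two independent Pearson correlations is sketched below. It transforms each coefficient with arctanh, divides the difference by sqrt(1/(n_train - 3) + 1/(n_test - 3)), and compares the result with a standard normal distribution. The function name fisher_z_test is hypothetical, and the sketch assumes independent samples and approximately bivariate normal data; it is not calibrated for Spearman coefficients.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(r_train, n_train, r_test, n_test):
    """Two-sided test for a difference between two independent Pearson correlations."""
    if n_train <= 3 or n_test <= 3:
        raise ValueError("Each sample needs more than 3 observations.")
    # Clip to avoid infinite values from arctanh when |r| is exactly 1.
    r_train, r_test = np.clip([r_train, r_test], -0.999999, 0.999999)
    # Fisher z-transformation of each correlation coefficient.
    z_train = np.arctanh(r_train)
    z_test = np.arctanh(r_test)
    # Standard error of the difference between the transformed values.
    se = np.sqrt(1.0 / (n_train - 3) + 1.0 / (n_test - 3))
    z_stat = (z_train - z_test) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * norm.sf(abs(z_stat))
    return float(z_stat), float(p_value)

# Illustrative values: r = 0.8 in train vs. r = 0.5 in test, 100 rows each.
z_stat, p_value = fisher_z_test(0.8, 100, 0.5, 100)
print(f"z = {z_stat:.3f}, p = {p_value:.5f}")
```

With large samples, even small differences can be statistically significant, so the p-value is best read alongside the absolute-difference threshold rather than in place of it.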
Interpret the Results
- What to do: Assess whether the train and test sets have consistent pair-wise correlations.
  - Significant deviations in correlations suggest differences in relationships between variables, which may impact model performance.
  - Non-significant deviations within the threshold imply consistent relationships.
Report the Findings
- What to do: Document the comparison results and any identified discrepancies.
  - Highlight variable pairs with significant correlation differences.
  - Provide an explanation of the threshold and statistical tests used.
  - Include recommendations for further data processing or adjustments if inconsistencies are found.

Code Example

This Python function checks whether the pair-wise correlations between numerical input variables are consistent between the train and test sets, using an absolute-difference threshold, and returns an interpretation for each variable pair.

import numpy as np
from scipy.stats import spearmanr, pearsonr

def correlation_consistency_test(X_train, X_test, threshold=0.1, method="pearson"):
    """
    Test whether pair-wise correlations between numerical input variables
    in the train and test sets are consistent.

    Parameters:
        X_train (np.ndarray): Train set features (2D array).
        X_test (np.ndarray): Test set features (2D array).
        threshold (float): Threshold for acceptable correlation differences (default=0.1).
        method (str): Correlation method, "pearson" or "spearman" (default="pearson").

    Returns:
        dict: Dictionary with test results and interpretation for each variable pair.
    """
    if method not in ["pearson", "spearman"]:
        raise ValueError("Method must be 'pearson' or 'spearman'.")

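    # Both sets must describe the same features in the same column order.
    if X_train.shape[1] != X_test.shape[1]:
        raise ValueError("X_train and X_test must have the same number of features.")

    results = {}
    n_features = X_train.shape[1]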

    for i in range(n_features):
        for j in range(i + 1, n_features):
            # Calculate correlations
            if method == "pearson":
                corr_train, _ = pearsonr(X_train[:, i], X_train[:, j])
                corr_test, _ = pearsonr(X_test[:, i], X_test[:, j])
            elif method == "spearman":
                corr_train, _ = spearmanr(X_train[:, i], X_train[:, j])
                corr_test, _ = spearmanr(X_test[:, i], X_test[:, j])

            # Calculate absolute difference
            diff = abs(corr_train - corr_test)

            # Interpretation
            consistency = "consistent" if diff <= threshold else "inconsistent"
            interpretation = (
                f"Pair ({i}, {j}) has {consistency} correlations: "
                f"Train={corr_train:.2f}, Test={corr_test:.2f}, Diff={diff:.2f}."
            )

            # Store results
            results[f"Pair ({i}, {j})"] = {
                "Train Correlation": corr_train,
                "Test Correlation": corr_test,
                "Difference": diff,
                "Threshold": threshold,
                "Consistency": consistency,
                "Interpretation": interpretation,
            }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(size=(100, 4))
    X_test = X_train + np.random.normal(scale=0.1, size=(100, 4))  # Slightly perturbed

    # Perform the diagnostic test
    results = correlation_consistency_test(X_train, X_test, threshold=0.1, method="pearson")

    # Print results
    print("Correlation Consistency Test Results:")
    for pair, result in results.items():
        print(f"{pair}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Correlation Consistency Test Results:
Pair (0, 1):
  - Train Correlation: -0.016389923345609084
  - Test Correlation: -0.028775165433341975
  - Difference: 0.012385242087732892
  - Threshold: 0.1
  - Consistency: consistent
  - Interpretation: Pair (0, 1) has consistent correlations: Train=-0.02, Test=-0.03, Diff=0.01.
Pair (0, 2):
  - Train Correlation: 0.01774451545446517
  - Test Correlation: 0.03387829754582042
  - Difference: 0.01613378209135525
  - Threshold: 0.1
  - Consistency: consistent
  - Interpretation: Pair (0, 2) has consistent correlations: Train=0.02, Test=0.03, Diff=0.02.
Pair (0, 3):
  - Train Correlation: -0.019994328273540824
  - Test Correlation: 0.006216470266065449
  - Difference: 0.026210798539606273
  - Threshold: 0.1
  - Consistency: consistent
  - Interpretation: Pair (0, 3) has consistent correlations: Train=-0.02, Test=0.01, Diff=0.03.
Pair (1, 2):
  - Train Correlation: -0.03366825983055228
  - Test Correlation: -0.018357623787524795
  - Difference: 0.015310636043027483
  - Threshold: 0.1
  - Consistency: consistent
  - Interpretation: Pair (1, 2) has consistent correlations: Train=-0.03, Test=-0.02, Diff=0.02.
Pair (1, 3):
  - Train Correlation: 0.136855702531469
  - Test Correlation: 0.13430096961879714
  - Difference: 0.0025547329126718588
  - Threshold: 0.1
  - Consistency: consistent
  - Interpretation: Pair (1, 3) has consistent correlations: Train=0.14, Test=0.13, Diff=0.00.
Pair (2, 3):
  - Train Correlation: -0.03806988827026863
  - Test Correlation: -0.06319059547880582
  - Difference: 0.025120707208537187
  - Threshold: 0.1
  - Consistency: consistent
  - Interpretation: Pair (2, 3) has consistent correlations: Train=-0.04, Test=-0.06, Diff=0.03.
