Compare Numerical Input-Input Pairwise Correlations
Do numerical input variables have the same pairwise correlations in the train and test sets?
Procedure
This procedure evaluates whether the pair-wise correlations between numerical input variables are consistent between the train and test sets.
Calculate Pair-Wise Correlations
- What to do: Compute the correlation coefficients for numerical variables in both the train and test sets.
- Use Pearson’s correlation for linear relationships.
- Use Spearman’s rank correlation when relationships are expected to be monotonic but not necessarily linear.
- Create correlation matrices for both train and test datasets for clarity.
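As a minimal sketch of this step, assuming the train and test features are held in pandas DataFrames, the matrices can be computed with `DataFrame.corr`. The names `train_df`, `test_df`, and `correlation_matrices`, and the synthetic columns, are illustrative and not part of the procedure.

```python
import numpy as np
import pandas as pd

# Illustrative synthetic data; in practice these would be the real splits.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
test_df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])


def correlation_matrices(train, test, method="pearson"):
    """Return (train_corr, test_corr) over numeric columns present in both frames."""
    numeric_cols = train.select_dtypes(include="number").columns.intersection(
        test.select_dtypes(include="number").columns
    )
    train_corr = train[numeric_cols].corr(method=method)
    test_corr = test[numeric_cols].corr(method=method)
    return train_corr, test_corr


# method may be "pearson" or "spearman", matching the two options above.
train_corr, test_corr = correlation_matrices(train_df, test_df, method="spearman")
print(train_corr.round(2))
print(test_corr.round(2))
```

Restricting both matrices to the shared numeric columns keeps them aligned, so they can later be compared cell by cell.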
Define a Threshold for Consistency
- What to do: Determine the acceptable threshold for correlation differences.
- A typical threshold might be an absolute difference of 0.1 to 0.2, depending on the specific use case and dataset.
- Ensure the threshold is set before performing further analysis to avoid bias.
Compare Correlation Matrices
- What to do: Evaluate pair-wise correlations across the train and test sets for each variable pair.
- Subtract the correlation values of the test set from those of the train set for each pair of variables.
- Identify the variable pairs where the absolute difference exceeds the predefined threshold.
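The comparison can be sketched as follows, assuming both correlation matrices are pandas DataFrames with identical row and column labels. The helper name `flag_inconsistent_pairs` and the hand-made matrices are hypothetical and serve only to illustrate the subtraction and thresholding.

```python
import pandas as pd


def flag_inconsistent_pairs(train_corr, test_corr, threshold=0.1):
    """List variable pairs whose train/test correlation differs by more than threshold."""
    diff = (train_corr - test_corr).abs()
    cols = list(diff.columns)
    flagged = []
    # Visit only the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            d = float(diff.iloc[i, j])
            if d > threshold:
                flagged.append((cols[i], cols[j], d))
    return flagged


# Hand-made matrices for illustration only.
cols = ["a", "b", "c"]
train_corr = pd.DataFrame(
    [[1.0, 0.50, 0.10], [0.50, 1.0, 0.30], [0.10, 0.30, 1.0]],
    index=cols, columns=cols,
)
test_corr = pd.DataFrame(
    [[1.0, 0.20, 0.15], [0.20, 1.0, 0.30], [0.15, 0.30, 1.0]],
    index=cols, columns=cols,
)

flagged = flag_inconsistent_pairs(train_corr, test_corr, threshold=0.1)
for first, second, d in flagged:
    print(f"{first} vs {second}: absolute difference {d:.2f}")
```

Here only the pair (a, b) is flagged, because its difference is about 0.30 while the other pairs differ by at most 0.05.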
Perform Statistical Test (Optional)
- What to do: Apply a statistical test to confirm differences in correlation coefficients if required.
- Use Fisher’s z-test (based on Fisher’s r-to-z transformation) to assess the significance of differences between correlation coefficients from two independent samples.
- Record the p-values to determine if observed differences are statistically significant.
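One way to sketch this test is shown below. It assumes Pearson correlations from two independent samples, each with more than three observations and |r| < 1. The function name `fisher_z_test` and the example numbers are illustrative. For Spearman correlations the same formula is at best an approximation.

```python
import math

from scipy.stats import norm


def fisher_z_test(r_train, n_train, r_test, n_test):
    """Two-sided test of H0: the two population correlations are equal."""
    # Fisher's r-to-z transformation of each coefficient.
    z_train = math.atanh(r_train)
    z_test = math.atanh(r_test)
    # Standard error of the difference between the transformed values.
    se = math.sqrt(1.0 / (n_train - 3) + 1.0 / (n_test - 3))
    z = (z_train - z_test) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * norm.sf(abs(z))
    return z, p_value


# Illustrative values: r = 0.50 on 500 train rows, r = 0.20 on 200 test rows.
z, p = fisher_z_test(0.50, 500, 0.20, 200)
print(f"z = {z:.3f}, p = {p:.4g}")
```

When many variable pairs are tested, some small p-values are expected by chance, so the p-values are best read alongside the threshold check rather than in isolation.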
Interpret the Results
- What to do: Assess whether the train and test sets have consistent pair-wise correlations.
- Significant deviations in correlations suggest differences in relationships between variables, which may impact model performance.
- Deviations that stay within the threshold, or that are not statistically significant, suggest consistent relationships.
Report the Findings
- What to do: Document the comparison results and any identified discrepancies.
- Highlight variable pairs with significant correlation differences.
- Provide an explanation of the threshold and statistical tests used.
- Include recommendations for further data processing or adjustments if inconsistencies are found.
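A report table can be sketched with pandas, assuming the per-pair results are stored as a dictionary of dictionaries keyed by pair label. The values below are made up for illustration and are not results from any real dataset.

```python
import pandas as pd

# Hypothetical per-pair results; the numbers are illustrative only.
results = {
    "Pair (0, 1)": {
        "Train Correlation": 0.52,
        "Test Correlation": 0.21,
        "Difference": 0.31,
        "Threshold": 0.1,
        "Consistency": "inconsistent",
    },
    "Pair (0, 2)": {
        "Train Correlation": 0.10,
        "Test Correlation": 0.14,
        "Difference": 0.04,
        "Threshold": 0.1,
        "Consistency": "consistent",
    },
}

# One row per variable pair, largest difference first.
report = pd.DataFrame.from_dict(results, orient="index")
report = report.sort_values("Difference", ascending=False)

# Highlight the pairs that exceed the threshold.
flagged = report[report["Consistency"] == "inconsistent"]

print(report.to_string())
print(f"{len(flagged)} of {len(report)} pairs exceed the threshold.")
```

Sorting by difference puts the pairs that most need attention at the top of the report.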
Code Example
This Python function checks whether the pair-wise correlations between numerical input variables in the train and test sets are consistent, based on the threshold comparison, and provides an interpretation of the results.
import numpy as np
from scipy.stats import pearsonr, spearmanr


def correlation_consistency_test(X_train, X_test, threshold=0.1, method="pearson"):
    """
    Test whether pair-wise correlations between numerical input variables
    in the train and test sets are consistent.

    Parameters:
        X_train (np.ndarray): Train set features (2D array).
        X_test (np.ndarray): Test set features (2D array).
        threshold (float): Threshold for acceptable correlation differences (default=0.1).
        method (str): Correlation method, "pearson" or "spearman" (default="pearson").

    Returns:
        dict: Dictionary with test results and interpretation for each variable pair.
    """
    if method not in ["pearson", "spearman"]:
        raise ValueError("Method must be 'pearson' or 'spearman'.")

    results = {}
    n_features = X_train.shape[1]

    # Visit each unordered pair of feature columns once
    for i in range(n_features):
        for j in range(i + 1, n_features):
            # Calculate correlations
            if method == "pearson":
                corr_train, _ = pearsonr(X_train[:, i], X_train[:, j])
                corr_test, _ = pearsonr(X_test[:, i], X_test[:, j])
            elif method == "spearman":
                corr_train, _ = spearmanr(X_train[:, i], X_train[:, j])
                corr_test, _ = spearmanr(X_test[:, i], X_test[:, j])

            # Calculate absolute difference
            diff = abs(corr_train - corr_test)

            # Interpretation
            consistency = "consistent" if diff <= threshold else "inconsistent"
            interpretation = (
                f"Pair ({i}, {j}) has {consistency} correlations: "
                f"Train={corr_train:.2f}, Test={corr_test:.2f}, Diff={diff:.2f}."
            )

            # Store results
            results[f"Pair ({i}, {j})"] = {
                "Train Correlation": corr_train,
                "Test Correlation": corr_test,
                "Difference": diff,
                "Threshold": threshold,
                "Consistency": consistency,
                "Interpretation": interpretation,
            }

    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(size=(100, 4))
    X_test = X_train + np.random.normal(scale=0.1, size=(100, 4))  # Slightly perturbed

    # Perform the diagnostic test
    results = correlation_consistency_test(X_train, X_test, threshold=0.1, method="pearson")

    # Print results
    print("Correlation Consistency Test Results:")
    for pair, result in results.items():
        print(f"{pair}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Correlation Consistency Test Results:
Pair (0, 1):
- Train Correlation: -0.016389923345609084
- Test Correlation: -0.028775165433341975
- Difference: 0.012385242087732892
- Threshold: 0.1
- Consistency: consistent
- Interpretation: Pair (0, 1) has consistent correlations: Train=-0.02, Test=-0.03, Diff=0.01.
Pair (0, 2):
- Train Correlation: 0.01774451545446517
- Test Correlation: 0.03387829754582042
- Difference: 0.01613378209135525
- Threshold: 0.1
- Consistency: consistent
- Interpretation: Pair (0, 2) has consistent correlations: Train=0.02, Test=0.03, Diff=0.02.
Pair (0, 3):
- Train Correlation: -0.019994328273540824
- Test Correlation: 0.006216470266065449
- Difference: 0.026210798539606273
- Threshold: 0.1
- Consistency: consistent
- Interpretation: Pair (0, 3) has consistent correlations: Train=-0.02, Test=0.01, Diff=0.03.
Pair (1, 2):
- Train Correlation: -0.03366825983055228
- Test Correlation: -0.018357623787524795
- Difference: 0.015310636043027483
- Threshold: 0.1
- Consistency: consistent
- Interpretation: Pair (1, 2) has consistent correlations: Train=-0.03, Test=-0.02, Diff=0.02.
Pair (1, 3):
- Train Correlation: 0.136855702531469
- Test Correlation: 0.13430096961879714
- Difference: 0.0025547329126718588
- Threshold: 0.1
- Consistency: consistent
- Interpretation: Pair (1, 3) has consistent correlations: Train=0.14, Test=0.13, Diff=0.00.
Pair (2, 3):
- Train Correlation: -0.03806988827026863
- Test Correlation: -0.06319059547880582
- Difference: 0.025120707208537187
- Threshold: 0.1
- Consistency: consistent
- Interpretation: Pair (2, 3) has consistent correlations: Train=-0.04, Test=-0.06, Diff=0.03.