Split Sensitivity Analysis By Data Correlation

Procedure

  1. Define Split Ratios:

    • Select various train/test split ratios to evaluate, such as 70/30, 80/20, and 90/10.
  2. Random Splitting:

    • For each split ratio, generate multiple random train/test splits to ensure robustness in the results.
  3. Correlation Computation:

    • Compute correlation coefficients (e.g., Pearson’s for linear relationships or Spearman’s for monotonic relationships) for the numerical variables in the train set and in the test set of each split.
    • For example:
      • Correlation between each feature and the numerical target variable
      • Correlation between each variable and every other variable within each set
      • Correlation of each variable between the train and test sets (note: the two samples must be the same size)
  4. Correlation Comparison:

    • For each variable pair, compare the correlations between the train and test splits using a defined similarity threshold (e.g., a difference of less than 0.05 in the correlation values).
    • Calculate the percentage of variable pairs where the correlation similarity is maintained within this threshold.
  5. Assessment of Split Quality:

    • Evaluate each split ratio based on the consistency of correlations:
      • Identify the percentage of variable pairs where correlation values remain similar across train and test splits.
      • Summarize the results across all trials for each split ratio to determine the most consistent split.
  6. Reporting Results:

    • Report the split ratio with the highest percentage of consistent correlations across trials as the optimal split.
    • Include visualizations, if necessary, to demonstrate correlation consistency across splits.
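
Steps 2 to 4 can be sketched for the variable-to-variable case as follows. This is a minimal sketch rather than a definitive implementation: the function name `pairwise_correlation_consistency`, its parameters, and the choice of scikit-learn's `ShuffleSplit` with a fixed `random_state` to generate the repeated random splits are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit


def pairwise_correlation_consistency(df, train_size, num_trials=30,
                                     threshold=0.05, method="pearson", seed=0):
    """Return, per trial, the percentage of variable pairs whose train and
    test correlations differ by at most `threshold`.

    Hypothetical helper: name and signature are illustrative assumptions.
    """
    # ShuffleSplit yields `num_trials` independent random splits; the test
    # size defaults to the complement of `train_size`.
    splitter = ShuffleSplit(n_splits=num_trials, train_size=train_size,
                            random_state=seed)
    # Strict upper triangle: every unordered variable pair exactly once.
    rows, cols = np.triu_indices(df.shape[1], k=1)

    percentages = []
    for train_idx, test_idx in splitter.split(df):
        train_corr = df.iloc[train_idx].corr(method=method).to_numpy()
        test_corr = df.iloc[test_idx].corr(method=method).to_numpy()
        diff = np.abs(train_corr - test_corr)[rows, cols]
        percentages.append(100.0 * np.mean(diff <= threshold))
    return np.array(percentages)


if __name__ == "__main__":
    # Synthetic data: three noisy copies of one shared signal (illustrative only).
    rng = np.random.default_rng(0)
    base = rng.normal(size=(500, 1))
    data = pd.DataFrame(
        np.hstack([base + rng.normal(scale=s, size=(500, 1)) for s in (0.5, 1.0, 2.0)]),
        columns=["a", "b", "c"],
    )
    for train_size in (0.7, 0.8, 0.9):
        pct = pairwise_correlation_consistency(data, train_size)
        print(f"train_size={train_size}: mean {pct.mean():.1f}% of pairs consistent")
```

Passing `method="spearman"` switches to rank correlation for variables whose relationship is monotonic but not linear. Fixing `seed` makes the trials repeatable, which helps when comparing split ratios across runs.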

Code Example

Below is a Python function that evaluates train/test split ratios by how consistent the correlation between each numerical input variable and the target variable remains between the train and test sets across repeated random splits.

import numpy as np
from scipy.stats import pearsonr, spearmanr, shapiro
from sklearn.model_selection import train_test_split

def correlation_sensitivity_analysis(X, y, split_ratios, num_trials=30, correlation_threshold=0.05, alpha=0.05):
    """
    Perform sensitivity analysis on train/test split ratios to evaluate the consistency of numerical variable correlations.

    Parameters:
        X (np.ndarray): Feature matrix.
        y (np.ndarray): Target array.
        split_ratios (list of tuples): List of train/test split ratios (e.g., [(0.5, 0.5), (0.8, 0.2)]).
        num_trials (int): Number of trials for each ratio.
        correlation_threshold (float): Threshold for acceptable correlation differences.
        alpha (float): Significance level for normality tests.

    Returns:
        dict: Summary of correlation consistency results for each split ratio and each variable.
    """
    results = {}
    num_features = X.shape[1]

    # Run a Shapiro-Wilk normality test on each feature (once, on the full dataset)
    normality = {}
    for i in range(num_features):
        p_value = shapiro(X[:, i])[1]
        normality[i] = p_value > alpha  # True if normality is not rejected at level alpha

    for ratio in split_ratios:
        train_size, test_size = ratio
        ratio_results = {}

        for feature_idx in range(num_features):
            feature_consistency = []
            for _ in range(num_trials):
                # Perform random split
                X_train, X_test, y_train, y_test = train_test_split(
                    X, y, train_size=train_size, test_size=test_size
                )

                # Choose the appropriate correlation function
                if normality[feature_idx]:
                    corr_func = pearsonr
                else:
                    corr_func = spearmanr

                # Compute correlations for the feature with the target
                train_corr, _ = corr_func(X_train[:, feature_idx], y_train)
                test_corr, _ = corr_func(X_test[:, feature_idx], y_test)

                # Check if correlations are within the acceptable threshold
                consistent = abs(train_corr - test_corr) <= correlation_threshold
                feature_consistency.append(consistent)

            # Calculate percentage of consistent trials for the feature
            ratio_results[f"Feature_{feature_idx}"] = {
                "Normal": normality[feature_idx],
                "Consistency (%)": np.mean(feature_consistency) * 100,
            }

        # Store results for the split ratio
        results[ratio] = ratio_results

    return results


# Example usage of the function
if __name__ == "__main__":
    from sklearn.datasets import make_regression

    # Generate a synthetic regression dataset
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Define split ratios to test
    split_ratios = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]

    # Perform the sensitivity analysis
    results = correlation_sensitivity_analysis(
        X, y, split_ratios, num_trials=30, correlation_threshold=0.05, alpha=0.05
    )

    # Display the results
    for ratio, features in results.items():
        print(f"Split Ratio {ratio}:")
        for feature, metrics in features.items():
            print(
                f"  - {feature}: (normal: {metrics['Normal']}) {metrics['Consistency (%)']:.2f}% consistent trials"
            )
        print("-" * 40)

Example Output

Split Ratio (0.5, 0.5):
  - Feature_0: (normal: True) 60.00% consistent trials
  - Feature_1: (normal: True) 90.00% consistent trials
  - Feature_2: (normal: True) 66.67% consistent trials
  - Feature_3: (normal: True) 63.33% consistent trials
  - Feature_4: (normal: True) 66.67% consistent trials
----------------------------------------
Split Ratio (0.7, 0.3):
  - Feature_0: (normal: True) 50.00% consistent trials
  - Feature_1: (normal: True) 76.67% consistent trials
  - Feature_2: (normal: True) 56.67% consistent trials
  - Feature_3: (normal: True) 46.67% consistent trials
  - Feature_4: (normal: True) 53.33% consistent trials
----------------------------------------
Split Ratio (0.8, 0.2):
  - Feature_0: (normal: True) 60.00% consistent trials
  - Feature_1: (normal: True) 86.67% consistent trials
  - Feature_2: (normal: True) 66.67% consistent trials
  - Feature_3: (normal: True) 50.00% consistent trials
  - Feature_4: (normal: True) 53.33% consistent trials
----------------------------------------
Split Ratio (0.9, 0.1):
  - Feature_0: (normal: True) 40.00% consistent trials
  - Feature_1: (normal: True) 80.00% consistent trials
  - Feature_2: (normal: True) 40.00% consistent trials
  - Feature_3: (normal: True) 30.00% consistent trials
  - Feature_4: (normal: True) 46.67% consistent trials
----------------------------------------
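
Steps 5 and 6 of the procedure (summarizing across trials and reporting the most consistent ratio) can be sketched as below. The helper names `summarize_by_ratio` and `most_consistent_ratio` are hypothetical, and averaging the per-feature percentages with equal weight is an assumption; the input reuses the figures from the example output above, arranged in the same dictionary shape that `correlation_sensitivity_analysis` returns.

```python
import numpy as np


def summarize_by_ratio(results):
    """Average the per-feature consistency percentages for each split ratio.

    Hypothetical helper; equal weighting of features is an assumption.
    """
    return {
        ratio: float(np.mean([m["Consistency (%)"] for m in features.values()]))
        for ratio, features in results.items()
    }


def most_consistent_ratio(results):
    """Return the split ratio with the highest mean consistency."""
    summary = summarize_by_ratio(results)
    return max(summary, key=summary.get)


# Per-feature percentages copied from the example output above.
example_output = {
    (0.5, 0.5): [60.00, 90.00, 66.67, 63.33, 66.67],
    (0.7, 0.3): [50.00, 76.67, 56.67, 46.67, 53.33],
    (0.8, 0.2): [60.00, 86.67, 66.67, 50.00, 53.33],
    (0.9, 0.1): [40.00, 80.00, 40.00, 30.00, 46.67],
}
results = {
    ratio: {
        f"Feature_{i}": {"Normal": True, "Consistency (%)": value}
        for i, value in enumerate(values)
    }
    for ratio, values in example_output.items()
}

for ratio, mean_pct in summarize_by_ratio(results).items():
    print(f"{ratio}: {mean_pct:.2f}% mean consistency")
print("Most consistent ratio:", most_consistent_ratio(results))
```

For these figures the sketch selects (0.5, 0.5), with a mean consistency of about 69.33%, against about 47.33% for (0.9, 0.1). Because the function above calls `train_test_split` without a `random_state`, the percentages change from run to run, so a ranking based on 30 trials should be read with some caution. A smaller test set also gives a noisier correlation estimate, which may partly explain why a fixed threshold favors the more balanced splits here.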