Compare Numerical Variance
Procedure
This procedure evaluates whether the numerical input and target variables in the train and test sets have the same variance.
Test for Normality
- What to do: Assess whether the data in the train and test sets is normally distributed.
- Use the Shapiro-Wilk test or Anderson-Darling test for smaller datasets.
- For larger datasets, use the Kolmogorov-Smirnov test.
- Visualize distributions with histograms or Q-Q plots for additional context.
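The normality check above can be sketched as follows, assuming SciPy is available; the helper name `check_normality` and the 5,000-sample cutoff between tests are illustrative choices, not part of the procedure:

```python
import numpy as np
from scipy.stats import shapiro, kstest

def check_normality(sample, alpha=0.05):
    """Return True if the sample is plausibly normal at the given alpha."""
    sample = np.asarray(sample)
    if len(sample) <= 5000:
        # Shapiro-Wilk suits small-to-moderate samples
        _, p = shapiro(sample)
    else:
        # For large samples, compare standardized data against a standard normal
        standardized = (sample - sample.mean()) / sample.std(ddof=1)
        _, p = kstest(standardized, "norm")
    return bool(p > alpha)

rng = np.random.default_rng(0)
print(check_normality(rng.normal(size=200)))       # roughly normal sample
print(check_normality(rng.exponential(size=200)))  # strongly skewed sample
```

Pairing the test with a histogram or Q-Q plot is still advisable, since formal tests become oversensitive on large samples.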
Choose the Appropriate Variance Test
- What to do: Select the appropriate test based on the normality of the data.
- If both sets are normally distributed, use an F-test or Bartlett’s test to compare variances.
- If either set is not normally distributed, use Levene’s test, which is robust to departures from normality (the F-test and Bartlett’s test both assume normal data).
- Ensure the chosen test aligns with the data’s characteristics.
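SciPy does not ship a two-sample variance F-test (`scipy.stats.f_oneway` is one-way ANOVA, not a variance comparison), so the F-test branch can be sketched directly from the F distribution; the function name `f_test_variances` is an illustrative choice:

```python
import numpy as np
from scipy.stats import f

def f_test_variances(a, b):
    """Two-sided F-test that two normal samples share a common variance."""
    a, b = np.asarray(a), np.asarray(b)
    stat = np.var(a, ddof=1) / np.var(b, ddof=1)  # ratio of sample variances
    df1, df2 = len(a) - 1, len(b) - 1
    # Two-sided p-value: double the smaller tail probability, capped at 1
    p = 2 * min(f.cdf(stat, df1, df2), f.sf(stat, df1, df2))
    return stat, min(p, 1.0)

rng = np.random.default_rng(0)
print(f_test_variances(rng.normal(size=80), rng.normal(size=80)))
print(f_test_variances(rng.normal(size=80), rng.normal(scale=2, size=80)))
```

Remember that this test is only trustworthy when both samples passed the normality check in the previous step.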
Perform the Variance Test
- What to do: Apply the selected test to the numerical variables in the train and test sets.
- Conduct the test independently for each input feature and the target variable.
- Record the test statistics and p-values for analysis.
Interpret the Results
- What to do: Analyze the p-values from the tests to determine if there is a significant difference in variances.
- A p-value less than the chosen significance level (e.g., 0.05) suggests a statistically significant difference in variances.
- A p-value greater than the significance level indicates no significant variance difference, suggesting the sets may be compatible.
Report the Findings
- What to do: Document and summarize the results of the variance tests.
- Clearly state which variables showed significant variance differences and which did not.
- Include a brief explanation of the test methodology and its implications for model training or preprocessing adjustments.
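A minimal sketch of the reporting step, assuming the per-feature result dictionaries produced by the `variance_test` function in the code example; the helper name `summarize_variance_results` is illustrative:

```python
def summarize_variance_results(results, alpha=0.05):
    """Render per-feature variance-test results as a plain-text report."""
    lines = [f"{'Variable':<15} {'Method':<16} {'p-value':<9} Different variance?"]
    for name, r in results.items():
        flag = "yes" if r["P-Value"] <= alpha else "no"
        lines.append(f"{name:<15} {r['Method']:<16} {r['P-Value']:<9.4f} {flag}")
    return "\n".join(lines)

example = {
    "Feature 0": {"Method": "Bartlett's Test", "P-Value": 0.0110},
    "Feature 2": {"Method": "Bartlett's Test", "P-Value": 0.1632},
}
print(summarize_variance_results(example))
```

Flagged features are candidates for preprocessing adjustments such as rescaling, or for revisiting how the train/test split was drawn.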
Code Example
This Python function checks whether the train and test sets have the same variance for numerical variables using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import levene, bartlett, shapiro

def variance_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same variance for numerical variables.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Check normality of both samples
        _, p_train_normal = shapiro(train_feature)
        _, p_test_normal = shapiro(test_feature)

        if p_train_normal > alpha and p_test_normal > alpha:
            # Both distributions look normal: Bartlett's test is appropriate
            stat, p_value = bartlett(train_feature, test_feature)
            method = "Bartlett's Test"
        else:
            # At least one distribution is non-normal: Levene's test is robust
            stat, p_value = levene(train_feature, test_feature)
            method = "Levene's Test"

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test variances differ significantly."
            if p_value <= alpha
            else "No significant difference in variances."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }
    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0, scale=1.2, size=(100, 5))  # Test set with different variance

    # Perform the diagnostic test
    results = variance_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Variance Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")
Example Output
Variance Test Results:
Feature 0:
- Method: Bartlett's Test
- P-Value: 0.011036107642909448
- Significance: significant
- Interpretation: Train and test variances differ significantly.
Feature 1:
- Method: Bartlett's Test
- P-Value: 0.03895357883463975
- Significance: significant
- Interpretation: Train and test variances differ significantly.
Feature 2:
- Method: Bartlett's Test
- P-Value: 0.16321431131321001
- Significance: not significant
- Interpretation: No significant difference in variances.
Feature 3:
- Method: Bartlett's Test
- P-Value: 0.045181731478819084
- Significance: significant
- Interpretation: Train and test variances differ significantly.
Feature 4:
- Method: Bartlett's Test
- P-Value: 0.35878313013231045
- Significance: not significant
- Interpretation: No significant difference in variances.