Compare Distribution of Missing Values

Diagnostics

Procedure

This procedure uses statistical tests to evaluate whether the train and test sets have similar patterns of missing values.

Calculate Missing Values
- What to do: Compute the count or percentage of missing values for each feature in both the train and test sets.
  - Use pandas’ .isnull().sum() or .isnull().mean() to summarize missing values.
  - Ensure that the missing value metrics are calculated feature-wise for consistency.
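
A minimal sketch of this step, assuming the train and test sets are pandas DataFrames with matching columns. The `missing_summary` helper and the toy data are hypothetical and only illustrate the feature-wise calculation:

```python
import numpy as np
import pandas as pd

def missing_summary(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return the per-feature fraction of missing values in train and test."""
    summary = pd.DataFrame({
        "train_missing": train.isnull().mean(),
        "test_missing": test.isnull().mean(),
    })
    # A positive difference means the feature is missing more often in test
    summary["difference"] = summary["test_missing"] - summary["train_missing"]
    return summary

# Toy data (illustrative only)
train = pd.DataFrame({"age": [25, np.nan, 40, 31],
                      "income": [50.0, 62.0, np.nan, np.nan]})
test = pd.DataFrame({"age": [np.nan, np.nan, 52, 47],
                     "income": [48.0, 51.0, 70.0, np.nan]})

summary = missing_summary(train, test)
print(summary)
```

Using `.isnull().mean()` rather than `.sum()` keeps the two sets comparable when they have different numbers of rows.
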
Test for Normality of Missing Value Distribution
- What to do: Assess whether the missing value counts or percentages follow a normal distribution in each dataset.
  - Apply the Shapiro-Wilk or Anderson-Darling test for smaller datasets.
  - Use the Kolmogorov-Smirnov test for larger datasets.
  - Visualize the distributions using histograms or Q-Q plots for clarity.
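
The normality check could be sketched as follows. The `looks_normal` helper, the `large_n` cutoff of 5000, and the choice to estimate the normal parameters from the sample for the Kolmogorov-Smirnov test are all assumptions made for illustration, not part of the procedure itself:

```python
import numpy as np
from scipy.stats import kstest, shapiro

def looks_normal(values, alpha=0.05, large_n=5000):
    """Return True if a normality test does not reject at level alpha."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    # Too few points or a constant sample: treat as non-normal
    if values.size < 3 or np.ptp(values) == 0:
        return False
    if values.size <= large_n:
        # Shapiro-Wilk for smaller samples
        _, p_value = shapiro(values)
    else:
        # Kolmogorov-Smirnov against a normal with the sample mean and std
        _, p_value = kstest(values, "norm",
                            args=(values.mean(), values.std(ddof=1)))
    return bool(p_value > alpha)

# Binary missing indicators are far from normal, so the check fails
indicators = [0] * 50 + [1] * 50
print(looks_normal(indicators))
```

Because the Kolmogorov-Smirnov parameters are estimated from the same sample, its p-value is only approximate; the plots mentioned above remain a useful cross-check.
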
Choose the Statistical Test
- What to do: Select the appropriate statistical test based on the normality of the distributions.
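  - If both distributions are normal, use a two-sample t-test to compare means.
  - If at least one distribution is non-normal, apply the Mann-Whitney U test, a rank-based test of whether values in one set tend to be larger than in the other (often described as a comparison of medians).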
  - Ensure the test considers missing values feature-wise.
Perform the Statistical Test
- What to do: Conduct the chosen statistical test for each feature individually.
  - Record the p-value for each feature’s test result to identify significant differences.
  - If multiple features are tested, consider applying a correction method (e.g., Bonferroni) to control for false positives.
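
The Bonferroni correction mentioned above multiplies each p-value by the number of tests and caps the result at 1. A short sketch, with a hypothetical `bonferroni` helper and made-up p-values:

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and a reject/keep flag per test."""
    p = np.asarray(p_values, dtype=float)
    # Multiply by the number of tests, capping at 1.0
    adjusted = np.minimum(p * p.size, 1.0)
    return adjusted, adjusted <= alpha

# Illustrative p-values for four features (not real results)
adjusted, reject = bonferroni([0.004, 0.03, 0.2, 0.6])
print(adjusted)
print(reject)
```

Here the second feature would be flagged at the unadjusted 0.05 level but not after correction, which is the kind of false positive the adjustment is meant to guard against.
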
Interpret the Results
- What to do: Analyze the p-values to determine whether the patterns of missing values are statistically similar.
  - A p-value less than the significance level (e.g., 0.05) indicates a significant difference in the pattern of missing values for that feature.
  - A p-value above the significance level means the test found no significant difference for that feature. This is consistent with similar missing value patterns in the two datasets but does not prove that they are identical.
Report the Findings
- What to do: Summarize and document the results of the test in a clear format.
  - Highlight which features showed significant differences and their potential implications for model training.
  - Include details about the statistical tests used, the significance level, and any adjustments made for multiple comparisons.
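
A report table could be assembled along these lines. This is a sketch: the `build_report` helper is hypothetical, the input dictionary mimics a per-feature structure with "Method" and "P-Value" keys, the p-values are made up, and Bonferroni is assumed as the adjustment:

```python
import pandas as pd

def build_report(results: dict, alpha: float = 0.05) -> pd.DataFrame:
    """Turn per-feature test results into a table sorted by p-value."""
    report = pd.DataFrame.from_dict(results, orient="index")
    report.index.name = "Feature"
    report["Alpha"] = alpha
    # Bonferroni adjustment across all tested features
    report["Adjusted P-Value"] = (report["P-Value"] * len(report)).clip(upper=1.0)
    report["Significant After Correction"] = report["Adjusted P-Value"] <= alpha
    return report.sort_values("P-Value")

# Illustrative input (not real results)
example_results = {
    "Feature 0": {"Method": "Mann-Whitney U Test", "P-Value": 0.30},
    "Feature 1": {"Method": "Mann-Whitney U Test", "P-Value": 0.004},
}
report = build_report(example_results)
print(report)
```

Sorting by p-value puts the features most likely to differ at the top, and the table records the method, significance level, and adjustment alongside each result.
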

Code Example

This Python function applies the tests described above, feature by feature, to binary missing value indicators from the train and test sets and returns an interpretation of each result.

import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, shapiro

def missing_value_pattern_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have similar patterns of missing values.

    Parameters:
        X_train (np.ndarray): Train set with missing value indicators (1 for missing, 0 for not missing).
        X_test (np.ndarray): Test set with missing value indicators (1 for missing, 0 for not missing).
        alpha (float): Significance threshold used for both the normality checks and the comparison test (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}

    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Check normality (binary indicators typically fail this check,
        # so the Mann-Whitney U branch is the usual path)
        _, p_train_normal = shapiro(train_feature)
        _, p_test_normal = shapiro(test_feature)

        if p_train_normal > alpha and p_test_normal > alpha:
            # Both distributions look normal, use a two-sample t-test
            stat, p_value = ttest_ind(train_feature, test_feature)
            method = "t-Test"
        else:
            # At least one distribution is non-normal, use the Mann-Whitney U test
            # (two-sided alternative stated explicitly)
            stat, p_value = mannwhitneyu(train_feature, test_feature, alternative="two-sided")
            method = "Mann-Whitney U Test"

        # Interpret significance
        is_significant = p_value <= alpha
        significance = "significant" if is_significant else "not significant"
        interpretation = (
            "Train and test missing value patterns differ significantly."
            if is_significant
            else "No significant difference in missing value patterns."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data for missing values (1 for missing, 0 for not missing)
    np.random.seed(42)
    X_train = np.random.choice([0, 1], size=(100, 5), p=[0.8, 0.2])  # Train set: each entry missing with probability 0.2
    X_test = np.random.choice([0, 1], size=(100, 5), p=[0.7, 0.3])  # Test set: each entry missing with probability 0.3

    # Perform the diagnostic test
    results = missing_value_pattern_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Missing Value Pattern Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Missing Value Pattern Test Results:
Feature 0:
  - Method: Mann-Whitney U Test
  - P-Value: 0.14572026413023997
  - Significance: not significant
  - Interpretation: No significant difference in missing value patterns.
Feature 1:
  - Method: Mann-Whitney U Test
  - P-Value: 0.32920599823022667
  - Significance: not significant
  - Interpretation: No significant difference in missing value patterns.
Feature 2:
  - Method: Mann-Whitney U Test
  - P-Value: 0.49689906228869174
  - Significance: not significant
  - Interpretation: No significant difference in missing value patterns.
Feature 3:
  - Method: Mann-Whitney U Test
  - P-Value: 0.6327495039059993
  - Significance: not significant
  - Interpretation: No significant difference in missing value patterns.
Feature 4:
  - Method: Mann-Whitney U Test
  - P-Value: 0.299600746146817
  - Significance: not significant
  - Interpretation: No significant difference in missing value patterns.

Compare Distribution of Univariate Outliers