Compare Distribution of Missing Values
Procedure
This procedure evaluates whether the train and test sets have a similar pattern of missing values using statistical tests.
Calculate Missing Values
- What to do: Compute the count or percentage of missing values for each feature in both the train and test sets.
  - Use pandas’ .isnull().sum() or .isnull().mean() to summarize missing values.
  - Ensure that the missing value metrics are calculated feature-wise for consistency.
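As a minimal sketch of this step, the snippet below builds a feature-wise summary of missing counts and fractions for both sets. The DataFrames and their column names (age, income) are hypothetical toy data, not part of the original procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
train_df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
})
test_df = pd.DataFrame({
    "age": [22, 35, np.nan, 29],
    "income": [np.nan, np.nan, 48_000, 52_000],
})

# Feature-wise missing counts and fractions, side by side for both sets.
missing_summary = pd.DataFrame({
    "train_count": train_df.isnull().sum(),
    "test_count": test_df.isnull().sum(),
    "train_frac": train_df.isnull().mean(),
    "test_frac": test_df.isnull().mean(),
})
print(missing_summary)
```

Fractions are usually easier to compare than counts when the train and test sets differ in size.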
Test for Normality of Missing Value Distribution
- What to do: Assess whether the missing value counts or percentages follow a normal distribution in each dataset.
  - Apply the Shapiro-Wilk or Anderson-Darling test for smaller datasets.
  - Use the Kolmogorov-Smirnov test for larger datasets.
  - Visualize the distributions using histograms or Q-Q plots for clarity.
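One possible way to sketch this check is shown below. The helper name check_normality and the sample-size cutoff of 50 are assumptions made for illustration; the procedure itself does not define what counts as a "smaller" dataset. The synthetic missing rates are likewise made up.

```python
import numpy as np
from scipy.stats import kstest, shapiro


def check_normality(values, alpha=0.05, small_sample_cutoff=50):
    """Return (test_name, p_value, looks_normal) for a 1-D array of missing rates.

    small_sample_cutoff is an assumed threshold, not a fixed rule.
    """
    values = np.asarray(values, dtype=float)
    if values.size < small_sample_cutoff:
        name = "Shapiro-Wilk"
        p_value = shapiro(values).pvalue
    else:
        # KS test against a normal with the sample's own mean and std.
        # Estimating the parameters from the same data makes the p-value
        # approximate (it tends to be too large), so treat it as a rough guide.
        name = "Kolmogorov-Smirnov"
        p_value = kstest(
            values, "norm", args=(values.mean(), values.std(ddof=1))
        ).pvalue
    return name, float(p_value), bool(p_value > alpha)


# Synthetic per-feature missing rates, for illustration only.
rng = np.random.default_rng(0)
small_rates = rng.beta(2, 8, size=30)
large_rates = rng.beta(2, 8, size=200)

print(check_normality(small_rates))
print(check_normality(large_rates))
```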
Choose the Statistical Test
- What to do: Select the appropriate statistical test based on the normality of the distributions.
  - If both distributions are normal, use a t-test to compare means.
  - If at least one distribution is non-normal, apply the Mann-Whitney U test, a rank-based comparison of the two distributions (often read as a comparison of medians).
  - Ensure the test considers missing values feature-wise.
Perform the Statistical Test
- What to do: Conduct the chosen statistical test for each feature individually.
  - Record the p-value for each feature’s test result to identify significant differences.
  - If multiple features are tested, consider applying a correction method (e.g., Bonferroni) to control for false positives.
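The Bonferroni correction mentioned above can be sketched as follows: each raw p-value is multiplied by the number of tests and capped at 1. The helper name bonferroni_adjust and the raw p-values are hypothetical, chosen only to illustrate the arithmetic.

```python
import numpy as np


def bonferroni_adjust(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and a boolean reject mask."""
    p_values = np.asarray(p_values, dtype=float)
    n_tests = p_values.size
    # Multiply each p-value by the number of tests, capping at 1.0.
    adjusted = np.minimum(p_values * n_tests, 1.0)
    reject = adjusted <= alpha
    return adjusted, reject


# Made-up raw p-values for five features.
raw = [0.004, 0.03, 0.20, 0.65, 0.011]
adjusted, reject = bonferroni_adjust(raw)
print(adjusted)
print(reject)
```

With five tests, only the first feature (raw p-value 0.004, adjusted 0.02) remains significant at 0.05; a raw p-value of 0.011 becomes 0.055 and no longer passes.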
Interpret the Results
- What to do: Analyze the p-values to determine whether the patterns of missing values are statistically similar.
  - A p-value less than the significance level (e.g., 0.05) indicates a significant difference in the pattern of missing values for that feature.
  - Larger p-values suggest no significant difference, meaning the patterns of missing values may be similar between the datasets.
Report the Findings
- What to do: Summarize and document the results of the test in a clear format.
  - Highlight which features showed significant differences and their potential implications for model training.
  - Include details about the statistical tests used, the significance level, and any adjustments made for multiple comparisons.
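A report of this kind could be assembled as a small table, as in the sketch below. The results dictionary mimics the per-feature shape returned by missing_value_pattern_test in the Code Example section, but the p-values are made up, and the extra column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical results; the p-values are invented for illustration.
results = {
    "Feature 0": {"Method": "Mann-Whitney U Test", "P-Value": 0.004},
    "Feature 1": {"Method": "Mann-Whitney U Test", "P-Value": 0.30},
    "Feature 2": {"Method": "t-Test", "P-Value": 0.02},
}
alpha = 0.05

# One row per feature, with a Bonferroni-adjusted p-value and a decision.
report = pd.DataFrame.from_dict(results, orient="index")
report["Bonferroni P-Value"] = (report["P-Value"] * len(report)).clip(upper=1.0)
report["Significant"] = report["Bonferroni P-Value"] <= alpha
report = report.sort_values("P-Value")

print(f"alpha = {alpha}, correction = Bonferroni, tests = {len(report)}")
print(report.to_string())
```

Sorting by p-value puts the features most likely to differ at the top, which makes the significant ones easy to highlight.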
Code Example
This Python function checks whether the train and test sets have similar patterns of missing values using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, shapiro


def missing_value_pattern_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have similar patterns of missing values.

    Parameters:
        X_train (np.ndarray): Train set with missing value indicators (1 for missing, 0 for not missing).
        X_test (np.ndarray): Test set with missing value indicators (1 for missing, 0 for not missing).
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Check normality
        _, p_train_normal = shapiro(train_feature)
        _, p_test_normal = shapiro(test_feature)

        if p_train_normal > alpha and p_test_normal > alpha:
            # Both distributions are normal, use t-test
            stat, p_value = ttest_ind(train_feature, test_feature)
            method = "t-Test"
        else:
            # At least one distribution is non-normal, use Mann-Whitney U test
            stat, p_value = mannwhitneyu(train_feature, test_feature)
            method = "Mann-Whitney U Test"

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test missing value patterns differ significantly."
            if p_value <= alpha
            else "No significant difference in missing value patterns."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }
    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data for missing values (1 for missing, 0 for not missing)
    np.random.seed(42)
    X_train = np.random.choice([0, 1], size=(100, 5), p=[0.8, 0.2])  # Train set with 20% missing values
    X_test = np.random.choice([0, 1], size=(100, 5), p=[0.7, 0.3])  # Test set with 30% missing values

    # Perform the diagnostic test
    results = missing_value_pattern_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Missing Value Pattern Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Missing Value Pattern Test Results:
Feature 0:
 - Method: Mann-Whitney U Test
 - P-Value: 0.14572026413023997
 - Significance: not significant
 - Interpretation: No significant difference in missing value patterns.
Feature 1:
 - Method: Mann-Whitney U Test
 - P-Value: 0.32920599823022667
 - Significance: not significant
 - Interpretation: No significant difference in missing value patterns.
Feature 2:
 - Method: Mann-Whitney U Test
 - P-Value: 0.49689906228869174
 - Significance: not significant
 - Interpretation: No significant difference in missing value patterns.
Feature 3:
 - Method: Mann-Whitney U Test
 - P-Value: 0.6327495039059993
 - Significance: not significant
 - Interpretation: No significant difference in missing value patterns.
Feature 4:
 - Method: Mann-Whitney U Test
 - P-Value: 0.299600746146817
 - Significance: not significant
 - Interpretation: No significant difference in missing value patterns.