Compare Numerical Distributions
Do numerical input/target variables have the same distributions?
Procedure
This procedure evaluates whether the numerical input and target variables in the train and test sets have the same distributions using statistical tests.
- Choose the Distribution Comparison Test
  - What to do: Select an appropriate test for comparing the distributions of the train and test sets.
  - Use the Kolmogorov-Smirnov (KS) test to compare the empirical cumulative distributions of the two samples.
  - Alternatively, use the Anderson-Darling (AD) k-sample test, which assesses whether the train and test sets are drawn from a common distribution and is more sensitive to differences in the tails.
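Both tests are available in `scipy.stats`. A minimal sketch of the choice, using synthetic arrays (the variable names and the 0.5 shift are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # e.g., a train-set feature
b = rng.normal(0.5, 1.0, 1000)  # the same feature in a shifted test set

# KS test: maximum distance between the two empirical CDFs
ks_stat, ks_p = ks_2samp(a, b)

# AD k-sample test: weights the tails more heavily; note that SciPy
# caps the reported significance level to the range [0.001, 0.25]
ad_stat, _, ad_sig = anderson_ksamp([a, b])

print(f"KS: stat={ks_stat:.3f}, p={ks_p:.2e}")
print(f"AD: stat={ad_stat:.3f}, significance={ad_sig:.4f}")
```

With a shift this large, both tests flag the difference; in practice the AD test tends to detect tail differences that the KS test misses.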
- Prepare the Data
  - What to do: Ensure the data is formatted consistently for the test.
  - Align numerical input features and target variables between the train and test sets.
  - Remove any missing or inconsistent data to ensure comparability.
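A minimal preparation sketch with NumPy arrays (the small example matrices are illustrative), dropping rows that contain missing values so the two samples are comparable:

```python
import numpy as np

# Illustrative splits with a few missing values
X_train = np.array([[0.1, 1.0], [np.nan, 2.0], [0.5, 3.0], [0.9, 4.0]])
X_test = np.array([[0.2, 1.5], [0.4, 2.5], [0.8, np.nan]])

# Keep only rows with no missing values
train_clean = X_train[~np.isnan(X_train).any(axis=1)]
test_clean = X_test[~np.isnan(X_test).any(axis=1)]

print(train_clean.shape, test_clean.shape)  # (3, 2) (2, 2)
```

If the splits are DataFrames with differing column orders, also reindex both to the same column list before converting to arrays.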
- Perform the Statistical Test
  - What to do: Apply the selected statistical test to each numerical feature and the target variable.
  - Use the KS test to compute the maximum difference between the cumulative distributions of each feature.
  - If using the AD test, compute the statistic to determine the overall similarity of the distributions.
- Interpret the Results
  - What to do: Analyze the p-values and test statistics to assess distribution differences.
  - A p-value at or below the significance level (e.g., 0.05) indicates that the distributions differ significantly.
  - A larger p-value means the test fails to reject the null hypothesis: the data are consistent with the train and test sets sharing a distribution, though this does not prove the distributions are identical.
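When many features are tested at once, some p-values fall below the threshold by chance alone. One common (though conservative) remedy is a Bonferroni adjustment; a sketch with illustrative per-feature p-values:

```python
# Illustrative per-feature KS p-values
p_values = {"x1": 0.004, "x2": 0.21, "x3": 0.048}
alpha = 0.05

# Bonferroni correction: divide the significance level by the number of tests
adjusted_alpha = alpha / len(p_values)

for name, p in p_values.items():
    verdict = "differ significantly" if p <= adjusted_alpha else "no significant difference"
    print(f"{name}: p={p:.3f} (threshold {adjusted_alpha:.4f}) -> {verdict}")
```

Here `x3` would be flagged at the unadjusted 0.05 level but not after correction, illustrating why the adjustment matters with multiple features.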
- Report the Findings
  - What to do: Summarize the results of the tests and their implications.
  - Identify which variables have significantly different distributions.
  - Document the test used, p-values, and conclusions for further analysis or preprocessing adjustments.
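A minimal reporting sketch, assuming a mapping of feature names to (test used, p-value) pairs; the names and values below are illustrative:

```python
# Illustrative per-feature results: (test name, p-value)
results = {
    "Feature 0": ("KS", 0.078),
    "Feature 2": ("KS", 0.036),
    "Feature 4": ("KS", 0.036),
}
alpha = 0.05

# Collect the features whose distributions differ significantly
flagged = [name for name, (_, p) in results.items() if p <= alpha]

print(f"Test used: KS; alpha={alpha}")
print(f"Features with significantly different distributions: {flagged}")
```

Such a summary feeds directly into preprocessing decisions, e.g. re-splitting the data or transforming the flagged features.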
Code Example
This Python function checks whether the train and test sets have the same distributions for numerical variables using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp


def distribution_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same distributions for numerical variables.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Apply the two-sample Kolmogorov-Smirnov test
        ks_stat, ks_p_value = ks_2samp(train_feature, test_feature)

        # Interpret significance
        ks_significance = "significant" if ks_p_value <= alpha else "not significant"
        ks_interpretation = (
            "Train and test distributions differ significantly."
            if ks_p_value <= alpha
            else "No significant difference in distributions."
        )

        # Optionally apply the Anderson-Darling k-sample test for additional validation.
        # Note: SciPy caps the reported significance level to the range [0.001, 0.25].
        try:
            ad_stat, ad_critical, ad_significance = anderson_ksamp([train_feature, test_feature])
            ad_interpretation = (
                "Train and test distributions differ significantly."
                if ad_significance < alpha
                else "No significant difference in distributions."
            )
        except ValueError:
            # Fall back gracefully if the AD test fails for small or unsuitable data
            ad_stat, ad_significance, ad_interpretation = None, None, "AD test not applicable."

        # Store results for this feature
        results[f"Feature {feature_idx}"] = {
            "KS Statistic": ks_stat,
            "KS P-Value": ks_p_value,
            "KS Significance": ks_significance,
            "KS Interpretation": ks_interpretation,
            "AD Statistic": ad_stat,
            "AD Significance Level": ad_significance,
            "AD Interpretation": ad_interpretation,
        }
    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = distribution_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Distribution Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")
Example Output
Distribution Test Results:
Feature 0:
- KS Statistic: 0.18
- KS P-Value: 0.07822115797841851
- KS Significance: not significant
- KS Interpretation: No significant difference in distributions.
- AD Statistic: 2.4790056068311164
- AD Significance Level: 0.03126256937448677
- AD Interpretation: Train and test distributions differ significantly.
Feature 1:
- KS Statistic: 0.11
- KS P-Value: 0.5830090612540064
- KS Significance: not significant
- KS Interpretation: No significant difference in distributions.
- AD Statistic: -0.035833989133559854
- AD Significance Level: 0.25
- AD Interpretation: No significant difference in distributions.
Feature 2:
- KS Statistic: 0.2
- KS P-Value: 0.03638428787491733
- KS Significance: significant
- KS Interpretation: Train and test distributions differ significantly.
- AD Statistic: 2.890999741719102
- AD Significance Level: 0.021483812474621156
- AD Interpretation: Train and test distributions differ significantly.
Feature 3:
- KS Statistic: 0.14
- KS P-Value: 0.2819416298082479
- KS Significance: not significant
- KS Interpretation: No significant difference in distributions.
- AD Statistic: 0.45447025859205664
- AD Significance Level: 0.21605536724325053
- AD Interpretation: No significant difference in distributions.
Feature 4:
- KS Statistic: 0.2
- KS P-Value: 0.03638428787491733
- KS Significance: significant
- KS Interpretation: Train and test distributions differ significantly.
- AD Statistic: 4.127615824944391
- AD Significance Level: 0.007231582407157736
- AD Interpretation: Train and test distributions differ significantly.