Compare Distribution Divergence
Procedure
This procedure evaluates whether the train and test sets have statistically similar data distributions for numerical variables, using divergence metrics like KL-divergence or JS-divergence.
Visualize Data Distributions
- What to do: Compare the numerical variables in the train and test sets using visual methods.
- Create histograms or kernel density plots for each variable in both sets.
- Overlay the train and test distributions to visually assess alignment; a minimal plotting sketch follows this list.
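A minimal plotting sketch for this step, assuming matplotlib is installed and that the train and test features are NumPy arrays (here called X_train and X_test); the feature index and bin count are illustrative.

import numpy as np
import matplotlib.pyplot as plt

def overlay_distributions(X_train, X_test, feature_idx=0, bins=30):
    """Overlay train and test histograms for one numerical feature."""
    # Shared bin edges keep the two histograms directly comparable
    combined = np.concatenate([X_train[:, feature_idx], X_test[:, feature_idx]])
    bin_edges = np.histogram_bin_edges(combined, bins=bins)

    plt.hist(X_train[:, feature_idx], bins=bin_edges, alpha=0.5, density=True, label="train")
    plt.hist(X_test[:, feature_idx], bins=bin_edges, alpha=0.5, density=True, label="test")
    plt.xlabel(f"Feature {feature_idx}")
    plt.ylabel("Density")
    plt.legend()
    plt.show()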
Choose Divergence Metric
- What to do: Decide which divergence metric to use to quantify the difference in distributions.
- Use KL-divergence (Kullback-Leibler) when the test distribution is non-zero wherever the train distribution is non-zero; otherwise the KL-divergence becomes infinite.
- Use JS-divergence (Jensen-Shannon) for a more robust comparison: it is symmetric, bounded, and remains finite even when one distribution has zero values in some bins.
- Ensure both distributions are normalized to sum to 1 before applying these metrics; a short normalization sketch follows this list.
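A short sketch of the normalization point above, assuming the values of one variable from the train and test sets are NumPy arrays; histograms are built on shared bin edges and rescaled to sum to 1 so either divergence metric can be applied.

import numpy as np

def normalized_histograms(train_values, test_values, bins=30):
    """Return train/test histograms on shared bins, each normalized to sum to 1."""
    # Shared bin edges keep the two distributions on the same support
    bin_edges = np.histogram_bin_edges(np.concatenate([train_values, test_values]), bins=bins)
    p, _ = np.histogram(train_values, bins=bin_edges)
    q, _ = np.histogram(test_values, bins=bin_edges)

    # Turn raw counts into probability distributions
    p = p / p.sum()
    q = q / q.sum()
    return p, q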
Calculate Divergence
- What to do: Compute the chosen divergence metric for each numerical variable.
- For KL-divergence, calculate the sum of $ p(x) \cdot \log\left(\frac{p(x)}{q(x)}\right) $, where $ p(x) $ and $ q(x) $ are the train and test distributions, respectively.
- For JS-divergence, compute the mixture $ m(x) = \frac{1}{2}\left(p(x) + q(x)\right) $ and take the average of the KL-divergences from $ p $ and $ q $ to $ m $.
- Use available libraries like scipy or numpy for efficient computation; a minimal sketch of both formulas follows this list.
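A minimal sketch of both formulas for a single variable, assuming p and q are normalized histograms on shared bins (for example from the previous sketch); scipy.stats.entropy with two arguments computes the KL-divergence.

import numpy as np
from scipy.stats import entropy

def kl_divergence(p, q):
    """KL(P || Q) = sum over x of p(x) * log(p(x) / q(x))."""
    return entropy(p, q)

def js_divergence(p, q):
    """JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M the pointwise mean of P and Q."""
    m = 0.5 * (p + q)
    return 0.5 * (entropy(p, m) + entropy(q, m))

# Example with two small, hypothetical normalized histograms
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
print(kl_divergence(p, q), js_divergence(p, q))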
Interpret the Divergence Values
- What to do: Evaluate the divergence scores to assess similarity between train and test distributions.
- A KL-divergence or JS-divergence value close to 0 indicates highly similar distributions.
- Higher values suggest significant differences, which may impact model performance if the variable is critical; a small screening sketch follows this list.
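A small screening sketch, assuming a dictionary of per-feature divergence scores; the 0.1 cutoff mirrors the threshold used in the code example below and is illustrative, not a universal rule.

def flag_divergent_features(divergences, threshold=0.1):
    """Return the features whose divergence meets or exceeds an (illustrative) threshold."""
    return {name: score for name, score in divergences.items() if score >= threshold}

# Hypothetical per-feature JS-divergence scores
scores = {"age": 0.02, "income": 0.18, "tenure": 0.07}
print(flag_divergent_features(scores))  # {'income': 0.18}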
Report the Findings
- What to do: Summarize the results and implications of the divergence analysis.
- Highlight variables with high divergence and recommend actions (e.g., rebalancing or feature engineering).
- Include both quantitative divergence scores and supporting visualizations to convey findings clearly; a small tabulation sketch follows this list.
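One possible way to tabulate the findings for a report, assuming pandas is available and a dictionary of per-feature divergence scores; the column names and flag rule are illustrative.

import pandas as pd

def divergence_report(divergences, threshold=0.1):
    """Summarize per-feature divergence scores, highest first, with a simple flag column."""
    report = pd.DataFrame(
        {"feature": list(divergences.keys()), "divergence": list(divergences.values())}
    )
    report["flagged"] = report["divergence"] >= threshold
    return report.sort_values("divergence", ascending=False).reset_index(drop=True)

# Hypothetical scores
print(divergence_report({"age": 0.02, "income": 0.18, "tenure": 0.07}))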
Code Example
This Python function evaluates whether the train and test sets have similar data distributions for numerical variables using divergence metrics like KL-divergence and JS-divergence.
import numpy as np
from scipy.stats import entropy

def distribution_divergence_test(X_train, X_test, method='JS', bins=30):
    """
    Compute the divergence between the train and test sets for numerical features using KL or JS divergence.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        method (str): Divergence metric to use ('KL' for KL-divergence, 'JS' for JS-divergence).
        bins (int): Number of bins for histogram approximation.

    Returns:
        dict: Dictionary with divergence results and interpretation for each feature.
    """
    def js_divergence(p, q):
        """Compute Jensen-Shannon divergence as the average KL-divergence to the mixture distribution."""
        m = 0.5 * (p + q)
        return 0.5 * (entropy(p, m) + entropy(q, m))

    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Approximate the probability distributions with histograms on shared bin edges
        train_hist, bin_edges = np.histogram(train_feature, bins=bins, density=True)
        test_hist, _ = np.histogram(test_feature, bins=bin_edges, density=True)

        # Normalize to sum to 1
        train_hist = train_hist / train_hist.sum()
        test_hist = test_hist / test_hist.sum()

        # Compute divergence
        # Note: KL-divergence is infinite if the test histogram has empty bins where
        # the train histogram is non-zero; JS-divergence remains finite in that case.
        if method == 'KL':
            divergence = entropy(train_hist, test_hist)
        elif method == 'JS':
            divergence = js_divergence(train_hist, test_hist)
        else:
            raise ValueError("Unsupported method. Use 'KL' or 'JS'.")

        # Interpret divergence
        interpretation = (
            "Train and test distributions are similar."
            if divergence < 0.1
            else "Train and test distributions differ significantly."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Divergence": divergence,
            "Interpretation": interpretation
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = distribution_divergence_test(X_train, X_test, method='JS', bins=30)

    # Print results
    print("Distribution Divergence Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")
Example Output
Distribution Divergence Test Results:
Feature 0:
  - Divergence: 0.11693541778494154
  - Interpretation: Train and test distributions differ significantly.
Feature 1:
  - Divergence: 0.08353389212775253
  - Interpretation: Train and test distributions are similar.
Feature 2:
  - Divergence: 0.07464652530370389
  - Interpretation: Train and test distributions are similar.
Feature 3:
  - Divergence: 0.05511251555181266
  - Interpretation: Train and test distributions are similar.
Feature 4:
  - Divergence: 0.08911212241602645
  - Interpretation: Train and test distributions are similar.