Compare Distribution Divergence

Procedure

This procedure evaluates whether the train and test sets have statistically similar data distributions for numerical variables, using divergence metrics like KL-divergence or JS-divergence.

  1. Visualize Data Distributions

    • What to do: Compare the numerical variables in the train and test sets using visual methods.
      • Create histograms or kernel density plots for each variable in both sets.
      • Overlay the train and test distributions to visually assess alignment (see the plotting sketch after this list).
  2. Choose Divergence Metric

    • What to do: Decide which divergence metric to use to quantify the difference in distributions.
      • Use KL-divergence (Kullback-Leibler) when the test distribution is non-zero wherever the train distribution is non-zero; note that it is asymmetric and unbounded.
      • Use JS-divergence (Jensen-Shannon) for a more robust comparison: it is symmetric, bounded, and remains finite even when one distribution has zero values in some bins.
      • Ensure both distributions are normalized to sum to 1 before applying these metrics.
  3. Calculate Divergence

    • What to do: Compute the chosen divergence metric for each numerical variable.
      • For KL-divergence, calculate $ D_{KL}(p \| q) = \sum_x p(x) \cdot \log\left(\frac{p(x)}{q(x)}\right) $, where $ p(x) $ and $ q(x) $ are the train and test distributions, respectively.
      • For JS-divergence, form the mixture $ m(x) = \frac{1}{2}\left(p(x) + q(x)\right) $ and average the KL-divergences of each distribution from it: $ \mathrm{JS}(p, q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m) $.
      • Use available libraries like scipy or numpy for efficient computation (a small worked example follows this list).
  4. Interpret the Divergence Values

    • What to do: Evaluate the divergence scores to assess similarity between train and test distributions.
      • A KL-divergence or JS-divergence value close to 0 indicates highly similar distributions.
      • Higher values suggest meaningful differences, which may impact model performance if the variable is important; note that JS-divergence (with the natural logarithm) is bounded above by $ \ln 2 \approx 0.693 $, while KL-divergence is unbounded.
  5. Report the Findings

    • What to do: Summarize the results and implications of the divergence analysis.
      • Highlight variables with high divergence and recommend actions (e.g., rebalancing or feature engineering).
      • Include both quantitative divergence scores and supporting visualizations to convey findings clearly.
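
A minimal plotting sketch for step 1, assuming matplotlib is available and that X_train and X_test are NumPy arrays holding the same numerical columns (the helper name overlay_distributions is illustrative):

import numpy as np
import matplotlib.pyplot as plt

def overlay_distributions(X_train, X_test, feature_names=None, bins=30):
    """Overlay train and test histograms for each numerical feature."""
    n_features = X_train.shape[1]
    names = feature_names or [f"Feature {i}" for i in range(n_features)]
    fig, axes = plt.subplots(1, n_features, figsize=(4 * n_features, 3))
    axes = np.atleast_1d(axes)
    for i, ax in enumerate(axes):
        # Shared bin edges so the two histograms are directly comparable
        edges = np.histogram_bin_edges(
            np.concatenate([X_train[:, i], X_test[:, i]]), bins=bins
        )
        ax.hist(X_train[:, i], bins=edges, alpha=0.5, density=True, label="train")
        ax.hist(X_test[:, i], bins=edges, alpha=0.5, density=True, label="test")
        ax.set_title(names[i])
        ax.legend()
    plt.tight_layout()
    plt.show()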
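
A small worked example of both formulas on a toy three-bin distribution, using scipy.stats.entropy (which uses the natural logarithm); the probabilities are illustrative only:

import numpy as np
from scipy.stats import entropy

# Toy binned distributions, already normalized to sum to 1
p = np.array([0.5, 0.3, 0.2])  # train
q = np.array([0.4, 0.4, 0.2])  # test

# KL-divergence: sum of p(x) * log(p(x) / q(x))
kl = entropy(p, q)

# JS-divergence: average KL-divergence of p and q from their mixture m
m = 0.5 * (p + q)
js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

print(f"KL(p || q) = {kl:.4f}")
print(f"JS(p, q)   = {js:.4f}")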

Code Example

This Python function evaluates whether the train and test sets have similar data distributions for numerical variables using divergence metrics like KL-divergence and JS-divergence.

import numpy as np
from scipy.stats import entropy

def distribution_divergence_test(X_train, X_test, method='JS', bins=30):
    """
    Compute the divergence between the train and test sets for numerical features using KL or JS divergence.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        method (str): Divergence metric to use ('KL' for KL-divergence, 'JS' for JS-divergence).
        bins (int): Number of bins for histogram approximation.

    Returns:
        dict: Dictionary with divergence results and interpretation for each feature.
    """
    def js_divergence(p, q):
        """Compute Jensen-Shannon Divergence."""
        m = 0.5 * (p + q)
        return 0.5 * (entropy(p, m) + entropy(q, m))

    results = {}

    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Compute histograms for probability distributions
        train_hist, bin_edges = np.histogram(train_feature, bins=bins, density=True)
        test_hist, _ = np.histogram(test_feature, bins=bin_edges, density=True)

        # Normalize to sum to 1
        train_hist = train_hist / train_hist.sum()
        test_hist = test_hist / test_hist.sum()

        # Compute divergence
        if method == 'KL':
            # KL-divergence is infinite if any test bin is empty where the train bin is not
            divergence = entropy(train_hist, test_hist)
        elif method == 'JS':
            divergence = js_divergence(train_hist, test_hist)
        else:
            raise ValueError("Unsupported method. Use 'KL' or 'JS'.")

        # Interpret divergence against a heuristic threshold of 0.1
        # (JS with natural log is bounded by ln(2) ≈ 0.693; KL is unbounded)
        interpretation = (
            "Train and test distributions are similar."
            if divergence < 0.1
            else "Train and test distributions differ significantly."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Divergence": divergence,
            "Interpretation": interpretation
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = distribution_divergence_test(X_train, X_test, method='JS', bins=30)

    # Print results
    print("Distribution Divergence Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Distribution Divergence Test Results:
Feature 0:
  - Divergence: 0.11693541778494154
  - Interpretation: Train and test distributions differ significantly.
Feature 1:
  - Divergence: 0.08353389212775253
  - Interpretation: Train and test distributions are similar.
Feature 2:
  - Divergence: 0.07464652530370389
  - Interpretation: Train and test distributions are similar.
Feature 3:
  - Divergence: 0.05511251555181266
  - Interpretation: Train and test distributions are similar.
Feature 4:
  - Divergence: 0.08911212241602645
  - Interpretation: Train and test distributions are similar.
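
As a cross-check of the js_divergence helper above, and assuming SciPy 1.2 or later is available, scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance (the square root of the JS-divergence, natural logarithm by default), so squaring it should reproduce the Feature 0 value reported above:

import numpy as np
from scipy.spatial.distance import jensenshannon

# Re-create the synthetic data from the demo
np.random.seed(42)
X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))

# Same binning as distribution_divergence_test, for feature 0
train_hist, bin_edges = np.histogram(X_train[:, 0], bins=30, density=True)
test_hist, _ = np.histogram(X_test[:, 0], bins=bin_edges, density=True)

# jensenshannon returns a distance; square it to obtain the divergence
js_div = jensenshannon(train_hist, test_hist) ** 2
print(f"Feature 0 JS divergence (library cross-check): {js_div:.4f}")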