Compare Numerical Distributions
Do numerical input/target variables have the same distributions?
Procedure
This procedure evaluates whether the numerical input and target variables in the train and test sets have the same distributions using statistical tests.
- Choose the Distribution Comparison Test
  - What to do: Select an appropriate test for comparing the distributions of the train and test sets.
  - Use the Kolmogorov-Smirnov (KS) test to compare the empirical cumulative distributions of the two samples.
  - Alternatively, use the Anderson-Darling (AD) k-sample test, which assesses whether the train and test sets are drawn from a common distribution and is more sensitive to differences in the tails.
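Both tests are available in `scipy.stats`. A minimal sketch of the choice, using synthetic arrays (the variable names and the 0.5 shift are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # e.g., a train-set feature
b = rng.normal(0.5, 1.0, 1000)  # the same feature in a shifted test set

# KS test: maximum distance between the two empirical CDFs
ks_stat, ks_p = ks_2samp(a, b)

# AD k-sample test: weights the tails more heavily; note that SciPy
# caps the reported significance level to the range [0.001, 0.25]
ad_stat, _, ad_sig = anderson_ksamp([a, b])

print(f"KS: stat={ks_stat:.3f}, p={ks_p:.2e}")
print(f"AD: stat={ad_stat:.3f}, significance={ad_sig:.4f}")
```

With a shift this large, both tests flag the difference; in practice the AD test tends to detect tail differences that the KS test misses.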
- Prepare the Data
  - What to do: Ensure the data is formatted consistently for the test.
  - Align numerical input features and target variables between the train and test sets.
  - Remove any missing or inconsistent data to ensure comparability.
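A minimal preparation sketch with NumPy arrays (the small example matrices are illustrative), dropping rows that contain missing values so the two samples are comparable:

```python
import numpy as np

# Illustrative splits with a few missing values
X_train = np.array([[0.1, 1.0], [np.nan, 2.0], [0.5, 3.0], [0.9, 4.0]])
X_test = np.array([[0.2, 1.5], [0.4, 2.5], [0.8, np.nan]])

# Keep only rows with no missing values
train_clean = X_train[~np.isnan(X_train).any(axis=1)]
test_clean = X_test[~np.isnan(X_test).any(axis=1)]

print(train_clean.shape, test_clean.shape)  # (3, 2) (2, 2)
```

If the splits are DataFrames with differing column orders, also reindex both to the same column list before converting to arrays.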
- Perform the Statistical Test
  - What to do: Apply the selected statistical test to each numerical feature and the target variable.
  - Use the KS test to compute the maximum difference between the cumulative distributions of each feature.
  - If using the AD test, compute the statistic to determine the overall similarity of the distributions.
- Interpret the Results
  - What to do: Analyze the p-values and test statistics to assess distribution differences.
  - A p-value at or below the significance level (e.g., 0.05) indicates that the distributions differ significantly.
  - A larger p-value means the test fails to reject the null hypothesis: the data are consistent with the train and test sets sharing a distribution, though this does not prove the distributions are identical.
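When many features are tested at once, some p-values fall below the threshold by chance alone. One common (though conservative) remedy is a Bonferroni adjustment; a sketch with illustrative per-feature p-values:

```python
# Illustrative per-feature KS p-values
p_values = {"x1": 0.004, "x2": 0.21, "x3": 0.048}
alpha = 0.05

# Bonferroni correction: divide the significance level by the number of tests
adjusted_alpha = alpha / len(p_values)

for name, p in p_values.items():
    verdict = "differ significantly" if p <= adjusted_alpha else "no significant difference"
    print(f"{name}: p={p:.3f} (threshold {adjusted_alpha:.4f}) -> {verdict}")
```

Here `x3` would be flagged at the unadjusted 0.05 level but not after correction, illustrating why the adjustment matters with multiple features.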
- Report the Findings
  - What to do: Summarize the results of the tests and their implications.
  - Identify which variables have significantly different distributions.
  - Document the test used, p-values, and conclusions for further analysis or preprocessing adjustments.
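A minimal reporting sketch, assuming a mapping of feature names to (test used, p-value) pairs; the names and values below are illustrative:

```python
# Illustrative per-feature results: (test name, p-value)
results = {
    "Feature 0": ("KS", 0.078),
    "Feature 2": ("KS", 0.036),
    "Feature 4": ("KS", 0.036),
}
alpha = 0.05

# Collect the features whose distributions differ significantly
flagged = [name for name, (_, p) in results.items() if p <= alpha]

print(f"Test used: KS; alpha={alpha}")
print(f"Features with significantly different distributions: {flagged}")
```

Such a summary feeds directly into preprocessing decisions, e.g. re-splitting the data or transforming the flagged features.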
Code Example
This Python function checks whether the train and test sets have the same distributions for numerical variables using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp


def distribution_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same distributions for numerical variables.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Apply the two-sample Kolmogorov-Smirnov test
        ks_stat, ks_p_value = ks_2samp(train_feature, test_feature)

        # Interpret significance
        ks_significance = "significant" if ks_p_value <= alpha else "not significant"
        ks_interpretation = (
            "Train and test distributions differ significantly."
            if ks_p_value <= alpha
            else "No significant difference in distributions."
        )

        # Optionally apply the Anderson-Darling k-sample test for additional validation.
        # Note: SciPy caps the reported significance level to the range [0.001, 0.25].
        try:
            ad_stat, ad_critical, ad_significance = anderson_ksamp([train_feature, test_feature])
            ad_interpretation = (
                "Train and test distributions differ significantly."
                if ad_significance < alpha
                else "No significant difference in distributions."
            )
        except ValueError:
            # Fall back gracefully if the AD test fails for small or unsuitable data
            ad_stat, ad_significance, ad_interpretation = None, None, "AD test not applicable."

        # Store results for this feature
        results[f"Feature {feature_idx}"] = {
            "KS Statistic": ks_stat,
            "KS P-Value": ks_p_value,
            "KS Significance": ks_significance,
            "KS Interpretation": ks_interpretation,
            "AD Statistic": ad_stat,
            "AD Significance Level": ad_significance,
            "AD Interpretation": ad_interpretation,
        }
    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = distribution_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Distribution Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")
Example Output
Distribution Test Results:
Feature 0:
- KS Statistic: 0.18
- KS P-Value: 0.07822115797841851
- KS Significance: not significant
- KS Interpretation: No significant difference in distributions.
- AD Statistic: 2.4790056068311164
- AD Significance Level: 0.03126256937448677
- AD Interpretation: Train and test distributions differ significantly.
Feature 1:
- KS Statistic: 0.11
- KS P-Value: 0.5830090612540064
- KS Significance: not significant
- KS Interpretation: No significant difference in distributions.
- AD Statistic: -0.035833989133559854
- AD Significance Level: 0.25
- AD Interpretation: No significant difference in distributions.
Feature 2:
- KS Statistic: 0.2
- KS P-Value: 0.03638428787491733
- KS Significance: significant
- KS Interpretation: Train and test distributions differ significantly.
- AD Statistic: 2.890999741719102
- AD Significance Level: 0.021483812474621156
- AD Interpretation: Train and test distributions differ significantly.
Feature 3:
- KS Statistic: 0.14
- KS P-Value: 0.2819416298082479
- KS Significance: not significant
- KS Interpretation: No significant difference in distributions.
- AD Statistic: 0.45447025859205664
- AD Significance Level: 0.21605536724325053
- AD Interpretation: No significant difference in distributions.
Feature 4:
- KS Statistic: 0.2
- KS P-Value: 0.03638428787491733
- KS Significance: significant
- KS Interpretation: Train and test distributions differ significantly.
- AD Statistic: 4.127615824944391
- AD Significance Level: 0.007231582407157736
- AD Interpretation: Train and test distributions differ significantly.