Compare Categorical Distributions

Do categorical input/target variables have the same distributions?

Procedure

This procedure evaluates whether the train and test sets have the same distributions for categorical input or target variables.

  1. Choose the Appropriate Statistical Test

    • What to do: Decide on the test based on the variable type and expected frequencies.
      • Use the Chi-Square test on a train-vs-test contingency table to compare categorical distributions.
      • When any cell's expected frequency falls below 5, the Chi-Square approximation becomes unreliable; for 2x2 tables, use Fisher’s Exact Test instead.
  2. Perform the Statistical Test

    • What to do: Apply the chosen test to compare the distributions.
      • Run the test on each categorical variable of interest separately.
      • Record the Chi-Square statistic (or odds ratio for Fisher’s Exact Test) and the corresponding p-values.
  3. Interpret the Results

    • What to do: Evaluate the p-values to determine if there is a statistically significant difference between the distributions.
      • A p-value less than the chosen significance level (e.g., 0.05) indicates a significant difference between train and test distributions.
      • Larger p-values suggest no significant difference, meaning the distributions are likely similar.
  4. Report the Findings

    • What to do: Summarize the test results and implications for your analysis.
      • Clearly list which categorical variables showed significant differences and which did not.
      • Include visualizations to support your conclusions and provide insights into any observed discrepancies.
      • Discuss how differences might impact the model and propose adjustments if necessary.
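The test-selection rule in step 1 can be sketched directly. This is a minimal example on a hypothetical 2x2 contingency table (the counts are made up for illustration); it checks the expected frequencies, which is what the rule of thumb refers to, rather than the observed counts:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table: rows = train/test, columns = two categories
table = np.array([[8, 2],
                  [5, 5]])

# chi2_contingency also returns the expected frequencies under the null
chi2_stat, chi2_p, dof, expected = chi2_contingency(table)

if np.any(expected < 5):            # rule of thumb: small expected counts
    _, p = fisher_exact(table)      # exact test, valid for 2x2 tables
    method = "Fisher's Exact Test"
else:
    p = chi2_p
    method = "Chi-Square Test"

print(method, p)
```

With these counts, the expected frequencies include cells below 5, so the sketch falls back to Fisher's Exact Test.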

Code Example

This Python function checks whether the train and test sets have the same distributions for categorical input or target variables using the Chi-Square test or Fisher’s Exact Test and provides an interpretation of the results.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact

def categorical_distribution_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same distributions for categorical variables.

    Parameters:
        X_train (pd.DataFrame): Train set with categorical variables.
        X_test (pd.DataFrame): Test set with categorical variables.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each categorical variable.
    """
    results = {}

    for column in X_train.columns:
        # Create contingency table
        train_counts = X_train[column].value_counts()
        test_counts = X_test[column].value_counts()
        categories = sorted(set(train_counts.index) | set(test_counts.index))

        contingency_table = np.array([
            [train_counts.get(cat, 0) for cat in categories],
            [test_counts.get(cat, 0) for cat in categories]
        ])

        # Choose the test: the Chi-Square approximation is unreliable when
        # expected (not observed) frequencies are small
        chi2_stat, chi2_p, _, expected = chi2_contingency(contingency_table)
        if np.any(expected < 5):  # small expected frequencies
            if contingency_table.shape == (2, 2):
                _, p_value = fisher_exact(contingency_table)
                method = "Fisher's Exact Test"
            else:
                p_value = None
                method = "Not Applicable (small expected counts, table not 2x2)"
        else:
            p_value = chi2_p
            method = "Chi-Square Test"

        # Interpret significance
        if p_value is not None:
            significance = "significant" if p_value <= alpha else "not significant"
            interpretation = (
                "Train and test distributions differ significantly."
                if p_value <= alpha
                else "No significant difference in distributions."
            )
        else:
            significance = "not tested"
            interpretation = "Fisher's Exact Test not applicable for non-2x2 tables."

        # Store results
        results[column] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }

    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = pd.DataFrame({
        "Category1": np.random.choice(["A", "B", "C"], size=100, p=[0.5, 0.3, 0.2]),
        "Category2": np.random.choice(["X", "Y"], size=100, p=[0.6, 0.4]),
    })

    X_test = pd.DataFrame({
        "Category1": np.random.choice(["A", "B", "C"], size=100, p=[0.4, 0.4, 0.2]),
        "Category2": np.random.choice(["X", "Y"], size=100, p=[0.7, 0.3]),
    })

    # Perform the diagnostic test
    results = categorical_distribution_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Categorical Distribution Test Results:")
    for column, result in results.items():
        print(f"{column}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")

Example Output

Categorical Distribution Test Results:
Category1:
  - Method: Chi-Square Test
  - P-Value: 0.12778404339578855
  - Significance: not significant
  - Interpretation: No significant difference in distributions.
Category2:
  - Method: Chi-Square Test
  - P-Value: 0.10294339894358191
  - Significance: not significant
  - Interpretation: No significant difference in distributions.
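The visualizations suggested in step 4 can be sketched by comparing normalized category proportions side by side. This is a minimal example reusing the same synthetic setup as the demo above (the series names are assumptions for illustration); the resulting frame feeds directly into a bar chart:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
# Synthetic train/test samples of one categorical variable
train = pd.Series(np.random.choice(["A", "B", "C"], size=100, p=[0.5, 0.3, 0.2]))
test = pd.Series(np.random.choice(["A", "B", "C"], size=100, p=[0.4, 0.4, 0.2]))

# Normalized counts make train/test proportions directly comparable
proportions = pd.DataFrame({
    "train": train.value_counts(normalize=True),
    "test": test.value_counts(normalize=True),
}).fillna(0)

print(proportions)
# A bar chart follows from the same frame, e.g.:
# proportions.plot(kind="bar", ylabel="Proportion", title="Category1: train vs test")
```

Plotting proportions rather than raw counts keeps the comparison fair even when the train and test sets differ in size.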