Compare Categorical Distributions
Do categorical input/target variables have the same distributions?
Procedure
This procedure evaluates whether the train and test sets have the same distributions for categorical input or target variables.
- Choose the Appropriate Statistical Test
  - What to do: Decide on the test based on the variable type and the expected cell frequencies.
  - Use the Chi-Square test of independence to compare categorical distributions.
  - For smaller sample sizes, or when any expected frequency falls below 5, use Fisher's Exact Test instead.
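As a sketch of this decision rule, the expected cell frequencies can be obtained directly from `scipy.stats.chi2_contingency`, which returns them as its fourth value (the 2x2 table below is purely illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 contingency table: rows = train/test, columns = categories
table = np.array([[8, 2],
                  [1, 9]])

# chi2_contingency also returns the expected frequencies under independence
_, _, _, expected = chi2_contingency(table)

# Rule of thumb: if any expected frequency is below 5, prefer Fisher's Exact Test
use_fisher = bool(np.any(expected < 5))
print("Expected frequencies:\n", expected)
print("Use Fisher's Exact Test:", use_fisher)
```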
- Perform the Statistical Test
  - What to do: Apply the chosen test to compare the distributions.
  - Run the test on each categorical variable of interest separately.
  - Record the Chi-Square statistic (or the test statistic for Fisher's Exact Test) and the corresponding p-value.
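A minimal sketch of this step for a single variable, assuming two hypothetical pandas Series of category labels (`train_col` and `test_col` are made-up names). `pd.crosstab` builds the 2 x k contingency table from the two samples:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical category labels for one variable in each split
train_col = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
test_col = pd.Series(["A"] * 40 + ["B"] * 40 + ["C"] * 20)

# Build the 2 x k contingency table: one row per split, one column per category
labels = pd.Series(["train"] * len(train_col) + ["test"] * len(test_col))
values = pd.concat([train_col, test_col], ignore_index=True)
table = pd.crosstab(labels, values)

# Record the Chi-Square statistic and p-value for this variable
chi2_stat, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2_stat:.3f}, dof={dof}, p={p_value:.3f}")
```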
- Interpret the Results
  - What to do: Evaluate the p-values to determine whether there is a statistically significant difference between the distributions.
  - A p-value below the chosen significance level (e.g., 0.05) indicates a significant difference between the train and test distributions.
  - A larger p-value means no difference was detected; this suggests the distributions are similar, though it does not prove they are identical.
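The interpretation step reduces to comparing each p-value against the significance level. A short sketch, using made-up p-values standing in for the results of the previous step:

```python
alpha = 0.05  # chosen significance level

# Hypothetical p-values from the previous step, one per variable
p_values = {"Category1": 0.128, "Category2": 0.012}

# Compare each p-value against alpha
verdicts = {
    name: "significant" if p < alpha else "not significant"
    for name, p in p_values.items()
}

for name, p in p_values.items():
    print(f"{name}: p={p:.3f} -> {verdicts[name]}")
```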
- Report the Findings
  - What to do: Summarize the test results and their implications for your analysis.
  - Clearly list which categorical variables showed significant differences and which did not.
  - Include visualizations (e.g., side-by-side bar charts of category proportions) to support your conclusions and provide insight into any observed discrepancies.
  - Discuss how the differences might affect the model and propose adjustments if necessary.
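One way to sketch the visualization step, assuming synthetic data: build a table of normalized category proportions for both splits (normalizing makes the two sample sizes comparable), which can then be passed straight to pandas plotting:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split for one categorical variable
rng = np.random.default_rng(0)
train = pd.Series(rng.choice(["A", "B", "C"], size=200, p=[0.5, 0.3, 0.2]))
test = pd.Series(rng.choice(["A", "B", "C"], size=100, p=[0.4, 0.4, 0.2]))

# Side-by-side category proportions; fillna covers categories missing in one split
proportions = pd.DataFrame({
    "train": train.value_counts(normalize=True),
    "test": test.value_counts(normalize=True),
}).fillna(0.0).sort_index()

print(proportions)
# proportions.plot(kind="bar", title="Category proportions: train vs test")
```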
Code Example
This Python function checks whether the train and test sets have the same distributions for categorical input or target variables using the Chi-Square test or Fisher’s Exact Test and provides an interpretation of the results.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact


def categorical_distribution_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same distributions
    for categorical variables.

    Parameters:
        X_train (pd.DataFrame): Train set with categorical variables.
        X_test (pd.DataFrame): Test set with categorical variables.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each categorical variable.
    """
    results = {}

    for column in X_train.columns:
        # Build the 2 x k contingency table over the union of categories
        # (sorted so the column order is deterministic)
        train_counts = X_train[column].value_counts()
        test_counts = X_test[column].value_counts()
        categories = sorted(set(train_counts.index).union(test_counts.index))
        contingency_table = np.array([
            [train_counts.get(cat, 0) for cat in categories],
            [test_counts.get(cat, 0) for cat in categories]
        ])

        # The Chi-Square approximation relies on the *expected* frequencies
        # (not the observed counts) being at least 5 in every cell
        expected = (contingency_table.sum(axis=1, keepdims=True)
                    * contingency_table.sum(axis=0, keepdims=True)
                    / contingency_table.sum())

        if np.any(expected < 5):  # Small expected frequencies: use Fisher's Exact Test if 2x2
            if contingency_table.shape == (2, 2):
                _, p_value = fisher_exact(contingency_table)
                method = "Fisher's Exact Test"
            else:
                p_value = None
                method = "Not Applicable (table not 2x2)"
        else:
            _, p_value, _, _ = chi2_contingency(contingency_table)
            method = "Chi-Square Test"

        # Interpret significance
        if p_value is not None:
            significance = "significant" if p_value <= alpha else "not significant"
            interpretation = (
                "Train and test distributions differ significantly."
                if p_value <= alpha
                else "No significant difference in distributions."
            )
        else:
            significance = "not tested"
            interpretation = "Fisher's Exact Test not applicable for non-2x2 tables."

        # Store results
        results[column] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }

    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = pd.DataFrame({
        "Category1": np.random.choice(["A", "B", "C"], size=100, p=[0.5, 0.3, 0.2]),
        "Category2": np.random.choice(["X", "Y"], size=100, p=[0.6, 0.4]),
    })
    X_test = pd.DataFrame({
        "Category1": np.random.choice(["A", "B", "C"], size=100, p=[0.4, 0.4, 0.2]),
        "Category2": np.random.choice(["X", "Y"], size=100, p=[0.7, 0.3]),
    })

    # Perform the diagnostic test
    results = categorical_distribution_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Categorical Distribution Test Results:")
    for column, result in results.items():
        print(f"{column}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Categorical Distribution Test Results:
Category1:
- Method: Chi-Square Test
- P-Value: 0.12778404339578855
- Significance: not significant
- Interpretation: No significant difference in distributions.
Category2:
- Method: Chi-Square Test
- P-Value: 0.10294339894358191
- Significance: not significant
- Interpretation: No significant difference in distributions.
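Because the procedure runs one test per variable, a possible extension (not part of the function above) is to correct the significance threshold for multiple comparisons. A minimal sketch using a Bonferroni adjustment, with made-up p-values (`Category3` is a hypothetical third variable):

```python
# Hypothetical p-values, one per tested variable
p_values = {"Category1": 0.1278, "Category2": 0.1029, "Category3": 0.0150}

# Bonferroni correction: divide the significance level by the number of tests
alpha = 0.05
adjusted_alpha = alpha / len(p_values)

# Flag only the variables that remain significant at the adjusted threshold
flagged = {name: p < adjusted_alpha for name, p in p_values.items()}
print(f"Adjusted alpha: {adjusted_alpha:.4f}")
print(flagged)
```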