Compare Numerical Central Tendencies
Do numerical input/target variables have the same central tendencies?
Procedure
This procedure evaluates whether the train and test sets have the same central tendencies in their numerical input and target variables.
Test for Normality
- What to do: Determine whether the data in the train and test sets follows a normal distribution (see the sketch below).
- Use the Shapiro-Wilk test or the Anderson-Darling test for small datasets.
- For larger datasets, the Kolmogorov-Smirnov test is also an option.
- Visualize distributions with histograms or Q-Q plots to support interpretation.
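The snippet below is a minimal sketch of this normality check using SciPy, assuming the variable is available as a one-dimensional NumPy array; the sample drawn here is purely illustrative.

import numpy as np
from scipy.stats import shapiro, anderson, kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # illustrative sample

# Shapiro-Wilk: well suited to small samples; p > 0.05 is consistent with normality
stat_sw, p_sw = shapiro(sample)
print(f"Shapiro-Wilk: statistic={stat_sw:.3f}, p-value={p_sw:.3f}")

# Anderson-Darling: compare the statistic against the tabulated critical values
result_ad = anderson(sample, dist="norm")
print(f"Anderson-Darling: statistic={result_ad.statistic:.3f}, critical values={result_ad.critical_values}")

# Kolmogorov-Smirnov against a normal with the sample's own mean and std
# (an approximation, since the parameters are estimated from the data)
stat_ks, p_ks = kstest(sample, "norm", args=(sample.mean(), sample.std()))
print(f"Kolmogorov-Smirnov: statistic={stat_ks:.3f}, p-value={p_ks:.3f}")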
Choose the Appropriate Statistical Test
- What to do: Decide on the test based on the normality of the data distributions (a short sketch of the decision rule follows this step).
- If both sets are normally distributed, use a t-test to compare the means.
- If at least one set is not normally distributed, use the Mann-Whitney U test, a rank-based test commonly interpreted as a comparison of medians.
- Ensure the test is applied to both input features and target variables.
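As a minimal sketch of this decision rule (the full function in the Code Example below applies the same logic feature by feature), assuming two one-dimensional NumPy arrays; the helper name compare_central_tendency is illustrative.

import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def compare_central_tendency(train_values, test_values, alpha=0.05):
    # Normality check on both samples (Shapiro-Wilk)
    _, p_train = shapiro(train_values)
    _, p_test = shapiro(test_values)
    if p_train > alpha and p_test > alpha:
        # Both look normal: compare means with a t-test
        return "t-Test", ttest_ind(train_values, test_values).pvalue
    # Otherwise fall back to the rank-based Mann-Whitney U test
    return "Mann-Whitney U Test", mannwhitneyu(train_values, test_values, alternative="two-sided").pvalue

rng = np.random.default_rng(1)
method, p = compare_central_tendency(rng.normal(0, 1, 150), rng.normal(0.3, 1, 150))
print(method, p)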
Perform the Statistical Test
- What to do: Run the chosen statistical test on the numerical variables of interest (see the sketch below).
- Perform the test independently for each input feature and for the target variable.
- Record the p-value of each test to assess statistical significance.
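A sketch of this step under the assumption that the features are columns of NumPy arrays X_train / X_test and the target is a numeric y_train / y_test (the names and synthetic data are illustrative); each variable gets its own test and its p-value is recorded.

import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(0, 1, (200, 3)), rng.normal(10, 2, 200)
X_test, y_test = rng.normal(0.1, 1, (80, 3)), rng.normal(10.5, 2, 80)

columns = {f"feature_{i}": (X_train[:, i], X_test[:, i]) for i in range(X_train.shape[1])}
columns["target"] = (y_train, y_test)

p_values = {}
for name, (a, b) in columns.items():
    # Pick the test based on normality of both samples, then record the p-value
    normal = shapiro(a).pvalue > 0.05 and shapiro(b).pvalue > 0.05
    result = ttest_ind(a, b) if normal else mannwhitneyu(a, b, alternative="two-sided")
    p_values[name] = result.pvalue

print(p_values)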
Interpret the Results
- What to do: Evaluate the p-values to determine whether the train and test sets have similar distributions (see the sketch below).
- A p-value below the chosen significance level (e.g., 0.05) indicates a statistically significant difference between the distributions.
- A p-value at or above the threshold suggests no significant difference, meaning the sets may have similar central tendencies.
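A minimal sketch of the interpretation step, assuming a dictionary of p-values like the one built in the previous sketch (the values here are illustrative).

alpha = 0.05
p_values = {"feature_0": 0.62, "feature_1": 0.009, "target": 0.31}  # illustrative values

for name, p in p_values.items():
    if p < alpha:
        print(f"{name}: p={p:.3f} < {alpha} -> distributions differ significantly")
    else:
        print(f"{name}: p={p:.3f} >= {alpha} -> no significant difference detected")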
Report the Findings
- What to do: Document the test results and conclusions clearly and concisely (see the sketch below).
- Summarize which variables showed significant differences and which did not.
- Include a brief explanation of the test used and the implications for model performance or further data preprocessing steps.
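One way to document the outcome is a small summary table; the sketch below assumes pandas is available and reuses the illustrative methods and p-values from the earlier sketches.

import pandas as pd

alpha = 0.05
results = {
    "feature_0": {"method": "t-Test", "p_value": 0.62},
    "feature_1": {"method": "Mann-Whitney U Test", "p_value": 0.009},
    "target": {"method": "t-Test", "p_value": 0.31},
}  # illustrative values

# Build a per-variable report and flag significant train/test differences
report = pd.DataFrame(results).T
report["significant"] = report["p_value"] < alpha
print(report)
print("Variables with significant train/test differences:", list(report.index[report["significant"]]))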
Code Example
This Python function checks whether the train and test sets have the same central tendencies for numerical variables using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, shapiro
def central_tendency_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same central tendencies for numerical variables.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Check normality of both samples with the Shapiro-Wilk test
        _, p_train_normal = shapiro(train_feature)
        _, p_test_normal = shapiro(test_feature)

        if p_train_normal > alpha and p_test_normal > alpha:
            # Both distributions are normal, use t-test
            stat, p_value = ttest_ind(train_feature, test_feature)
            method = "t-Test"
        else:
            # At least one distribution is non-normal, use Mann-Whitney U test
            stat, p_value = mannwhitneyu(train_feature, test_feature)
            method = "Mann-Whitney U Test"

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test distributions differ significantly."
            if p_value <= alpha
            else "No significant difference in distributions."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation,
        }
    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = central_tendency_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Central Tendency Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Central Tendency Test Results:
Feature 0:
- Method: t-Test
- P-Value: 0.07475507958378685
- Significance: not significant
- Interpretation: No significant difference in distributions.
Feature 1:
- Method: t-Test
- P-Value: 0.8803772375021313
- Significance: not significant
- Interpretation: No significant difference in distributions.
Feature 2:
- Method: t-Test
- P-Value: 0.009324464083976927
- Significance: significant
- Interpretation: Train and test distributions differ significantly.
Feature 3:
- Method: t-Test
- P-Value: 0.569418532941534
- Significance: not significant
- Interpretation: No significant difference in distributions.
Feature 4:
- Method: t-Test
- P-Value: 0.006720979244255432
- Significance: significant
- Interpretation: Train and test distributions differ significantly.