Compare Numerical Variance
Procedure
This procedure evaluates whether the numerical input and target variables in the train and test sets have the same variance.
Test for Normality
- What to do: Assess whether the data in the train and test sets is normally distributed.
- Use the Shapiro-Wilk test or Anderson-Darling test for smaller datasets.
- For larger datasets, use the Kolmogorov-Smirnov test.
- Visualize distributions with histograms or Q-Q plots for additional context.
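The normality check above can be sketched as follows, assuming SciPy is available; the helper name `check_normality` and the 5,000-sample cutoff between tests are illustrative choices, not part of the procedure:

```python
import numpy as np
from scipy.stats import shapiro, kstest

def check_normality(sample, alpha=0.05):
    """Return True if the sample is plausibly normal at the given alpha."""
    sample = np.asarray(sample)
    if len(sample) <= 5000:
        # Shapiro-Wilk suits small-to-moderate samples
        _, p = shapiro(sample)
    else:
        # For large samples, compare standardized data against a standard normal
        standardized = (sample - sample.mean()) / sample.std(ddof=1)
        _, p = kstest(standardized, "norm")
    return bool(p > alpha)

rng = np.random.default_rng(0)
print(check_normality(rng.normal(size=200)))       # roughly normal sample
print(check_normality(rng.exponential(size=200)))  # strongly skewed sample
```

Pairing the test with a histogram or Q-Q plot is still advisable, since formal tests become oversensitive on large samples.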
Choose the Appropriate Variance Test
- What to do: Select the appropriate test based on the normality of the data.
- If both sets are normally distributed, use an F-test or Bartlett’s test to compare variances.
- If either set is not normally distributed, use Levene’s test, which is robust to departures from normality (the F-test and Bartlett’s test both assume normal data).
- Ensure the chosen test aligns with the data’s characteristics.
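SciPy does not ship a two-sample variance F-test (`scipy.stats.f_oneway` is one-way ANOVA, not a variance comparison), so the F-test branch can be sketched directly from the F distribution; the function name `f_test_variances` is an illustrative choice:

```python
import numpy as np
from scipy.stats import f

def f_test_variances(a, b):
    """Two-sided F-test that two normal samples share a common variance."""
    a, b = np.asarray(a), np.asarray(b)
    stat = np.var(a, ddof=1) / np.var(b, ddof=1)  # ratio of sample variances
    df1, df2 = len(a) - 1, len(b) - 1
    # Two-sided p-value: double the smaller tail probability, capped at 1
    p = 2 * min(f.cdf(stat, df1, df2), f.sf(stat, df1, df2))
    return stat, min(p, 1.0)

rng = np.random.default_rng(0)
print(f_test_variances(rng.normal(size=80), rng.normal(size=80)))
print(f_test_variances(rng.normal(size=80), rng.normal(scale=2, size=80)))
```

Remember that this test is only trustworthy when both samples passed the normality check in the previous step.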
Perform the Variance Test
- What to do: Apply the selected test to the numerical variables in the train and test sets.
- Conduct the test independently for each input feature and the target variable.
- Record the test statistics and p-values for analysis.
Interpret the Results
- What to do: Analyze the p-values from the tests to determine if there is a significant difference in variances.
- A p-value less than the chosen significance level (e.g., 0.05) suggests a statistically significant difference in variances.
- A p-value greater than the significance level indicates no significant variance difference, suggesting the sets may be compatible.
Report the Findings
- What to do: Document and summarize the results of the variance tests.
- Clearly state which variables showed significant variance differences and which did not.
- Include a brief explanation of the test methodology and its implications for model training or preprocessing adjustments.
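A minimal sketch of the reporting step, assuming the per-feature result dictionaries produced by the `variance_test` function in the code example; the helper name `summarize_variance_results` is illustrative:

```python
def summarize_variance_results(results, alpha=0.05):
    """Render per-feature variance-test results as a plain-text report."""
    lines = [f"{'Variable':<15} {'Method':<16} {'p-value':<9} Different variance?"]
    for name, r in results.items():
        flag = "yes" if r["P-Value"] <= alpha else "no"
        lines.append(f"{name:<15} {r['Method']:<16} {r['P-Value']:<9.4f} {flag}")
    return "\n".join(lines)

example = {
    "Feature 0": {"Method": "Bartlett's Test", "P-Value": 0.0110},
    "Feature 2": {"Method": "Bartlett's Test", "P-Value": 0.1632},
}
print(summarize_variance_results(example))
```

Flagged features are candidates for preprocessing adjustments such as rescaling, or for revisiting how the train/test split was drawn.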
Code Example
This Python function checks whether the train and test sets have the same variance for numerical variables using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import levene, bartlett, shapiro

def variance_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same variance for numerical variables.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Check normality of both samples
        _, p_train_normal = shapiro(train_feature)
        _, p_test_normal = shapiro(test_feature)

        if p_train_normal > alpha and p_test_normal > alpha:
            # Both distributions look normal: Bartlett's test is appropriate
            stat, p_value = bartlett(train_feature, test_feature)
            method = "Bartlett's Test"
        else:
            # At least one distribution is non-normal: Levene's test is robust
            stat, p_value = levene(train_feature, test_feature)
            method = "Levene's Test"

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test variances differ significantly."
            if p_value <= alpha
            else "No significant difference in variances."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation
        }
    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0, scale=1.2, size=(100, 5))  # Test set with different variance

    # Perform the diagnostic test
    results = variance_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Variance Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f"  - {key}: {value}")
Example Output
Variance Test Results:
Feature 0:
- Method: Bartlett's Test
- P-Value: 0.011036107642909448
- Significance: significant
- Interpretation: Train and test variances differ significantly.
Feature 1:
- Method: Bartlett's Test
- P-Value: 0.03895357883463975
- Significance: significant
- Interpretation: Train and test variances differ significantly.
Feature 2:
- Method: Bartlett's Test
- P-Value: 0.16321431131321001
- Significance: not significant
- Interpretation: No significant difference in variances.
Feature 3:
- Method: Bartlett's Test
- P-Value: 0.045181731478819084
- Significance: significant
- Interpretation: Train and test variances differ significantly.
Feature 4:
- Method: Bartlett's Test
- P-Value: 0.35878313013231045
- Significance: not significant
- Interpretation: No significant difference in variances.