Compare Distribution of Univariate Outliers
Procedure
This procedure evaluates whether the train and test sets have a similar distribution of univariate outliers in their numerical input and target variables.
Identify Univariate Outliers
- What to do: Detect univariate outliers in the numerical variables of the train and test datasets.
- Use methods like the 1.5x IQR rule (values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]) or z-scores (values with |z| > 3) to identify outliers.
- Perform this step for each numerical feature and the target variable independently.
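The detection step above can be sketched as follows. The helper names (`iqr_outlier_mask`, `zscore_outlier_mask`) and the sample array are illustrative, not part of the procedure; apply the same mask to each numerical column of both datasets.

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outlier_mask(x, threshold=3.0):
    """Boolean mask of values with |z| > threshold."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([1.0, 2.0, 2.5, 3.0, 2.2, 2.8, 50.0])
print(iqr_outlier_mask(x))  # only the extreme value 50.0 is flagged
```

Note that the two rules can disagree: a single extreme value inflates the mean and standard deviation, so the z-score rule may miss an outlier that the IQR rule catches.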
Count and Summarize Outliers
- What to do: Calculate the number and proportion of outliers in both the train and test sets for each variable.
- Summarize the outlier counts in a table, showing the percentage of outliers relative to the total number of observations.
- Visualize the distribution of outliers (e.g., boxplots or histograms) for better understanding.
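A minimal sketch of the summary table, using the 1.5×IQR rule from the previous step; the function name `summarize_outliers` and the dict-per-row layout are illustrative assumptions (a pandas DataFrame would serve equally well):

```python
import numpy as np

def summarize_outliers(X_train, X_test, k=1.5):
    """Per-feature outlier counts and percentages via the 1.5*IQR rule."""
    rows = []
    for j in range(X_train.shape[1]):
        counts = []
        for x in (X_train[:, j], X_test[:, j]):
            q1, q3 = np.percentile(x, [25, 75])
            iqr = q3 - q1
            n_out = int(((x < q1 - k * iqr) | (x > q3 + k * iqr)).sum())
            counts.append((n_out, 100.0 * n_out / len(x)))
        rows.append({"feature": j,
                     "train_n": counts[0][0], "train_pct": counts[0][1],
                     "test_n": counts[1][0], "test_pct": counts[1][1]})
    return rows

rng = np.random.default_rng(0)
table = summarize_outliers(rng.normal(size=(200, 3)), rng.normal(size=(100, 3)))
for row in table:
    print(row)
```

Reporting percentages alongside raw counts matters because the train and test sets usually differ in size.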
Choose the Statistical Test
- What to do: Select a test to compare the proportion of outliers between the train and test sets.
- To compare outlier counts or proportions, treat outlier/non-outlier as a 2x2 contingency table and use a chi-square test, or Fisher's exact test when expected cell counts are small.
- Alternatively, compare distributions of outlier magnitudes using non-parametric tests like the Mann-Whitney U test if appropriate.
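One way to sketch this selection rule, assuming a 2x2 contingency table of outlier vs. non-outlier counts; the function name and the `min_expected=5` cutoff (a common rule of thumb) are assumptions, not prescribed by the procedure:

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

def outlier_proportion_test(n_out_train, n_train, n_out_test, n_test,
                            min_expected=5):
    """Compare outlier proportions via a 2x2 contingency table.

    Falls back to Fisher's exact test when any expected cell count
    is below min_expected; otherwise uses the chi-square test.
    """
    table = np.array([[n_out_train, n_train - n_out_train],
                      [n_out_test, n_test - n_out_test]])
    expected = chi2_contingency(table)[3]  # expected cell frequencies
    if (expected < min_expected).any():
        _, p = fisher_exact(table)
        return "fisher", p
    _, p, _, _ = chi2_contingency(table)
    return "chi-square", p

print(outlier_proportion_test(2, 100, 8, 100))
```

The contingency-table formulation has the advantage of remaining valid when the train and test sets differ in size, unlike a direct goodness-of-fit test on the two raw counts.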
Perform the Statistical Test
- What to do: Apply the chosen statistical test to the summarized outlier data.
- Run the test independently for each variable.
- Record the p-values for each comparison to assess statistical significance.
Interpret the Results
- What to do: Analyze the p-values to determine if the distribution of outliers differs significantly between the train and test sets.
- A p-value below the significance threshold (e.g., 0.05) suggests a significant difference in the outlier distributions.
- Higher p-values indicate no significant difference, implying similar outlier patterns between the datasets.
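Because the test is run once per variable, interpreting each p-value at the raw threshold inflates the chance of a false positive across the family of comparisons. A minimal sketch of the simplest (and conservative) correction, Bonferroni; the function name is illustrative:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag per-variable p-values at a Bonferroni-adjusted threshold.

    Dividing alpha by the number of tests keeps the family-wise
    false-positive rate at or below alpha.
    """
    adjusted = alpha / len(p_values)
    return [p <= adjusted for p in p_values], adjusted

flags, threshold = bonferroni_significant([0.001, 0.02, 0.157, 0.317, 1.0])
print(threshold)
print(flags)
```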
Report the Findings
- What to do: Document the results and their implications clearly.
- Include which variables showed significant differences in outlier distributions and which did not.
- Explain how this might affect downstream tasks like model training or performance, and recommend any necessary preprocessing adjustments.
Code Example
This Python function checks whether the train and test sets have a similar distribution of univariate outliers for numerical variables and provides an interpretation of the results.
```python
import numpy as np
from scipy.stats import chisquare


def univariate_outlier_distribution_test(X_train, X_test, alpha=0.05, method="IQR"):
    """
    Test whether the train and test sets have a similar distribution of
    univariate outliers.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).
        method (str): Method to detect outliers ("IQR" or "Z-Score").

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Detect outliers
        if method == "IQR":
            q1_train, q3_train = np.percentile(train_feature, [25, 75])
            q1_test, q3_test = np.percentile(test_feature, [25, 75])
            iqr_train = q3_train - q1_train
            iqr_test = q3_test - q1_test
            train_outliers = ((train_feature < q1_train - 1.5 * iqr_train) |
                              (train_feature > q3_train + 1.5 * iqr_train)).sum()
            test_outliers = ((test_feature < q1_test - 1.5 * iqr_test) |
                             (test_feature > q3_test + 1.5 * iqr_test)).sum()
        elif method == "Z-Score":
            z_train = (train_feature - train_feature.mean()) / train_feature.std()
            z_test = (test_feature - test_feature.mean()) / test_feature.std()
            train_outliers = (np.abs(z_train) > 3).sum()
            test_outliers = (np.abs(z_test) > 3).sum()
        else:
            raise ValueError("Invalid method. Choose 'IQR' or 'Z-Score'.")

        # Combine counts into an array
        outlier_counts = np.array([train_outliers, test_outliers])

        # Chi-square goodness-of-fit test of the two counts against an equal
        # split. Note: this assumes the train and test sets are of comparable
        # size; for very unequal sizes, a 2x2 contingency test on outlier vs.
        # non-outlier counts is more appropriate.
        if outlier_counts.sum() == 0:
            # No outliers in either set: the distributions trivially agree,
            # and chisquare() would return nan here.
            p_value = 1.0
        else:
            _, p_value = chisquare(outlier_counts)

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test outlier distributions differ significantly."
            if p_value <= alpha
            else "No significant difference in outlier distributions."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Train Outliers": train_outliers,
            "Test Outliers": test_outliers,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation,
        }
    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = univariate_outlier_distribution_test(X_train, X_test, alpha=0.05, method="IQR")

    # Print results
    print("Univariate Outlier Distribution Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
```

Example Output
```
Univariate Outlier Distribution Test Results:
Feature 0:
 - Train Outliers: 0
 - Test Outliers: 2
 - P-Value: 0.15729920705028105
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 1:
 - Train Outliers: 0
 - Test Outliers: 0
 - P-Value: 1.0
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 2:
 - Train Outliers: 1
 - Test Outliers: 3
 - P-Value: 0.31731050786291115
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 3:
 - Train Outliers: 1
 - Test Outliers: 1
 - P-Value: 1.0
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 4:
 - Train Outliers: 1
 - Test Outliers: 0
 - P-Value: 0.31731050786291115
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
```