Compare Distribution of Univariate Outliers
Procedure
This procedure evaluates whether the train and test sets have a similar distribution of univariate outliers in their numerical input and target variables.
Identify Univariate Outliers
- What to do: Detect univariate outliers in the numerical variables of the train and test datasets.
- Use methods like the 1.5x IQR rule (values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]) or z-scores (values with |z| > 3) to identify outliers.
- Perform this step for each numerical feature and the target variable independently.
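The detection step above can be sketched as follows. The helper names (`iqr_outlier_mask`, `zscore_outlier_mask`) and the sample array are illustrative, not part of the procedure; apply the same mask to each numerical column of both datasets.

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outlier_mask(x, threshold=3.0):
    """Boolean mask of values with |z| > threshold."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([1.0, 2.0, 2.5, 3.0, 2.2, 2.8, 50.0])
print(iqr_outlier_mask(x))  # only the extreme value 50.0 is flagged
```

Note that the two rules can disagree: a single extreme value inflates the mean and standard deviation, so the z-score rule may miss an outlier that the IQR rule catches.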
Count and Summarize Outliers
- What to do: Calculate the number and proportion of outliers in both the train and test sets for each variable.
- Summarize the outlier counts in a table, showing the percentage of outliers relative to the total number of observations.
- Visualize the distribution of outliers (e.g., boxplots or histograms) for better understanding.
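A minimal sketch of the summary table, using the 1.5×IQR rule from the previous step; the function name `summarize_outliers` and the dict-per-row layout are illustrative assumptions (a pandas DataFrame would serve equally well):

```python
import numpy as np

def summarize_outliers(X_train, X_test, k=1.5):
    """Per-feature outlier counts and percentages via the 1.5*IQR rule."""
    rows = []
    for j in range(X_train.shape[1]):
        counts = []
        for x in (X_train[:, j], X_test[:, j]):
            q1, q3 = np.percentile(x, [25, 75])
            iqr = q3 - q1
            n_out = int(((x < q1 - k * iqr) | (x > q3 + k * iqr)).sum())
            counts.append((n_out, 100.0 * n_out / len(x)))
        rows.append({"feature": j,
                     "train_n": counts[0][0], "train_pct": counts[0][1],
                     "test_n": counts[1][0], "test_pct": counts[1][1]})
    return rows

rng = np.random.default_rng(0)
table = summarize_outliers(rng.normal(size=(200, 3)), rng.normal(size=(100, 3)))
for row in table:
    print(row)
```

Reporting percentages alongside raw counts matters because the train and test sets usually differ in size.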
Choose the Statistical Test
- What to do: Select a test to compare the proportion of outliers between the train and test sets.
- To compare outlier counts or proportions, treat outlier/non-outlier as a 2x2 contingency table and use a chi-square test, or Fisher's exact test when expected cell counts are small.
- Alternatively, compare distributions of outlier magnitudes using non-parametric tests like the Mann-Whitney U test if appropriate.
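One way to sketch this selection rule, assuming a 2x2 contingency table of outlier vs. non-outlier counts; the function name and the `min_expected=5` cutoff (a common rule of thumb) are assumptions, not prescribed by the procedure:

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

def outlier_proportion_test(n_out_train, n_train, n_out_test, n_test,
                            min_expected=5):
    """Compare outlier proportions via a 2x2 contingency table.

    Falls back to Fisher's exact test when any expected cell count
    is below min_expected; otherwise uses the chi-square test.
    """
    table = np.array([[n_out_train, n_train - n_out_train],
                      [n_out_test, n_test - n_out_test]])
    expected = chi2_contingency(table)[3]  # expected cell frequencies
    if (expected < min_expected).any():
        _, p = fisher_exact(table)
        return "fisher", p
    _, p, _, _ = chi2_contingency(table)
    return "chi-square", p

print(outlier_proportion_test(2, 100, 8, 100))
```

The contingency-table formulation has the advantage of remaining valid when the train and test sets differ in size, unlike a direct goodness-of-fit test on the two raw counts.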
Perform the Statistical Test
- What to do: Apply the chosen statistical test to the summarized outlier data.
- Run the test independently for each variable.
- Record the p-values for each comparison to assess statistical significance.
Interpret the Results
- What to do: Analyze the p-values to determine if the distribution of outliers differs significantly between the train and test sets.
- A p-value below the significance threshold (e.g., 0.05) suggests a significant difference in the outlier distributions.
- Higher p-values indicate no significant difference, implying similar outlier patterns between the datasets.
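Because the test is run once per variable, interpreting each p-value at the raw threshold inflates the chance of a false positive across the family of comparisons. A minimal sketch of the simplest (and conservative) correction, Bonferroni; the function name is illustrative:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag per-variable p-values at a Bonferroni-adjusted threshold.

    Dividing alpha by the number of tests keeps the family-wise
    false-positive rate at or below alpha.
    """
    adjusted = alpha / len(p_values)
    return [p <= adjusted for p in p_values], adjusted

flags, threshold = bonferroni_significant([0.001, 0.02, 0.157, 0.317, 1.0])
print(threshold)
print(flags)
```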
Report the Findings
- What to do: Document the results and their implications clearly.
- Include which variables showed significant differences in outlier distributions and which did not.
- Explain how this might affect downstream tasks like model training or performance, and recommend any necessary preprocessing adjustments.
Code Example
This Python function checks whether the train and test sets have a similar distribution of univariate outliers for numerical variables and provides an interpretation of the results.
```python
import numpy as np
from scipy.stats import chisquare


def univariate_outlier_distribution_test(X_train, X_test, alpha=0.05, method="IQR"):
    """
    Test whether the train and test sets have a similar distribution of
    univariate outliers.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).
        method (str): Method to detect outliers ("IQR" or "Z-Score").

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Detect outliers
        if method == "IQR":
            q1_train, q3_train = np.percentile(train_feature, [25, 75])
            q1_test, q3_test = np.percentile(test_feature, [25, 75])
            iqr_train = q3_train - q1_train
            iqr_test = q3_test - q1_test
            train_outliers = ((train_feature < q1_train - 1.5 * iqr_train) |
                              (train_feature > q3_train + 1.5 * iqr_train)).sum()
            test_outliers = ((test_feature < q1_test - 1.5 * iqr_test) |
                             (test_feature > q3_test + 1.5 * iqr_test)).sum()
        elif method == "Z-Score":
            z_train = (train_feature - train_feature.mean()) / train_feature.std()
            z_test = (test_feature - test_feature.mean()) / test_feature.std()
            train_outliers = (np.abs(z_train) > 3).sum()
            test_outliers = (np.abs(z_test) > 3).sum()
        else:
            raise ValueError("Invalid method. Choose 'IQR' or 'Z-Score'.")

        # Combine counts into an array
        outlier_counts = np.array([train_outliers, test_outliers])

        # Chi-square goodness-of-fit test of the two counts against an equal
        # split. Note: this assumes the train and test sets are of comparable
        # size; for very unequal sizes, a 2x2 contingency test on outlier vs.
        # non-outlier counts is more appropriate.
        if outlier_counts.sum() == 0:
            # No outliers in either set: the distributions trivially agree,
            # and chisquare() would return nan here.
            p_value = 1.0
        else:
            _, p_value = chisquare(outlier_counts)

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test outlier distributions differ significantly."
            if p_value <= alpha
            else "No significant difference in outlier distributions."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Train Outliers": train_outliers,
            "Test Outliers": test_outliers,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation,
        }
    return results


# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = univariate_outlier_distribution_test(X_train, X_test, alpha=0.05, method="IQR")

    # Print results
    print("Univariate Outlier Distribution Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
```

Example Output
```
Univariate Outlier Distribution Test Results:
Feature 0:
 - Train Outliers: 0
 - Test Outliers: 2
 - P-Value: 0.15729920705028105
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 1:
 - Train Outliers: 0
 - Test Outliers: 0
 - P-Value: 1.0
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 2:
 - Train Outliers: 1
 - Test Outliers: 3
 - P-Value: 0.31731050786291115
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 3:
 - Train Outliers: 1
 - Test Outliers: 1
 - P-Value: 1.0
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
Feature 4:
 - Train Outliers: 1
 - Test Outliers: 0
 - P-Value: 0.31731050786291115
 - Significance: not significant
 - Interpretation: No significant difference in outlier distributions.
```