Compare Numerical Central Tendencies
Do numerical input/target variables have the same central tendencies?
Procedure
This procedure evaluates whether the train and test sets have the same central tendencies in their numerical input and target variables.
Test for Normality
- What to do: Determine whether the data in the train and test sets follows a normal distribution (see the sketch below).
- Use the Shapiro-Wilk test or the Anderson-Darling test for small datasets.
- For larger datasets, the Kolmogorov-Smirnov test is also an option.
- Visualize distributions with histograms or Q-Q plots to support interpretation.
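The snippet below is a minimal sketch of this normality check using SciPy, assuming the variable is available as a one-dimensional NumPy array; the sample drawn here is purely illustrative.

import numpy as np
from scipy.stats import shapiro, anderson, kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # illustrative sample

# Shapiro-Wilk: well suited to small samples; p > 0.05 is consistent with normality
stat_sw, p_sw = shapiro(sample)
print(f"Shapiro-Wilk: statistic={stat_sw:.3f}, p-value={p_sw:.3f}")

# Anderson-Darling: compare the statistic against the tabulated critical values
result_ad = anderson(sample, dist="norm")
print(f"Anderson-Darling: statistic={result_ad.statistic:.3f}, critical values={result_ad.critical_values}")

# Kolmogorov-Smirnov against a normal with the sample's own mean and std
# (an approximation, since the parameters are estimated from the data)
stat_ks, p_ks = kstest(sample, "norm", args=(sample.mean(), sample.std()))
print(f"Kolmogorov-Smirnov: statistic={stat_ks:.3f}, p-value={p_ks:.3f}")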
Choose the Appropriate Statistical Test
- What to do: Decide on the test based on the normality of the data distributions (a short sketch of the decision rule follows this step).
- If both sets are normally distributed, use a t-test to compare the means.
- If at least one set is not normally distributed, use the Mann-Whitney U test, a rank-based test commonly interpreted as a comparison of medians.
- Ensure the test is applied to both input features and target variables.
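As a minimal sketch of this decision rule (the full function in the Code Example below applies the same logic feature by feature), assuming two one-dimensional NumPy arrays; the helper name compare_central_tendency is illustrative.

import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def compare_central_tendency(train_values, test_values, alpha=0.05):
    # Normality check on both samples (Shapiro-Wilk)
    _, p_train = shapiro(train_values)
    _, p_test = shapiro(test_values)
    if p_train > alpha and p_test > alpha:
        # Both look normal: compare means with a t-test
        return "t-Test", ttest_ind(train_values, test_values).pvalue
    # Otherwise fall back to the rank-based Mann-Whitney U test
    return "Mann-Whitney U Test", mannwhitneyu(train_values, test_values, alternative="two-sided").pvalue

rng = np.random.default_rng(1)
method, p = compare_central_tendency(rng.normal(0, 1, 150), rng.normal(0.3, 1, 150))
print(method, p)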
Perform the Statistical Test
- What to do: Run the chosen statistical test on the numerical variables of interest (see the sketch below).
- Perform the test independently for each input feature and for the target variable.
- Record the p-value of each test to assess statistical significance.
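A sketch of this step under the assumption that the features are columns of NumPy arrays X_train / X_test and the target is a numeric y_train / y_test (the names and synthetic data are illustrative); each variable gets its own test and its p-value is recorded.

import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(0, 1, (200, 3)), rng.normal(10, 2, 200)
X_test, y_test = rng.normal(0.1, 1, (80, 3)), rng.normal(10.5, 2, 80)

columns = {f"feature_{i}": (X_train[:, i], X_test[:, i]) for i in range(X_train.shape[1])}
columns["target"] = (y_train, y_test)

p_values = {}
for name, (a, b) in columns.items():
    # Pick the test based on normality of both samples, then record the p-value
    normal = shapiro(a).pvalue > 0.05 and shapiro(b).pvalue > 0.05
    result = ttest_ind(a, b) if normal else mannwhitneyu(a, b, alternative="two-sided")
    p_values[name] = result.pvalue

print(p_values)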
Interpret the Results
- What to do: Evaluate the p-values to determine whether the train and test sets have similar distributions (see the sketch below).
- A p-value below the chosen significance level (e.g., 0.05) indicates a statistically significant difference between the distributions.
- A p-value at or above the threshold suggests no significant difference, meaning the sets may have similar central tendencies.
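A minimal sketch of the interpretation step, assuming a dictionary of p-values like the one built in the previous sketch (the values here are illustrative).

alpha = 0.05
p_values = {"feature_0": 0.62, "feature_1": 0.009, "target": 0.31}  # illustrative values

for name, p in p_values.items():
    if p < alpha:
        print(f"{name}: p={p:.3f} < {alpha} -> distributions differ significantly")
    else:
        print(f"{name}: p={p:.3f} >= {alpha} -> no significant difference detected")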
Report the Findings
- What to do: Document the test results and conclusions clearly and concisely (see the sketch below).
- Summarize which variables showed significant differences and which did not.
- Include a brief explanation of the test used and the implications for model performance or further data preprocessing steps.
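One way to document the outcome is a small summary table; the sketch below assumes pandas is available and reuses the illustrative methods and p-values from the earlier sketches.

import pandas as pd

alpha = 0.05
results = {
    "feature_0": {"method": "t-Test", "p_value": 0.62},
    "feature_1": {"method": "Mann-Whitney U Test", "p_value": 0.009},
    "target": {"method": "t-Test", "p_value": 0.31},
}  # illustrative values

# Build a per-variable report and flag significant train/test differences
report = pd.DataFrame(results).T
report["significant"] = report["p_value"] < alpha
print(report)
print("Variables with significant train/test differences:", list(report.index[report["significant"]]))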
Code Example
This Python function checks whether the train and test sets have the same central tendencies for numerical variables using statistical tests and provides an interpretation of the results.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, shapiro
def central_tendency_test(X_train, X_test, alpha=0.05):
    """
    Test whether the train and test sets have the same central tendencies for numerical variables.

    Parameters:
        X_train (np.ndarray): Train set features.
        X_test (np.ndarray): Test set features.
        alpha (float): P-value threshold for statistical significance (default=0.05).

    Returns:
        dict: Dictionary with test results and interpretation for each feature.
    """
    results = {}
    for feature_idx in range(X_train.shape[1]):
        train_feature = X_train[:, feature_idx]
        test_feature = X_test[:, feature_idx]

        # Check normality of both samples with the Shapiro-Wilk test
        _, p_train_normal = shapiro(train_feature)
        _, p_test_normal = shapiro(test_feature)

        if p_train_normal > alpha and p_test_normal > alpha:
            # Both distributions are normal, use t-test
            stat, p_value = ttest_ind(train_feature, test_feature)
            method = "t-Test"
        else:
            # At least one distribution is non-normal, use Mann-Whitney U test
            stat, p_value = mannwhitneyu(train_feature, test_feature)
            method = "Mann-Whitney U Test"

        # Interpret significance
        significance = "significant" if p_value <= alpha else "not significant"
        interpretation = (
            "Train and test distributions differ significantly."
            if p_value <= alpha
            else "No significant difference in distributions."
        )

        # Store results
        results[f"Feature {feature_idx}"] = {
            "Method": method,
            "P-Value": p_value,
            "Significance": significance,
            "Interpretation": interpretation,
        }
    return results

# Demo Usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.normal(loc=0, scale=1, size=(100, 5))
    X_test = np.random.normal(loc=0.2, scale=1.2, size=(100, 5))  # Slightly shifted test set

    # Perform the diagnostic test
    results = central_tendency_test(X_train, X_test, alpha=0.05)

    # Print results
    print("Central Tendency Test Results:")
    for feature, result in results.items():
        print(f"{feature}:")
        for key, value in result.items():
            print(f" - {key}: {value}")
Example Output
Central Tendency Test Results:
Feature 0:
- Method: t-Test
- P-Value: 0.07475507958378685
- Significance: not significant
- Interpretation: No significant difference in distributions.
Feature 1:
- Method: t-Test
- P-Value: 0.8803772375021313
- Significance: not significant
- Interpretation: No significant difference in distributions.
Feature 2:
- Method: t-Test
- P-Value: 0.009324464083976927
- Significance: significant
- Interpretation: Train and test distributions differ significantly.
Feature 3:
- Method: t-Test
- P-Value: 0.569418532941534
- Significance: not significant
- Interpretation: No significant difference in distributions.
Feature 4:
- Method: t-Test
- P-Value: 0.006720979244255432
- Significance: significant
- Interpretation: Train and test distributions differ significantly.