Robustness to Label Noise
Procedure
This procedure evaluates the robustness of a classification model by analyzing its performance variance when class labels in the target variable are flipped. This helps determine whether the model is overly sensitive to perturbations in the target labels, which could indicate overfitting or poor generalization.
Define Parameters for Label Flipping
- What to do: Decide the proportion of target labels to flip.
- Choose a small percentage (e.g., 5% or 10%) of class labels in the test set to flip at random.
- Ensure the flipping process maintains the class balance to avoid introducing systematic bias (a minimal sketch follows this step).
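The exact flipping routine is up to you; the sketch below shows one minimal way to flip a fixed proportion of labels while drawing the flipped instances evenly from each class. It assumes NumPy arrays, and the helper name flip_labels_balanced is hypothetical.

import numpy as np

def flip_labels_balanced(y, flip_proportion, seed=None):
    """Illustrative sketch: flip roughly the same proportion of labels within each class."""
    rng = np.random.default_rng(seed)
    y_flipped = y.copy()
    classes = np.unique(y)
    for cls in classes:
        cls_indices = np.where(y == cls)[0]
        n_flip = int(len(cls_indices) * flip_proportion)
        flip_idx = rng.choice(cls_indices, size=n_flip, replace=False)
        for idx in flip_idx:
            # Reassign the instance to a different, randomly chosen class
            y_flipped[idx] = rng.choice(classes[classes != cls])
    return y_flipped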
Generate Flipped Label Datasets
- What to do: Create multiple versions of the test dataset with flipped labels.
- Randomly select instances from the test set and flip their class labels according to the defined proportion.
- Generate at least 5-10 test sets with different random flips for a representative analysis (see the sketch below).
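A minimal sketch of generating several flipped-label copies of the test targets, assuming a y_test array and the hypothetical flip_labels_balanced helper sketched above (any equivalent flipping routine works):

# Generate 10 flipped-label versions of y_test at a 5% flip proportion;
# a different seed per copy yields a different random flip each time.
n_versions = 10
flip_proportion = 0.05
flipped_label_sets = [
    flip_labels_balanced(y_test, flip_proportion, seed=i) for i in range(n_versions)
]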
Evaluate the Model on Flipped Label Datasets
- What to do: Test the model on each of the flipped label datasets.
- Use the same test set structure, model configuration, and performance metric as the original evaluation.
- Keep the evaluation process consistent so that the effect of label flipping is isolated (see the sketch below).
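A sketch of the evaluation loop, assuming a fitted scikit-learn-style model, the flipped_label_sets list from the previous sketch, and accuracy as the metric. Because only the labels change, the predictions can be computed once and scored against each flipped label set:

from sklearn.metrics import accuracy_score

# Predictions depend only on X_test, so compute them once
y_pred = model.predict(X_test)

# Score the same predictions against each set of flipped labels
scores = [accuracy_score(y_flipped, y_pred) for y_flipped in flipped_label_sets]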
Record Performance Metrics
- What to do: Capture the performance metrics for the model on each flipped label dataset.
- Use metrics such as accuracy, precision, recall, or F1-score to measure performance.
- Organize the results systematically, such as in a table or spreadsheet, for easy comparison (see the sketch below).
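One possible way to organize the recorded scores, assuming pandas is available and using the scores list from the previous sketch:

import pandas as pd

# One row per flipped-label run, ready for comparison or export
results_table = pd.DataFrame({
    "run": range(1, len(scores) + 1),
    "flip_proportion": flip_proportion,
    "accuracy": scores,
})
print(results_table)
# results_table.to_csv("label_flip_results.csv", index=False)  # optional export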
Calculate Variance in Performance
- What to do: Compute the variance (or standard deviation) of the performance metrics across all flipped label datasets.
- Use statistical tools (e.g., Pandas, NumPy) to calculate the variance.
- Assess whether the variance is modest (indicating robustness) or substantial (indicating sensitivity), as sketched below.
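A minimal sketch of the variance calculation with NumPy, again using the scores list from the evaluation step:

import numpy as np

scores_arr = np.asarray(scores)
variance = np.var(scores_arr)    # variance of the metric across flipped-label runs
std_dev = np.std(scores_arr)     # standard deviation, on the same scale as the metric
mean_score = np.mean(scores_arr)

print(f"Mean accuracy: {mean_score:.4f}, variance: {variance:.6f}, std dev: {std_dev:.4f}")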
Interpret Variance Results
- What to do: Analyze the performance variance to assess robustness.
- Low variance suggests the model is robust to perturbations in the target variable.
- High variance indicates sensitivity to label noise, which may signal overfitting or reliance on specific patterns in the training data (see the sketch below).
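What counts as "low" variance is application-dependent; the thresholds below are illustrative assumptions (matching the ones used in the full code example later in this section), not universal cut-offs:

# Illustrative interpretation of the variance computed above
if variance < 0.01:
    interpretation = "High robustness: performance barely varies under label flipping."
elif variance < 0.05:
    interpretation = "Moderate robustness: noticeable but acceptable variation."
else:
    interpretation = "Low robustness: the model is sensitive to label noise."
print(interpretation)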
Report the Findings
- What to do: Summarize the results and their implications.
- Present the computed variance alongside the individual performance metrics for each flipped label dataset.
- Highlight trends, such as significant drops in performance, and recommend next steps (e.g., strengthening regularization, collecting higher-quality labeled data, or reconsidering the model’s complexity).
Code Example
This Python function evaluates whether a classification model is robust by analyzing the variance in performance metrics when class labels in the target variable are flipped.
import numpy as np
from sklearn.metrics import accuracy_score


def robustness_with_label_flipping(X, y, model, metric, flip_proportions, n_runs):
    """
    Evaluate a model's robustness by analyzing performance variance when target labels are flipped.

    Parameters:
        X (np.ndarray): Features of the test dataset.
        y (np.ndarray): Target variable of the test dataset.
        model: Pre-trained machine learning model to evaluate.
        metric (callable): Performance metric function.
        flip_proportions (list): List of proportions of target labels to flip.
        n_runs (int): Number of datasets to generate for each flip proportion.

    Returns:
        list: One dictionary per flip proportion containing the flip proportion,
              variance, mean performance, and interpretation.
    """
    results = []
    for flip_prop in flip_proportions:
        performance_scores = []
        for _ in range(n_runs):
            # Create a copy of the target variable
            y_flipped = y.copy()

            # Randomly flip a proportion of the labels
            n_flip = int(len(y) * flip_prop)
            flip_indices = np.random.choice(len(y), n_flip, replace=False)
            unique_classes = np.unique(y)
            for idx in flip_indices:
                current_label = y_flipped[idx]
                new_label = np.random.choice(unique_classes[unique_classes != current_label])
                y_flipped[idx] = new_label

            # Predict and evaluate the model against the flipped labels
            y_pred = model.predict(X)
            score = metric(y_flipped, y_pred)
            performance_scores.append(score)

        # Calculate variance and mean performance across the runs
        variance = np.var(performance_scores)
        mean_performance = np.mean(performance_scores)

        # Interpret the results
        if variance < 0.01:
            interpretation = "High robustness: very low performance variance at this flip proportion."
        elif variance < 0.05:
            interpretation = "Moderate robustness: acceptable performance variance at this flip proportion."
        else:
            interpretation = "Low robustness: high sensitivity to label flipping at this level."

        results.append({
            "Flip Proportion": flip_prop,
            "Variance": variance,
            "Mean Performance": mean_performance,
            "Interpretation": interpretation,
        })
    return results


# Demo Usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=42
    )

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the model
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X_train, y_train)

    # Perform robustness test with label flipping
    results = robustness_with_label_flipping(
        X_test, y_test, model, accuracy_score, [0.0, 0.05, 0.1, 0.2], 30
    )

    # Print results
    print("Robustness Test Results with Label Flipping:")
    for res in results:
        for key, value in res.items():
            print(f"{key}: {value}")
        print()
Example Output
Robustness Test Results with Label Flipping:
Flip Proportion: 0.0
Variance: 1.232595164407831e-32
Mean Performance: 0.8900000000000001
Interpretation: High robustness: very low performance variance at this flip proportion.
Flip Proportion: 0.05
Variance: 0.00020622222222222255
Mean Performance: 0.8506666666666666
Interpretation: High robustness: very low performance variance at this flip proportion.
Flip Proportion: 0.1
Variance: 0.00040933333333333213
Mean Performance: 0.8119999999999999
Interpretation: High robustness: very low performance variance at this flip proportion.
Flip Proportion: 0.2
Variance: 0.0005315555555555564
Mean Performance: 0.7353333333333334
Interpretation: High robustness: very low performance variance at this flip proportion.