Frequently Asked Questions
Q. Isn’t this just undergrad stats?
Yes, applied systematically.
Q. Do I need to perform all of these checks?
No. Use the checks that are appropriate for your project. For example, checks that use k-fold cross-validation may not be appropriate for a deep learning model that takes days/weeks to fit. Use your judgement.
Q. Are these checks overkill?
Yes, probably.
Q. What’s the 80/20?
Follow best practices in the split procedure. Distribution checks will catch most problems on a pre-defined test set. Avoid test set leakage with test harness best practices like pipelines and nested cross-validation. Avoid reporting a single final point estimate of performance by bootstrapping the test set.
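For example, here is a minimal sketch of a leakage-safe harness: preprocessing lives inside a scikit-learn Pipeline and hyperparameters are tuned with nested cross-validation, so the outer folds never see tuning decisions. The dataset and parameter grid are illustrative placeholders.

```python
# Leakage-safe evaluation sketch: preprocessing inside a Pipeline,
# hyperparameter tuning inside nested cross-validation.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# The scaler is fit on each training fold only, preventing leakage into test folds.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # placeholder grid

inner = KFold(n_splits=3, shuffle=True, random_state=2)  # tuning folds
outer = KFold(n_splits=5, shuffle=True, random_state=3)  # evaluation folds

search = GridSearchCV(pipe, grid, cv=inner, scoring="accuracy")
scores = cross_val_score(search, X, y, cv=outer, scoring="accuracy")
print(f"Nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```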
Q. What’s the most common cause of the problem?
The most common cause is a poor/fragile test harness (use best practices!). After that, the cause is statistical noise, which is ironed out with more robust performance estimates (e.g. bootstrap the test set and compare distributions/confidence intervals). After that, it's some kind of test set leakage or a test set provided by a third party/stakeholder with a differing distribution.
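As a sketch of the bootstrap idea: resample the test set with replacement many times and report a distribution and confidence interval of the metric rather than a single point score. It assumes a fitted `model` and NumPy arrays `X_test` and `y_test` already exist in your project.

```python
# Bootstrap the test set to turn a point estimate into a distribution + CI.
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_scores(model, X_test, y_test, n_boot=1000, seed=42):
    rng = np.random.default_rng(seed)
    n = len(y_test)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample test rows with replacement
        scores.append(accuracy_score(y_test[idx], model.predict(X_test[idx])))
    return np.array(scores)

scores = bootstrap_scores(model, X_test, y_test)
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"Test accuracy: {scores.mean():.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Two models evaluated this way can then be compared by their score distributions or overlapping confidence intervals instead of by two single numbers.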
Q. Don’t I risk test set leakage and overfitting if I evaluate a model on the test set more than once?
Yes. Strictly, the test set must be discarded after being used once. You can choose to bend this rule at your own risk. If you use these checks at the beginning of a project and carefully omit any summary stats/plots of the data and of model performance on the test set (e.g. only report the outcomes of the statistical tests), then I suspect the risk is minimal.
Q. Do all of these checks help?
No. There is a lot of redundancy across the categories and the checks. This is a good thing, as it gives many different perspectives on evidence gathering and more opportunities to catch the cause of the fault. Also, the specifics of some projects make some avenues of checking available and others not.
Q. Don’t we always expect a generalization gap?
Perhaps. But how do you know the gap you see is reasonable for your specific predictive modeling task and test harness?
Q. Doesn’t R/scikit-learn/LLMs/XGBoost/AGI/etc. solve this for me?
No.
Q. Does this really matter?
Yes, perhaps. It comes down to a judgement call, like most things.
Q. Do you do this on your own projects?
Yeah, a lot of it. I want to be able to (metaphorically) stand tall, put my hand on my heart, and testify that the test set results are as correct and indicative of real world performance as I know how to make them.
Q. Why do we sometimes use a paired t-Test and sometimes not (and the Mann-Whitney U Test vs. the Wilcoxon signed-rank test)?
When comparing distribution central tendencies on train vs test folds under k-fold cross-validation, we use a paired statistical hypothesis test like the paired Student's t-Test or the Wilcoxon signed-rank test because the train/test folds are naturally dependent upon each other. If we did not account for this dependence, we would misestimate the significance. When the two data distributions are independent, we can use an independent test such as the (unpaired) Student's t-Test or the Mann-Whitney U Test.
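A minimal sketch of the distinction, using synthetic scores: `train_scores` and `test_scores` stand in for per-fold metrics from the same k-fold split (paired), while `sample_a` and `sample_b` stand in for two independent samples.

```python
# Paired vs. unpaired hypothesis tests on model performance scores.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon, ttest_ind, mannwhitneyu

rng = np.random.default_rng(7)
train_scores = rng.normal(0.85, 0.02, 10)                  # per-fold train metric
test_scores = train_scores - rng.normal(0.03, 0.01, 10)    # same folds: paired

# Paired tests: the fold-wise dependence is accounted for.
print(ttest_rel(train_scores, test_scores))   # paired Student's t-Test
print(wilcoxon(train_scores, test_scores))    # Wilcoxon signed-rank test

# Independent samples (e.g. two unrelated score distributions): unpaired tests.
sample_a = rng.normal(0.80, 0.03, 30)
sample_b = rng.normal(0.78, 0.03, 30)
print(ttest_ind(sample_a, sample_b))          # unpaired Student's t-Test
print(mannwhitneyu(sample_a, sample_b))       # Mann-Whitney U Test
```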
Q. What about measures of model complexity to help diagnose overfitting (e.g. AIC/BIC/MDL/etc.)?
I’ve not had much luck with these measures in practice. Also, classical ideas of overfitting were focused on “overparameterization”. I’m not convinced this is entirely relevant with modern machine learning and deep learning models. We often overparameterize and still achieve SOTA performance. Put another way: Overfitting != Overparameterization. That being said, regularizing away from large magnitude coefficients/weights remains a good practice when a model is overfit.
Q. The test result is ambiguous, what do I do?
This is the case more often than not. Don’t panic, here are some ideas:
- Can you increase the sensitivity of the test (e.g. a larger sample, a stricter p-value threshold, etc.)?
- Can you perform a sensitivity analysis for the test and see if there is a point of transition (see the sketch after this list)?
- Can you try a different but related test to see if it provides less ambiguous results?
- Can you try a completely different test type to see if it gives a stronger signal?
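A minimal sketch of the sensitivity-analysis idea: repeat a test while growing the sample size and watch for the point where the decision flips. The two synthetic populations stand in for whatever score/feature distributions you are comparing.

```python
# Sensitivity analysis: re-run a two-sample test at increasing sample sizes
# and look for a transition point in the decision.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(11)
pop_a = rng.normal(0.80, 0.05, 5000)  # placeholder population A
pop_b = rng.normal(0.79, 0.05, 5000)  # placeholder population B

for n in (20, 50, 100, 200, 500, 1000):
    a = rng.choice(pop_a, n, replace=False)
    b = rng.choice(pop_b, n, replace=False)
    p = mannwhitneyu(a, b).pvalue
    print(f"n={n:4d}  p={p:.4f}  {'reject H0' if p < 0.05 else 'fail to reject H0'}")
```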
Q. All checks suggest my test set is no good, what do I do?
Good, we learned something about your project! See the interventions.
Q. Can you recommend further reading to learn more about this type of analysis?
Yes, read these books:
- Statistics in Plain English (2016). Get up to speed on parametric statistics.
- Nonparametric Statistics for Non-Statisticians (2009). Get up to speed on non-parametric statistics.
- Introduction to the New Statistics (2024). Get up to speed on estimation statistics.
- Empirical Methods for Artificial Intelligence (1995). This is an old favorite that will give some practical grounding.