Glossary of Terms

Let’s ensure we’re talking about the same things:

  • Chosen Model: A Chosen Model is the specific machine learning algorithm or architecture selected as the final candidate for solving a particular problem after comparing different options during the model selection phase.
  • Dataset: A Dataset is a structured collection of data samples, where each sample contains features (input variables) and typically a target variable (for supervised learning), used to train and evaluate machine learning models.
  • Generalization Gap: The Generalization Gap is the difference between a model’s performance on the training data and its performance on previously unseen test data. A small gap indicates that the model generalizes well to new examples; a large gap suggests overfitting.
  • Leakage: Leakage occurs when information that would not be available at prediction time (for example, test set samples, statistics computed from them, or a feature derived from the target) influences the model training process, leading to unrealistically optimistic performance estimates that won’t hold in production.
  • Overfit: Overfit happens when a model learns the training data too precisely, including its noise and peculiarities, resulting in poor generalization to new, unseen data.
  • Performance: Performance is a quantitative measure of how well a model accomplishes its intended task, typically expressed through metrics like accuracy, precision, recall, or mean squared error.
  • Resampling: Resampling is a family of statistical methods that involve repeatedly drawing different samples from a dataset to estimate model properties or population parameters, with common approaches including bootstrapping and cross-validation.
  • Test Harness: A Test Harness is the complete experimental framework that includes data preprocessing, model training, validation procedures, and evaluation metrics used to assess and compare different models systematically.
  • Test Set: The Test Set is a portion of the dataset that is completely held out during training and only used once at the very end to provide an unbiased evaluation of the final model’s performance.
  • Train Set: The Train Set is the portion of the dataset used to train the model by adjusting its parameters through the optimization of a loss function.
  • Validation Set: The Validation Set is a portion of the dataset, separate from both training and test sets, used to tune hyperparameters and assess model performance during the development process before final testing.
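
To make several of these terms concrete, here is a minimal sketch in plain Python of how a Dataset might be split into a Train Set, Validation Set, and Test Set, and how a Generalization Gap could be measured. Everything in it is an illustrative assumption rather than a prescribed method: the toy dataset, the 60/20/20 split, and the helper names `split_dataset`, `fit_memorizer`, and `accuracy` are all hypothetical. The "model" is a 1-nearest-neighbour memorizer, chosen only because it overfits noisy labels in an obvious way.

```python
import random


def split_dataset(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy once, then cut it into train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_memorizer(train):
    """Hypothetical overfitting model: predict the label of the nearest training sample."""
    def predict(x):
        nearest = min(train, key=lambda sample: abs(sample[0] - x))
        return nearest[1]
    return predict


def accuracy(predict, samples):
    """Performance metric: fraction of samples whose label is predicted correctly."""
    correct = sum(1 for x, y in samples if predict(x) == y)
    return correct / len(samples)


# Toy dataset (an assumption for illustration): one feature x in [0, 1),
# target is 1 when x > 0.5, with roughly 20% of the labels flipped as noise.
data_rng = random.Random(42)
dataset = []
for _ in range(200):
    x = data_rng.random()
    y = int(x > 0.5)
    if data_rng.random() < 0.2:
        y = 1 - y
    dataset.append((x, y))

train, val, test = split_dataset(dataset)

# Fit on the train set only; the validation and test sets never reach the model.
model = fit_memorizer(train)

train_acc = accuracy(model, train)
val_acc = accuracy(model, val)    # would guide tuning during development
test_acc = accuracy(model, test)  # looked at once, at the very end

# Generalization gap: training performance minus held-out performance.
gap = train_acc - test_acc
print(f"train={train_acc:.2f} val={val_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

Because the memorizer reproduces every training label exactly, including the flipped ones, its training accuracy is perfect while its accuracy on held-out samples is lower. That difference is the Generalization Gap, and its size is the symptom described under Overfit.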