Glossary of Terms
Let’s ensure we’re talking about the same things:
- Chosen Model: A Chosen Model is the specific machine learning algorithm or architecture selected as the final candidate for solving a particular problem after comparing different options during the model selection phase.
- Dataset: A Dataset is a structured collection of data samples, where each sample contains features (input variables) and typically a target variable (for supervised learning), used to train and evaluate machine learning models.
- Generalization Gap: The Generalization Gap is the difference between a model’s performance on the training data and its performance on previously unseen test data; a large gap indicates that the model generalizes poorly to new examples.
- Leakage: Leakage occurs when information that would not be available at prediction time, such as data from the test set or a proxy for the target variable, inappropriately influences the model training process, leading to unrealistically optimistic performance estimates that won’t hold in production.
- Overfit: Overfit happens when a model learns the training data too precisely, including its noise and peculiarities, resulting in poor generalization to new, unseen data.
- Performance: Performance is a quantitative measure of how well a model accomplishes its intended task, typically expressed through metrics like accuracy, precision, recall, or mean squared error.
- Resampling: Resampling is a family of statistical methods that involve repeatedly drawing different samples from a dataset to estimate model properties or population parameters, with common approaches including bootstrapping and cross-validation.
- Test Harness: A Test Harness is the complete experimental framework that includes data preprocessing, model training, validation procedures, and evaluation metrics used to assess and compare different models systematically.
- Test Set: The Test Set is a portion of the dataset that is completely held out during training and used only once, at the very end, to provide an unbiased evaluation of the final model’s performance.
- Train Set: The Train Set is the portion of the dataset used to train the model by adjusting its parameters through the optimization of a loss function.
- Validation Set: The Validation Set is a portion of the dataset, separate from both training and test sets, used to tune hyperparameters and assess model performance during the development process before final testing.
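
To make several of these terms concrete, here is a minimal sketch in plain Python that splits a toy Dataset into a Train Set, Validation Set, and Test Set, then computes a Generalization Gap. Everything in it is an assumption for illustration: the helper names (`split_dataset`, `predict_1nn`, `accuracy`), the 60/20/20 split, the synthetic data with 10% label noise, and the choice of a 1-nearest-neighbor model, which memorizes the training data and therefore tends to Overfit.

```python
import random


def split_dataset(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the samples and partition them into train, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def predict_1nn(train, x):
    """Return the target of the training sample whose feature is closest to x."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]


def accuracy(train, eval_set):
    """Performance metric: fraction of eval_set samples predicted correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in eval_set)
    return correct / len(eval_set)


# Hypothetical toy dataset: one feature x and a binary target with ~10% label noise.
noise = random.Random(42)
dataset = [(x, int(x >= 50) ^ int(noise.random() < 0.1)) for x in range(100)]

train, val, test = split_dataset(dataset)

train_acc = accuracy(train, train)  # the model has memorized these samples
val_acc = accuracy(train, val)      # used during development, e.g. for tuning
test_acc = accuracy(train, test)    # looked at once, at the very end

gap = train_acc - test_acc          # generalization gap
print(f"train={train_acc:.2f} val={val_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

Because each feature value is unique, the memorizing model scores perfectly on the Train Set, so any shortfall on the Test Set shows up directly as the gap. In practice a library routine would normally handle the splitting; the hand-written version here only keeps the sketch self-contained.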