Remove Duplicates
A duplicate row in tabular data is an exact copy of another row: every feature value matches across all columns, including both predictor variables and the target variable if present.
In other words, every cell value is identical between the two rows. Depending on the context and data quality requirements, some practitioners also treat rows with matching feature values but different target values as duplicates.
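As a concrete illustration, here is a minimal pandas sketch (the frame, its column names, and the hypothetical `target` column are invented for this example) showing the two notions of a duplicate: an exact duplicate, where every column matches, and a feature-only duplicate, where only the predictor columns match:

```python
import pandas as pd

# Toy frame; "target" is a hypothetical label column used for illustration.
df = pd.DataFrame({
    "feature_a": [1.0, 1.0, 1.0],
    "feature_b": ["x", "x", "x"],
    "target":    [0,   0,   1],
})

# Exact duplicates: every column, including the target, must match.
print(df.duplicated(keep=False))  # rows 0 and 1 are exact duplicates

# Feature-only duplicates: compare predictor columns only, so row 2
# (same features, different target) is also flagged.
print(df.duplicated(subset=["feature_a", "feature_b"], keep=False))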
Reasons Duplicates Might Exist
- Data Entry Errors:
  - Manual entry of data may result in multiple identical records being recorded unintentionally.
- Data Merging or Consolidation:
  - Combining datasets without proper deduplication can lead to identical entries being repeated.
  - For instance, merging sales data from multiple branches might duplicate customer orders.
- System or Software Bugs:
  - Data pipelines or APIs might insert records multiple times during ingestion.
- Measurement or Observation Redundancy:
  - Multiple measurements or observations may appear identical due to rounding, identical input data, or repeated experiments.
- Data Augmentation or Sampling:
  - In synthetic datasets, augmentation techniques can inadvertently create duplicates.
- Multiple Sources Reporting the Same Information:
  - When data is collected from multiple sources, overlap or duplication can occur.
Problems Caused by Duplicates
Duplicate records in a dataset can cause problems.
It is helpful to consider duplicates in three specific scenarios:
- Duplicates within the train set.
- Duplicates within the test set.
- Duplicates across the train and test sets.
Let’s take a closer look at each of these cases.
1. Duplicate Records in the Training Set
- Statistical Problems:
  - Bias in Model Training: Duplicates act as over-weighted data points, skewing the learned relationships in favor of the repeated records (see the sketch below).
  - Overfitting: The model might focus excessively on the duplicated data and fail to generalize to unseen data.
- Impact on Metrics:
  - Inflated training performance as the model “memorizes” duplicates.
  - Poor generalization to the test set or real-world data.
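The over-weighting effect can be made concrete with a small sketch using scikit-learn (not part of the text above, used here purely for illustration): physically duplicating a row produces the same fitted model as assigning that row a sample weight of 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: duplicating a row has the same effect as giving it
# a sample weight of 2 in ordinary least squares.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 5.0])

# Fit with the third row physically duplicated.
X_dup = np.vstack([X, X[[2]]])
y_dup = np.append(y, y[2])
coef_dup = LinearRegression().fit(X_dup, y_dup).coef_

# Fit the original data with an explicit weight of 2 on the third row.
coef_weighted = LinearRegression().fit(X, y, sample_weight=[1, 1, 2]).coef_

print(np.allclose(coef_dup, coef_weighted))  # True: duplicates act as weights
```

In other words, duplicates quietly re-weight the training objective even though no explicit weights were ever set.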
2. Duplicate Records in the Test Set
- Statistical Problems:
  - Unrepresentative Evaluation: Duplicates do not reflect the variability of real-world data, leading to misleadingly high test scores.
  - Overweighted Impact of Certain Samples: If duplicates exist in the test set, some records disproportionately influence the evaluation metrics.
- Impact on Metrics:
  - Overestimation of the model’s performance due to redundant samples reinforcing similar predictions (illustrated in the sketch below).
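A tiny numeric sketch (the labels and predictions are invented for illustration) shows how duplicated test rows can inflate a metric: if the repeated rows happen to be ones the model predicts correctly, accuracy rises relative to the deduplicated test set.

```python
import numpy as np

# Illustrative labels and predictions; the last three test rows are exact
# copies of the third row, which the model happens to predict correctly.
y_true = np.array([1, 0, 1, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1])

acc_with_duplicates = (y_true == y_pred).mean()          # 5/6 ≈ 0.83
acc_deduplicated    = (y_true[:3] == y_pred[:3]).mean()  # 2/3 ≈ 0.67
print(acc_with_duplicates, acc_deduplicated)
```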
3. Duplicate Records Across Train and Test Sets
- Statistical Problems:
  - Data Leakage: The test set contains information that was already seen during training, compromising the integrity of the evaluation (a quick overlap check is sketched below).
  - Unrealistic Performance: Models perform unrealistically well because they have already “memorized” test examples during training.
- Impact on Metrics:
  - Overly optimistic metrics that fail to reflect real-world generalization.
  - Risk of deploying a model that performs poorly in production due to reliance on data it has already seen.
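One way to check for this kind of leakage, sketched here with hypothetical data and column names, is to look for rows that appear in both splits using a pandas merge on all columns:

```python
import pandas as pd

# Hypothetical, already-split data; column names are illustrative only.
train = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": ["x", "y", "z"], "target": [0, 1, 0]})
test  = pd.DataFrame({"feature_a": [3, 4],    "feature_b": ["z", "w"],      "target": [0, 1]})

# An inner merge on all common columns returns rows present in both splits.
overlap = train.merge(test, how="inner")
print(f"Rows shared by train and test: {len(overlap)}")

# One possible remedy: drop the leaked rows from the test set.
test_clean = (
    test.merge(train.drop_duplicates(), how="left", indicator=True)
        .query("_merge == 'left_only'")
        .drop(columns="_merge")
)
```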
Summary of Problems
| Scenario | Problem | Statistical Issues | Metric Implications |
|---|---|---|---|
| Duplicates in Training Set | Skewed learning, overfitting. | Biased feature relationships, underestimation of noise. | Inflated training accuracy, poor generalization. |
| Duplicates in Test Set | Unrepresentative evaluation. | Overweighted impact of certain samples. | Misleadingly high test scores. |
| Duplicates Across Train/Test | Data leakage, compromised evaluation integrity. | Unrealistic performance due to memorization. | Overestimated generalization, risky deployment. |
Best Practices to Handle Duplicates
Now that we know what duplicates are and the problems they can cause, let’s consider how we might handle them in our data:
Detect and Remove Before Splitting
- Identify exact duplicates using built-in functions in Pandas or NumPy (see the sketch after this list).
- Use domain knowledge or fuzzy matching for near-duplicates.
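For example, a minimal pandas sketch (with invented data and column names) that flags and drops exact duplicates might look like this:

```python
import pandas as pd

# Hypothetical raw data; in practice this would be the project's DataFrame.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 1.0, 3.0],
    "feature_b": ["x", "y", "x", "z"],
    "target":    [0,   1,   0,   1],
})

# Flag exact duplicates; keep="first" marks every copy after the first one.
dupe_mask = df.duplicated(keep="first")
print(f"Exact duplicate rows found: {dupe_mask.sum()}")  # 1

# Drop them, keeping the first occurrence of each duplicated row.
df_clean = df.drop_duplicates(keep="first").reset_index(drop=True)
```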
Analyze the Impact of Removal
- Ensure a sufficient sample size remains after removing duplicates (a quick audit such as the sketch below can help).
- Document changes to the dataset for reproducibility.
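Continuing the previous sketch (it reuses the `df` and `df_clean` names defined there), a quick audit of what the removal changed could look like:

```python
# Compare dataset size before and after deduplication.
n_removed = len(df) - len(df_clean)
print(f"Removed {n_removed} rows ({n_removed / len(df):.1%} of the data)")

# Check that the class balance has not shifted unacceptably.
print("Before:", df["target"].value_counts(normalize=True).to_dict())
print("After: ", df_clean["target"].value_counts(normalize=True).to_dict())
```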
Reasons for Not Removing Duplicates
While removing duplicates is generally advisable in machine learning projects, there are a few statistical and practical reasons why one might leave duplicates in the dataset before splitting.
These depend on the context, the nature of the duplicates, and the goals of the project. Here are some situations where leaving duplicates might make sense:
1. Reflecting the True Distribution of the Data
- Reason: Duplicates may be a natural part of the dataset and reflect real-world patterns.
- Example:
- A recommendation system where some items are naturally purchased more frequently by users. Removing duplicates could distort the importance of popular items.
- In healthcare data, repeated entries for the same diagnosis may reflect the prevalence of a condition.
- Impact: Leaving duplicates ensures the dataset retains its original distribution, which is critical if the model needs to generalize to similarly distributed data.
2. Duplicates Represent Important Weighting
- Reason: Duplicates may serve as a form of implicit weighting, where the frequency of entries signals their relative importance.
- Example:
- In customer feedback or surveys, repeated responses could indicate a higher weight for those responses (e.g., multiple complaints about the same issue).
- In time-series data, repeated observations at specific time points may emphasize significant events.
- Impact: Removing duplicates could inadvertently underrepresent critical patterns or signals in the data (see the sketch below for one way to preserve this weighting explicitly).
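If deduplication is still required for other reasons, one common compromise (sketched here with invented data; not something prescribed by the text above) is to collapse duplicates and keep their frequency as an explicit sample weight, which many estimators accept via a `sample_weight` argument:

```python
import pandas as pd

# Hypothetical survey-style data where repeated rows signal importance.
df = pd.DataFrame({
    "feature_a": [1, 1, 1, 2],
    "feature_b": ["x", "x", "x", "y"],
    "target":    [0, 0, 0, 1],
})

# Collapse duplicates but keep their frequency as an explicit weight.
weighted = (
    df.groupby(list(df.columns), as_index=False)
      .size()
      .rename(columns={"size": "sample_weight"})
)
print(weighted)
```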
3. Avoiding Information Loss
- Reason: Removing duplicates could lead to a loss of meaningful information if the duplicates are not exact or if additional contextual information is embedded in them.
- Example:
- Text datasets where duplicates may contain slight variations that are contextually important.
- Records from multiple data sources that appear duplicated but may have subtle differences that are relevant for the model.
- Impact: Retaining duplicates avoids discarding potentially valuable insights.
4. Retaining Sufficient Data for Small Datasets
- Reason: In small datasets, removing duplicates might lead to an unacceptably small training set, reducing the model’s ability to learn effectively.
- Example:
- In rare disease classification, duplicate samples might be crucial to maintain enough data points to train a model.
- In experiments where each observation is costly, duplicates may preserve the limited data available.
- Impact: Leaving duplicates ensures the model has enough data to learn meaningful patterns.
5. No Leakage or Bias Introduced by Duplicates
- Reason: If duplicates do not introduce leakage or bias (e.g., if they are naturally distributed across train and test sets or are very rare), removing them may not be necessary.
- Example:
- In a dataset where duplicates occur sporadically, their presence might not significantly impact the results or model generalization.
- Impact: The decision to leave duplicates in this case simplifies preprocessing without harming the model.
Key Considerations Before Leaving Duplicates
- Understand the Source of Duplicates:
  - Are they natural (e.g., reflecting real-world frequency) or artifacts (e.g., data collection errors)?
- Assess Potential Bias:
  - Will duplicates lead to overfitting or skewed evaluations? Evaluate using domain knowledge and exploratory analysis.
- Consider the Downstream Task:
  - If duplicates provide meaningful context for the task (e.g., emphasizing popular items or highlighting critical conditions), they may be valuable.
- Split After Deduplication (If Necessary):
  - Ensure that duplicates do not lead to leakage across training and test sets by splitting carefully (see the sketch after this list).
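A minimal sketch of this ordering, using pandas and scikit-learn's `train_test_split` with invented data, deduplicates first and then verifies that no identical rows are shared between the splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; deduplicate first so identical rows cannot end up
# on both sides of the split and leak information into the evaluation.
df = pd.DataFrame({
    "feature_a": [1, 2, 1, 3, 4, 5],
    "feature_b": ["x", "y", "x", "z", "w", "v"],
    "target":    [0, 1, 0, 1, 0, 1],
})

df = df.drop_duplicates().reset_index(drop=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Sanity check: no identical rows shared between the two splits.
assert train_df.merge(test_df, how="inner").empty
```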