Resampling Classes with Ensemble Models

1/31/2025 | By Saksham Adhikari

When dealing with imbalanced datasets, one effective strategy is to combine resampling techniques with an ensemble learning framework. Here's how this works:

  • Resampling: This involves altering the dataset's composition to achieve a more balanced class distribution. There are two primary types of resampling:
    • Oversampling: Increasing the number of instances in the minority class by replicating existing samples or generating synthetic ones. Simply duplicating samples can lead to overfitting, so methods like SMOTE, which interpolates new synthetic samples between existing minority examples, are often preferred (a short code sketch follows this list).
    • Undersampling: Reducing the number of instances in the majority class. This can involve randomly removing examples from the majority class or using more sophisticated methods that select which samples to remove based on certain criteria to maintain dataset integrity.
  • Ensemble Models: Instead of using a single model, ensemble methods combine several models to improve predictive performance. Here's how resampling ties into ensemble learning:
    • Different Ratios of Classes: You create multiple training sets where the ratio of minority to majority class varies. For instance, one subset might have a 1:1 ratio, another might have a 1:2 ratio, and so on. Each of these subsets is used to train a different model within the ensemble.
    • Bagging (Bootstrap Aggregating): With bagging, you can apply resampling to create multiple datasets with different class balances. Each dataset trains a unique model, and their predictions are combined (usually by voting for classification or averaging for regression).
    • Boosting: In boosting, subsequent models focus on the examples that previous models misclassified. If you resample to change class balance with each iteration, you can pay more attention to minority class instances, gradually correcting for the imbalance.
    • Random Forests: While inherently an ensemble method, you can manipulate the sampling of each tree to include different ratios of classes or use techniques like balanced random forests where each tree is built on a balanced bootstrap sample.
  • Implementation:
    • For each model in the ensemble, you would:
      • Resample your dataset to achieve a different class balance.
      • Train a classifier or regressor on this resampled data.
    • When making predictions:
      • Each model in the ensemble provides its prediction.
      • Aggregate these predictions (e.g., majority voting for classification, averaging for regression) to get the final prediction.
  • Advantages:
    • Reduces bias towards the majority class since each model sees different aspects of the dataset's distribution.
    • Potentially improves model generalization by focusing on different problem areas through different class balances.
  • Challenges:
    • Can be computationally expensive due to training multiple models.
    • Requires careful tuning of class ratios and ensemble parameters to avoid overfitting or underfitting.
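To make the ideas above concrete, here is a minimal sketch of the two resampling techniques plus two resampling-based ensembles. It assumes scikit-learn and the imbalanced-learn (imblearn) library; the dataset, class ratios, and estimator counts are purely illustrative, not a recommended configuration.

```python
# A minimal sketch of the resampling techniques and resampling-based ensembles
# described above. Assumes scikit-learn and imbalanced-learn (imblearn);
# the dataset and all ratios are illustrative.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier

# Illustrative imbalanced dataset: roughly 1% minority class.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Oversampling: SMOTE synthesizes minority samples until the
# minority:majority ratio reaches 1:2 (sampling_strategy=0.5).
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)

# Undersampling: randomly drop majority samples until the classes are 1:1.
X_under, y_under = RandomUnderSampler(sampling_strategy=1.0, random_state=0).fit_resample(X, y)

# Bagging with resampling: each base estimator is trained on a rebalanced
# bootstrap sample, and predictions are combined by voting.
bagging = BalancedBaggingClassifier(n_estimators=25, random_state=0).fit(X, y)

# Balanced random forest: every tree is grown on a balanced bootstrap sample.
forest = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```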

Now let's make this more intuitive.

Example: Detecting Rare Diseases

Imagine you're a doctor trying to develop a model to diagnose a rare disease, let's call it "Zeta." Only 1% of the population has Zeta, making your dataset heavily imbalanced.

1. The Problem:

  • A standard model trained on this data might simply learn to predict "no Zeta" for almost every case, because that is what it mostly sees, leading to poor detection of actual Zeta cases.

2. Resampling with Ensemble Models:

  • Scenario 1: Original Data - You start with your original dataset, where 99% of cases are non-Zeta and 1% are Zeta.
  • Scenario 2: Oversampling for Model A - For your first model (Model A), you increase the number of Zeta cases by duplicating or generating synthetic Zeta cases until you have a 50:50 split. Now, this model has seen more Zeta cases than in the real world, but it will be very sensitive to detecting Zeta.
  • Scenario 3: Undersampling for Model B - For another model (Model B), you reduce the number of non-Zeta cases until you have, say, a 10:90 split (10% Zeta, 90% non-Zeta). This model will be trained on less data but with a higher emphasis on recognizing Zeta.
  • Scenario 4: Balanced Mix for Model C - For a third model (Model C), you might aim for a 25:75 split, providing a moderate balance.
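As a rough sketch, the three training sets above could be built like this, assuming the imbalanced-learn samplers and a synthetic stand-in for real patient data (label 1 marks Zeta). The sampling_strategy values express each target split as a minority-to-majority ratio.

```python
# Illustrative construction of the three training sets for Models A, B, and C.
# Assumes imbalanced-learn; make_classification is a stand-in for real patient
# data, with label 1 marking Zeta cases (~1% of the data).
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Model A: oversample Zeta until the split is 50:50 (minority:majority ratio 1.0).
X_a, y_a = SMOTE(sampling_strategy=1.0, random_state=0).fit_resample(X, y)

# Model B: undersample non-Zeta until Zeta makes up ~10% (ratio 10/90).
X_b, y_b = RandomUnderSampler(sampling_strategy=10 / 90, random_state=0).fit_resample(X, y)

# Model C: undersample non-Zeta to a 25:75 split (ratio 25/75).
X_c, y_c = RandomUnderSampler(sampling_strategy=25 / 75, random_state=0).fit_resample(X, y)
```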

3. Ensemble Learning:

  • Training: Each of these models (A, B, C) is trained on its unique dataset.
    • Model A might be overly sensitive to Zeta but could catch cases others miss.
    • Model B might miss some cases but won't be overwhelmed by non-Zeta data.
    • Model C provides a middle ground.
  • Prediction: When diagnosing a patient:
    • Model A says "Zeta."
    • Model B says "No Zeta."
    • Model C says "Zeta."
  • Voting: In an ensemble, these votes get combined. If you use majority voting, you'd diagnose "Zeta" because two out of three models agreed.
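Continuing the sketch above, the three resampled sets train three placeholder classifiers, and a simple majority vote turns their individual diagnoses into the final one. Logistic regression is just an illustrative choice of base model.

```python
# Majority-vote ensemble over Models A, B, and C, continuing directly from the
# previous sketch (X_a/y_a, X_b/y_b, X_c/y_c are the 50:50, 10:90, and 25:75
# training sets built there). Logistic regression is a placeholder base model.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = [
    LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for X_train, y_train in [(X_a, y_a), (X_b, y_b), (X_c, y_c)]
]

def diagnose(X_new):
    """Predict 1 ("Zeta") wherever at least two of the three models vote Zeta."""
    votes = np.stack([m.predict(X_new) for m in models])  # shape: (3, n_patients)
    return (votes.sum(axis=0) >= 2).astype(int)

# Example: diagnoses for the first five patients in the original dataset.
print(diagnose(X[:5]))
```

In practice you might also average predicted probabilities (soft voting) instead of counting hard votes, which tends to behave more smoothly when the models are reasonably well calibrated.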

4. Intuition:

  • Diversity in Perspective: By training different models on varied data compositions, you're effectively consulting multiple doctors who've seen different mixes of cases. One might be quick to flag the disease (Model A), another might be more conservative (Model B), but together they provide a balanced perspective.
  • Real-World Application: This method is akin to how a panel of doctors might diagnose a complex condition. Each doctor looks at the case from their unique experience (which in our analogy is the unique dataset they've been "trained" on), and together they make a more accurate diagnosis.
  • Outcome: This approach helps in not missing rare cases (Zeta) while still considering the common ones, improving overall diagnostic accuracy without being swayed too much by the imbalance in the dataset.

This ensemble approach with different class ratios ensures that the models aren't just learning from the majority class but are forced to learn from the minority class as well, each from a different vantage point, leading to a more nuanced and effective diagnostic tool for rare conditions.