Simulating an Active Learning Cycle
To understand how an Active Learning (AL) cycle works in practice, you will walk through a minimal simulation using a synthetic dataset. The setup includes a small pool of data points, where only a few are initially labeled. In each iteration of the AL cycle, you will use uncertainty sampling to select the most informative unlabeled point, label it, and update your model. This process repeats for several rounds, demonstrating how the model improves as more data is selectively labeled.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Split into initial labeled and unlabeled pools
initial_idx = np.random.choice(range(100), size=5, replace=False)
labeled_idx = list(initial_idx)
unlabeled_idx = list(set(range(100)) - set(labeled_idx))

# Initialize model
model = RandomForestClassifier(random_state=42)

for iteration in range(5):
    # Train on current labeled set
    model.fit(X[labeled_idx], y[labeled_idx])

    # Predict probabilities on the unlabeled pool
    probs = model.predict_proba(X[unlabeled_idx])

    # Use uncertainty sampling: select sample with probability closest to 0.5
    uncertainty = np.abs(probs[:, 1] - 0.5)
    query_idx = np.argmin(uncertainty)

    # Add queried sample to labeled pool
    new_label_idx = unlabeled_idx[query_idx]
    labeled_idx.append(new_label_idx)
    unlabeled_idx.remove(new_label_idx)

    # Print current progress
    y_pred = model.predict(X)
    acc = accuracy_score(y, y_pred)
    print(f"Iteration {iteration+1}: Labeled samples = {len(labeled_idx)}, Accuracy = {acc:.2f}")
```
This code demonstrates a full active learning cycle using a synthetic dataset and uncertainty sampling.
1. Generating the Synthetic Dataset
- The code uses make_classification from scikit-learn to create a simple binary classification dataset with 100 samples and 2 features;
- All features are informative, and there is no redundant information;
- The dataset is reproducible thanks to the fixed random_state parameter (a quick sanity check on the generated data is sketched below).
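To verify what make_classification produced, you can inspect the array shapes and the class balance. A minimal standalone sketch that regenerates the same dataset with the same random_state:

```python
import numpy as np
from sklearn.datasets import make_classification

# Regenerate the same dataset used in the main example
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

print(X.shape)         # (100, 2): 100 samples, 2 features
print(np.bincount(y))  # per-class counts; roughly balanced by default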
2. Splitting into Labeled and Unlabeled Pools
- A small subset of 5 samples is randomly selected as the initial labeled pool;
- The remaining 95 samples form the unlabeled pool;
- These pools simulate a real-world scenario where only a handful of data points are initially labeled (a fully reproducible version of this split is sketched below).
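Note that the np.random.choice call in the main example is not seeded, so the initial labeled pool differs between runs. A minimal sketch of a reproducible split; the seed value and the class-balance guard are illustrative additions, not part of the original code:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

rng = np.random.default_rng(42)  # arbitrary seed, for run-to-run reproducibility

initial_idx = rng.choice(100, size=5, replace=False)
# Re-draw if all 5 initial samples share one class; otherwise predict_proba
# would return a single column and probs[:, 1] in the loop would fail
while len(set(y[initial_idx])) < 2:
    initial_idx = rng.choice(100, size=5, replace=False)

labeled_idx = list(initial_idx)
unlabeled_idx = [i for i in range(100) if i not in set(labeled_idx)]
```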
3. Model Initialization
- A RandomForestClassifier is created for binary classification;
- The model will be retrained in each active learning iteration as new labels are acquired (a sketch of swapping in a different classifier follows below).
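Because the query strategy only relies on predict_proba, the random forest is not essential; any scikit-learn classifier that exposes class probabilities can be dropped into the same loop. A minimal sketch using LogisticRegression as an illustrative substitute (this choice is not part of the original example):

```python
from sklearn.linear_model import LogisticRegression

# Any probabilistic classifier works with the same uncertainty-sampling loop;
# logistic regression is a fast alternative that often produces
# reasonably calibrated probabilities on simple 2D data
model = LogisticRegression(random_state=42, max_iter=1000)
```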
4. Active Learning Loop with Uncertainty Sampling
- The loop runs for 5 iterations, simulating 5 rounds of active learning;
- In each iteration:
- The model is trained on the currently labeled data;
- Predictions (class probabilities) are made for all unlabeled samples;
- Uncertainty sampling is used: the sample whose predicted probability is closest to 0.5 (the most uncertain prediction) is selected;
- This most uncertain sample is 'queried' (its true label is revealed), added to the labeled pool, and removed from the unlabeled pool;
- The model is retrained with the expanded labeled set (query strategies that generalize beyond binary problems are sketched after this list).
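The |p - 0.5| trick works only for binary classification. Two standard generalizations are margin sampling (smallest gap between the top two class probabilities) and entropy sampling (highest predictive entropy); in the binary case both reduce to the same choice as the 0.5 rule. A minimal sketch, assuming probs is the (n_samples, n_classes) array returned by predict_proba:

```python
import numpy as np

def margin_query(probs):
    # Smallest gap between the two most probable classes = most uncertain
    top_two = np.sort(probs, axis=1)[:, -2:]
    margins = top_two[:, 1] - top_two[:, 0]
    return int(np.argmin(margins))

def entropy_query(probs):
    # Highest predictive entropy = most uncertain
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))
```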
5. Tracking Model Accuracy
- After each iteration, the model predicts labels for all samples;
- The current accuracy is calculated and printed, showing how performance improves as new, informative samples are labeled;
- The output demonstrates that even with a small number of labeled samples, active learning can quickly boost model accuracy by focusing on the most informative data points (a held-out evaluation variant is sketched below).
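One caveat: the accuracy above is computed over all 100 samples, including the points the model was trained on, which can make the numbers look slightly optimistic. A minimal variant that scores each round on a held-out test split the model never trains on (the 30% split size and the seeds are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Hold out 30% of the data purely for evaluation
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
labeled_idx = list(rng.choice(len(X_pool), size=5, replace=False))
while len(set(y_pool[labeled_idx])) < 2:  # ensure both classes are present
    labeled_idx = list(rng.choice(len(X_pool), size=5, replace=False))
unlabeled_idx = [i for i in range(len(X_pool)) if i not in set(labeled_idx)]

model = RandomForestClassifier(random_state=42)
for iteration in range(5):
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])
    probs = model.predict_proba(X_pool[unlabeled_idx])
    query_idx = int(np.argmin(np.abs(probs[:, 1] - 0.5)))
    labeled_idx.append(unlabeled_idx.pop(query_idx))
    # Score only on data the model never saw during training
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Iteration {iteration+1}: held-out accuracy = {acc:.2f}")
```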
This simulation highlights the core advantage of active learning: efficiently improving a model with minimal labeled data by strategically selecting what to label next.
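To check that claim empirically, you can run the same loop twice with the same labeling budget, once querying the most uncertain point and once querying a random point, and compare the final accuracies. A minimal sketch of such a comparison (the helper run_cycle and its defaults are illustrative, and on a dataset this small the gap between the two strategies may be modest):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

def run_cycle(strategy, n_rounds=20, seed=0):
    """Run one active learning cycle; strategy is 'uncertainty' or 'random'."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(100, size=5, replace=False))
    while len(set(y[labeled])) < 2:  # ensure both classes in the initial pool
        labeled = list(rng.choice(100, size=5, replace=False))
    unlabeled = [i for i in range(100) if i not in set(labeled)]
    model = RandomForestClassifier(random_state=42)
    for _ in range(n_rounds):
        model.fit(X[labeled], y[labeled])
        if strategy == "uncertainty":
            probs = model.predict_proba(X[unlabeled])
            pick = int(np.argmin(np.abs(probs[:, 1] - 0.5)))
        else:
            pick = int(rng.integers(len(unlabeled)))
        labeled.append(unlabeled.pop(pick))
    model.fit(X[labeled], y[labeled])
    return accuracy_score(y, model.predict(X))

print(f"Uncertainty sampling: {run_cycle('uncertainty'):.2f}")
print(f"Random sampling:      {run_cycle('random'):.2f}")
```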