Active Learning with Python

Simulating An Active Learning Cycle

To understand how an Active Learning (AL) cycle works in practice, you will walk through a minimal simulation using a synthetic dataset. The setup includes a small pool of data points, where only a few are initially labeled. In each iteration of the AL cycle, you will use uncertainty sampling to select the most informative unlabeled point, label it, and update your model. This process repeats for several rounds, demonstrating how the model improves as more data is selectively labeled.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Split into initial labeled and unlabeled pools
initial_idx = np.random.choice(range(100), size=5, replace=False)
labeled_idx = list(initial_idx)
unlabeled_idx = list(set(range(100)) - set(labeled_idx))

# Initialize model
model = RandomForestClassifier(random_state=42)

for iteration in range(5):
    # Train on current labeled set
    model.fit(X[labeled_idx], y[labeled_idx])

    # Predict probabilities on the unlabeled pool
    probs = model.predict_proba(X[unlabeled_idx])

    # Use uncertainty sampling: select sample with probability closest to 0.5
    uncertainty = np.abs(probs[:, 1] - 0.5)
    query_idx = np.argmin(uncertainty)

    # Add queried sample to labeled pool
    new_label_idx = unlabeled_idx[query_idx]
    labeled_idx.append(new_label_idx)
    unlabeled_idx.remove(new_label_idx)

    # Print current progress
    y_pred = model.predict(X)
    acc = accuracy_score(y, y_pred)
    print(f"Iteration {iteration+1}: Labeled samples = {len(labeled_idx)}, Accuracy = {acc:.2f}")

This code demonstrates a full active learning cycle using a synthetic dataset and uncertainty sampling.

1. Generating the Synthetic Dataset

  • The code uses make_classification from scikit-learn to create a simple binary classification dataset with 100 samples and 2 features;
  • All features are informative, and there is no redundant information;
  • The dataset is reproducible thanks to the fixed random_state parameter; a quick way to inspect the generated arrays is sketched after this list.
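
If you want to verify these properties yourself, a quick inspection could look like the following sketch, which simply reuses the X and y arrays created above:

print(X.shape)         # (100, 2): 100 samples, 2 features
print(np.bincount(y))  # number of samples in each of the two classes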

2. Splitting into Labeled and Unlabeled Pools

  • A small subset of 5 samples is randomly selected as the initial labeled pool (np.random.choice is not seeded here, so the starting pool varies between runs; a seeded alternative is sketched after this list);
  • The remaining 95 samples form the unlabeled pool;
  • These pools simulate a real-world scenario where only a handful of data points are initially labeled.
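
If you want the whole simulation to be reproducible from run to run, one option is to draw the initial pool with a seeded generator. The sketch below is an alternative to the original split; the rng and labeled_set names are additions, not part of the lesson code:

rng = np.random.default_rng(42)                       # seeded generator for a reproducible initial pool
initial_idx = rng.choice(100, size=5, replace=False)  # 5 initial labeled indices
labeled_idx = list(initial_idx)
labeled_set = set(labeled_idx)
unlabeled_idx = [i for i in range(100) if i not in labeled_set]  # keeps the original index order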

3. Model Initialization

  • A RandomForestClassifier is created for binary classification;
  • The model will be retrained in each active learning iteration as new labels are acquired; since the loop only relies on fit and predict_proba, other probabilistic classifiers could be swapped in, as sketched after this list.
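
As one hedged example, a logistic regression model could be dropped in without changing anything else in the loop (LogisticRegression is an alternative choice here, not what the lesson uses):

from sklearn.linear_model import LogisticRegression

# Any classifier exposing fit and predict_proba can play the same role in the loop
model = LogisticRegression(max_iter=1000, random_state=42)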

4. Active Learning Loop with Uncertainty Sampling

  • The loop runs for 5 iterations, simulating 5 rounds of active learning;
  • In each iteration:
    • The model is trained on the currently labeled data;
    • Predictions (class probabilities) are made for all unlabeled samples;
    • Uncertainty sampling is used: the sample whose predicted probability is closest to 0.5 (the most uncertain prediction) is selected; a multi-class alternative is sketched after this list;
    • This most uncertain sample is 'queried' (its true label is revealed), added to the labeled pool, and removed from the unlabeled pool;
    • The queried label then takes effect when the model is retrained on the expanded labeled set at the start of the next iteration.
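
The |p − 0.5| rule only makes sense for binary problems. For more than two classes, common uncertainty scores such as least confidence or entropy work directly on the full predict_proba output; the sketch below assumes probs is the probability matrix computed inside the loop:

# Least confidence: pick the sample whose most likely class has the lowest probability
least_confidence = 1.0 - probs.max(axis=1)
query_idx = np.argmax(least_confidence)

# Entropy: pick the sample whose predicted distribution is closest to uniform
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # small epsilon avoids log(0)
query_idx = np.argmax(entropy)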

5. Tracking Model Accuracy

  • After each iteration, the model predicts labels for all 100 samples, including the ones it was trained on;
  • The resulting accuracy is printed, showing how performance tends to improve as informative samples are labeled; because this evaluation overlaps the training data, the number is optimistic and serves only to track progress in this toy setting (a held-out evaluation is sketched after this list);
  • The output demonstrates that even with a small number of labeled samples, active learning can quickly boost model accuracy by focusing on the most informative data points.
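
In a real project you would track accuracy on a held-out test set that the model can never query, rather than on the same pool it learns from. A minimal sketch of that change (train_test_split and the *_pool / *_test names are additions to the lesson code):

from sklearn.model_selection import train_test_split

# Keep a test set aside; only the pool takes part in the active learning loop
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ...run the same loop over X_pool / y_pool, then evaluate on unseen data
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {test_acc:.2f}")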

This simulation highlights the core advantage of active learning: efficiently improving a model with minimal labeled data by strategically selecting what to label next.
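
A simple way to see this advantage for yourself is to rerun the loop with a random query strategy and compare the accuracy curves; only the query step changes, as in the sketch below (it assumes unlabeled_idx is the current unlabeled pool):

# Random baseline: pick any unlabeled sample instead of the most uncertain one
query_idx = np.random.randint(len(unlabeled_idx))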


