Simulating an Active Learning Cycle
To understand how an Active Learning (AL) cycle works in practice, you will walk through a minimal simulation using a synthetic dataset. The setup includes a small pool of data points, where only a few are initially labeled. In each iteration of the AL cycle, you will use uncertainty sampling to select the most informative unlabeled point, label it, and update your model. This process repeats for several rounds, demonstrating how the model improves as more data is selectively labeled.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Split into initial labeled and unlabeled pools
initial_idx = np.random.choice(range(100), size=5, replace=False)
labeled_idx = list(initial_idx)
unlabeled_idx = list(set(range(100)) - set(labeled_idx))

# Initialize model
model = RandomForestClassifier(random_state=42)

for iteration in range(5):
    # Train on current labeled set
    model.fit(X[labeled_idx], y[labeled_idx])

    # Predict probabilities on the unlabeled pool
    probs = model.predict_proba(X[unlabeled_idx])

    # Use uncertainty sampling: select sample with probability closest to 0.5
    uncertainty = np.abs(probs[:, 1] - 0.5)
    query_idx = np.argmin(uncertainty)

    # Add queried sample to labeled pool
    new_label_idx = unlabeled_idx[query_idx]
    labeled_idx.append(new_label_idx)
    unlabeled_idx.remove(new_label_idx)

    # Print current progress
    y_pred = model.predict(X)
    acc = accuracy_score(y, y_pred)
    print(f"Iteration {iteration+1}: Labeled samples = {len(labeled_idx)}, Accuracy = {acc:.2f}")
```
This code demonstrates a full active learning cycle using a synthetic dataset and uncertainty sampling.
1. Generating the Synthetic Dataset
- The code uses make_classification from scikit-learn to create a simple binary classification dataset with 100 samples and 2 features;
- All features are informative, and there is no redundant information;
- The dataset is reproducible thanks to the fixed random_state parameter (a quick sanity check on the generated data is sketched below).
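To verify what make_classification produced, you can inspect the array shapes and the class balance. A minimal standalone sketch that regenerates the same dataset with the same random_state:

```python
import numpy as np
from sklearn.datasets import make_classification

# Regenerate the same dataset used in the main example
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

print(X.shape)         # (100, 2): 100 samples, 2 features
print(np.bincount(y))  # per-class counts; roughly balanced by default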
2. Splitting into Labeled and Unlabeled Pools
- A small subset of 5 samples is randomly selected as the initial labeled pool;
- The remaining 95 samples form the unlabeled pool;
- These pools simulate a real-world scenario where only a handful of data points are initially labeled (a fully reproducible version of this split is sketched below).
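Note that the np.random.choice call in the main example is not seeded, so the initial labeled pool differs between runs. A minimal sketch of a reproducible split; the seed value and the class-balance guard are illustrative additions, not part of the original code:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

rng = np.random.default_rng(42)  # arbitrary seed, for run-to-run reproducibility

initial_idx = rng.choice(100, size=5, replace=False)
# Re-draw if all 5 initial samples share one class; otherwise predict_proba
# would return a single column and probs[:, 1] in the loop would fail
while len(set(y[initial_idx])) < 2:
    initial_idx = rng.choice(100, size=5, replace=False)

labeled_idx = list(initial_idx)
unlabeled_idx = [i for i in range(100) if i not in set(labeled_idx)]
```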
3. Model Initialization
- A RandomForestClassifier is created for binary classification;
- The model will be retrained in each active learning iteration as new labels are acquired (a sketch of swapping in a different classifier follows below).
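Because the query strategy only relies on predict_proba, the random forest is not essential; any scikit-learn classifier that exposes class probabilities can be dropped into the same loop. A minimal sketch using LogisticRegression as an illustrative substitute (this choice is not part of the original example):

```python
from sklearn.linear_model import LogisticRegression

# Any probabilistic classifier works with the same uncertainty-sampling loop;
# logistic regression is a fast alternative that often produces
# reasonably calibrated probabilities on simple 2D data
model = LogisticRegression(random_state=42, max_iter=1000)
```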
4. Active Learning Loop with Uncertainty Sampling
- The loop runs for 5 iterations, simulating 5 rounds of active learning;
- In each iteration:
- The model is trained on the currently labeled data;
- Predictions (class probabilities) are made for all unlabeled samples;
- Uncertainty sampling is used: the sample whose predicted probability is closest to 0.5 (the most uncertain prediction) is selected;
- This most uncertain sample is 'queried' (its true label is revealed), added to the labeled pool, and removed from the unlabeled pool;
- The model is retrained with the expanded labeled set (query strategies that generalize beyond binary problems are sketched after this list).
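The |p - 0.5| trick works only for binary classification. Two standard generalizations are margin sampling (smallest gap between the top two class probabilities) and entropy sampling (highest predictive entropy); in the binary case both reduce to the same choice as the 0.5 rule. A minimal sketch, assuming probs is the (n_samples, n_classes) array returned by predict_proba:

```python
import numpy as np

def margin_query(probs):
    # Smallest gap between the two most probable classes = most uncertain
    top_two = np.sort(probs, axis=1)[:, -2:]
    margins = top_two[:, 1] - top_two[:, 0]
    return int(np.argmin(margins))

def entropy_query(probs):
    # Highest predictive entropy = most uncertain
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))
```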
5. Tracking Model Accuracy
- After each iteration, the model predicts labels for all samples;
- The current accuracy is calculated and printed, showing how performance improves as new, informative samples are labeled;
- The output demonstrates that even with a small number of labeled samples, active learning can quickly boost model accuracy by focusing on the most informative data points (a held-out evaluation variant is sketched below).
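One caveat: the accuracy above is computed over all 100 samples, including the points the model was trained on, which can make the numbers look slightly optimistic. A minimal variant that scores each round on a held-out test split the model never trains on (the 30% split size and the seeds are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Hold out 30% of the data purely for evaluation
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
labeled_idx = list(rng.choice(len(X_pool), size=5, replace=False))
while len(set(y_pool[labeled_idx])) < 2:  # ensure both classes are present
    labeled_idx = list(rng.choice(len(X_pool), size=5, replace=False))
unlabeled_idx = [i for i in range(len(X_pool)) if i not in set(labeled_idx)]

model = RandomForestClassifier(random_state=42)
for iteration in range(5):
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])
    probs = model.predict_proba(X_pool[unlabeled_idx])
    query_idx = int(np.argmin(np.abs(probs[:, 1] - 0.5)))
    labeled_idx.append(unlabeled_idx.pop(query_idx))
    # Score only on data the model never saw during training
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Iteration {iteration+1}: held-out accuracy = {acc:.2f}")
```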
This simulation highlights the core advantage of active learning: efficiently improving a model with minimal labeled data by strategically selecting what to label next.
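To check that claim empirically, you can run the same loop twice with the same labeling budget, once querying the most uncertain point and once querying a random point, and compare the final accuracies. A minimal sketch of such a comparison (the helper run_cycle and its defaults are illustrative, and on a dataset this small the gap between the two strategies may be modest):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

def run_cycle(strategy, n_rounds=20, seed=0):
    """Run one active learning cycle; strategy is 'uncertainty' or 'random'."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(100, size=5, replace=False))
    while len(set(y[labeled])) < 2:  # ensure both classes in the initial pool
        labeled = list(rng.choice(100, size=5, replace=False))
    unlabeled = [i for i in range(100) if i not in set(labeled)]
    model = RandomForestClassifier(random_state=42)
    for _ in range(n_rounds):
        model.fit(X[labeled], y[labeled])
        if strategy == "uncertainty":
            probs = model.predict_proba(X[unlabeled])
            pick = int(np.argmin(np.abs(probs[:, 1] - 0.5)))
        else:
            pick = int(rng.integers(len(unlabeled)))
        labeled.append(unlabeled.pop(pick))
    model.fit(X[labeled], y[labeled])
    return accuracy_score(y, model.predict(X))

print(f"Uncertainty sampling: {run_cycle('uncertainty'):.2f}")
print(f"Random sampling:      {run_cycle('random'):.2f}")
```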