Learning Efficiency Curves

Understanding how efficiently an Active Learning (AL) system improves with more labeled data is crucial for evaluating its effectiveness. Learning curves provide a visual tool for this purpose: they plot model accuracy (or another performance metric) against the number of labeled samples acquired during AL iterations. These curves help you see how quickly your model benefits from new information, and how much data is needed to reach a desired level of performance. In AL, the goal is to achieve high accuracy with as few labeled samples as possible, so the shape and steepness of your learning curve can reveal how well your sampling strategy is working.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Simulate a pool of unlabeled data
X, y = make_classification(n_samples=1200, n_features=20, n_informative=15, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Start with a small labeled set (seed NumPy so the random draws are reproducible)
np.random.seed(42)
initial_idx = np.random.choice(len(X_train), size=20, replace=False)
labeled_idx = list(initial_idx)
unlabeled_idx = list(set(range(len(X_train))) - set(labeled_idx))

accuracies = []
labeled_set_sizes = []

# Simulate AL iterations
for i in range(10):
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train[labeled_idx], y_train[labeled_idx])
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    labeled_set_sizes.append(len(labeled_idx))
    # Add 20 new labels. A real AL strategy would query the most uncertain samples;
    # random selection stands in for it here, so this curve is a random-sampling baseline.
    if len(unlabeled_idx) >= 20:
        new_samples = np.random.choice(unlabeled_idx, size=20, replace=False)
        labeled_idx.extend(new_samples)
        unlabeled_idx = list(set(unlabeled_idx) - set(new_samples))

plt.plot(labeled_set_sizes, accuracies, marker='o')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Accuracy')
plt.title('Learning Curve: Accuracy vs. Labeled Set Size')
plt.grid(True)
plt.show()
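
Two numbers are often read off a learning curve: the number of labels needed to reach a target accuracy, and the area under the curve as a single summary of label efficiency. The sketch below shows one way to compute both. The helper names `labels_to_reach` and `area_under_learning_curve` are hypothetical, and the input values are made up for illustration; they are not results from the example above.

```python
import numpy as np

def labels_to_reach(sizes, accuracies, target):
    """Return the smallest labeled-set size whose accuracy meets the target, or None."""
    for n, acc in zip(sizes, accuracies):
        if acc >= target:
            return n
    return None

def area_under_learning_curve(sizes, accuracies):
    """Trapezoidal area under the curve, divided by the x-range.

    The normalization keeps the result on the same scale as the metric,
    so it can be read as the average accuracy over the labeling budget.
    """
    sizes = np.asarray(sizes, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    widths = np.diff(sizes)
    heights = (accuracies[:-1] + accuracies[1:]) / 2
    return float(np.sum(widths * heights) / (sizes[-1] - sizes[0]))

# Illustrative values only (assumed for this sketch, not measured)
example_sizes = [20, 40, 60, 80, 100]
example_accs = [0.60, 0.70, 0.75, 0.78, 0.80]

print(labels_to_reach(example_sizes, example_accs, target=0.75))
print(area_under_learning_curve(example_sizes, example_accs))
```

In your own experiment, you would pass `labeled_set_sizes` and `accuracies` from the loop above in place of the illustrative lists. A higher normalized area means the model spent more of its labeling budget at high accuracy.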

Note

A learning curve in Active Learning shows how efficiently a model improves as more labeled data is added. A steep curve means rapid accuracy gains from each new label, which is ideal. A flat curve suggests new labels add little value, either because the model has saturated or because the strategy is selecting uninformative samples. Comparing the curves of different strategies on the same axes shows which one reaches high accuracy with fewer labels.
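
To make the comparison concrete, here is a minimal sketch that runs the same loop twice, once with random selection and once with least-confidence uncertainty sampling, and collects one curve per strategy. The function `run_active_learning`, its parameters, and the strategy names are assumptions introduced for this illustration. Which strategy comes out ahead depends on the data and the model, so treat the output as something to inspect rather than a fixed result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_active_learning(X_pool, y_pool, X_test, y_test, strategy,
                        n_rounds=10, batch=20, seed=0):
    """Return (labeled_set_sizes, accuracies) for one query strategy."""
    rng = np.random.default_rng(seed)
    labeled = rng.choice(len(X_pool), size=batch, replace=False).tolist()
    unlabeled = sorted(set(range(len(X_pool))) - set(labeled))
    sizes, accs = [], []
    for _ in range(n_rounds):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X_pool[labeled], y_pool[labeled])
        sizes.append(len(labeled))
        accs.append(accuracy_score(y_test, clf.predict(X_test)))
        if len(unlabeled) < batch:
            break
        if strategy == "uncertainty":
            # Least confidence: query samples whose top class probability is lowest
            confidence = clf.predict_proba(X_pool[unlabeled]).max(axis=1)
            picked = [unlabeled[i] for i in np.argsort(confidence)[:batch]]
        else:
            # Random baseline: query samples uniformly at random
            picked = rng.choice(unlabeled, size=batch, replace=False).tolist()
        labeled.extend(picked)
        picked_set = set(picked)
        unlabeled = [i for i in unlabeled if i not in picked_set]
    return sizes, accs

X, y = make_classification(n_samples=1200, n_features=20, n_informative=15,
                           n_classes=2, random_state=42)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# One (sizes, accuracies) curve per strategy, same seed and same starting labels
curves = {}
for strategy in ("random", "uncertainty"):
    curves[strategy] = run_active_learning(X_pool, y_pool, X_test, y_test, strategy)
    sizes, accs = curves[strategy]
    print(strategy, [round(a, 3) for a in accs])
```

Both runs start from the same 20 labeled samples, so the first point of each curve is identical and any later gap comes from the query strategy alone. Plotting both curves on one set of axes, with `plt.plot(sizes, accs, label=strategy)` for each entry, gives the side-by-side view described above.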

1. What does a steeper learning curve indicate in the context of Active Learning?

2. Which metric is most relevant for comparing AL strategies?
