Query-By-Committee (QBC)
Query-By-Committee (QBC) is an active learning strategy that leverages the collective wisdom of several models, called a committee, to identify which unlabeled samples are most informative. Instead of relying on a single model's uncertainty, QBC selects samples where the committee members most disagree, under the assumption that such disagreement indicates areas where the models are uncertain or lack sufficient information. The rationale is that by presenting these contentious samples to an oracle (such as a human annotator), you can quickly resolve the points of confusion and accelerate learning. Disagreement can be measured in several ways, but a common approach is to look at how committee members vote on the predicted label for each sample.
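A standard way to quantify this disagreement is vote entropy. For a sample x, let V(c) be the number of committee members that predict class c, and let C be the committee size; then

VE(x) = - Σ_c (V(c) / C) · log(V(c) / C)

A sample on which every member agrees has VE(x) = 0, while a sample whose votes are split evenly across classes maximizes VE(x). The example below trains a three-member committee and ranks the unlabeled pool by this measure.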
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from scipy.stats import entropy

# Generate a toy classification dataset
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=42)
X_train, X_pool, y_train, y_pool = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize committee of classifiers
committee = [
    RandomForestClassifier(n_estimators=10, random_state=0),
    LogisticRegression(max_iter=1000, random_state=1),
    SVC(probability=True, random_state=2)
]

# Train each classifier on the labeled training set
for clf in committee:
    clf.fit(X_train, y_train)

# Get committee predictions on the pool set
predictions = np.array([clf.predict(X_pool) for clf in committee])  # shape: (n_committee, n_samples)

# For each sample, count votes for each class
n_classes = len(np.unique(y))
vote_counts = np.zeros((X_pool.shape[0], n_classes))
for i in range(X_pool.shape[0]):
    for pred in predictions[:, i]:
        vote_counts[i, pred] += 1

# Calculate vote entropy for each sample (higher = more disagreement)
vote_probs = vote_counts / len(committee)
vote_entropies = entropy(vote_probs.T)

# Show top 5 samples with highest disagreement
top_indices = np.argsort(-vote_entropies)[:5]
for idx in top_indices:
    print(f"Sample {idx}: Vote Entropy = {vote_entropies[idx]:.3f}, Votes = {vote_counts[idx]}")
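In a full active-learning loop, the top-ranked sample would be sent to the oracle, labeled, moved from the pool into the training set, and the committee retrained before the next query. Below is a minimal sketch of one such iteration, reusing the variables defined above and simulating the oracle with the held-out y_pool labels (the bookkeeping details here are illustrative, not part of the original example):

# One QBC iteration: query the most contentious sample, label it, retrain
query_idx = int(np.argmax(vote_entropies))

# "Label" the queried sample (y_pool stands in for a human annotator)
X_train = np.vstack([X_train, X_pool[query_idx:query_idx + 1]])
y_train = np.append(y_train, y_pool[query_idx])

# Remove the queried sample from the unlabeled pool
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx)

# Retrain every committee member on the enlarged labeled set
for clf in committee:
    clf.fit(X_train, y_train)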
While QBC can provide a richer measure of uncertainty by harnessing diverse model perspectives, it comes at a computational cost. Training and maintaining multiple models increases resource usage, especially as the committee grows or as models become more complex. Striking a balance between committee diversity (which improves disagreement detection) and computational efficiency is crucial for practical active learning with QBC.
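One common way to keep that cost down is query-by-bagging: instead of maintaining heterogeneous model families, train several copies of a single cheap model on bootstrap resamples of the labeled data, so disagreement comes from data variation alone. A minimal sketch under that assumption, reusing the imports and data above (the committee size and base model are illustrative choices):

# Build a lightweight committee from bootstrap resamples (query-by-bagging)
rng = np.random.default_rng(42)
bagged_committee = []
for _ in range(5):
    # Resample the labeled set with replacement
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[idx], y_train[idx])
    bagged_committee.append(clf)

# Vote entropy is then computed exactly as before, just over these members
predictions = np.array([clf.predict(X_pool) for clf in bagged_committee])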
1. What is the main advantage of QBC over single-model uncertainty sampling?
2. Which metric can be used to quantify committee disagreement?