Query-By-Committee (QBC)
Query-By-Committee (QBC) is an active learning strategy that leverages the collective wisdom of several models, called a committee, to identify which unlabeled samples are most informative. Instead of relying on a single model's uncertainty, QBC selects samples where the committee members most disagree, under the assumption that such disagreement indicates areas where the models are uncertain or lack sufficient information. The rationale is that by presenting these contentious samples to an oracle (such as a human annotator), you can quickly resolve the points of confusion and accelerate learning. Disagreement can be measured in several ways, but a common approach is to look at how committee members vote on the predicted label for each sample.
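A standard way to quantify this disagreement is vote entropy. For a sample x, let V(c) be the number of committee members that predict class c, and let C be the committee size; then

VE(x) = - Σ_c (V(c) / C) · log(V(c) / C)

A sample on which every member agrees has VE(x) = 0, while a sample whose votes are split evenly across classes maximizes VE(x). The example below trains a three-member committee and ranks the unlabeled pool by this measure.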
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from scipy.stats import entropy

# Generate a toy classification dataset
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=42)
X_train, X_pool, y_train, y_pool = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize committee of classifiers
committee = [
    RandomForestClassifier(n_estimators=10, random_state=0),
    LogisticRegression(max_iter=1000, random_state=1),
    SVC(probability=True, random_state=2)
]

# Train each classifier on the labeled training set
for clf in committee:
    clf.fit(X_train, y_train)

# Get committee predictions on the pool set
predictions = np.array([clf.predict(X_pool) for clf in committee])  # shape: (n_committee, n_samples)

# For each sample, count votes for each class
n_classes = len(np.unique(y))
vote_counts = np.zeros((X_pool.shape[0], n_classes))
for i in range(X_pool.shape[0]):
    for pred in predictions[:, i]:
        vote_counts[i, pred] += 1

# Calculate vote entropy for each sample (higher = more disagreement)
vote_probs = vote_counts / len(committee)
vote_entropies = entropy(vote_probs.T)

# Show top 5 samples with highest disagreement
top_indices = np.argsort(-vote_entropies)[:5]
for idx in top_indices:
    print(f"Sample {idx}: Vote Entropy = {vote_entropies[idx]:.3f}, Votes = {vote_counts[idx]}")
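In a full active-learning loop, the top-ranked sample would be sent to the oracle, labeled, moved from the pool into the training set, and the committee retrained before the next query. Below is a minimal sketch of one such iteration, reusing the variables defined above and simulating the oracle with the held-out y_pool labels (the bookkeeping details here are illustrative, not part of the original example):

# One QBC iteration: query the most contentious sample, label it, retrain
query_idx = int(np.argmax(vote_entropies))

# "Label" the queried sample (y_pool stands in for a human annotator)
X_train = np.vstack([X_train, X_pool[query_idx:query_idx + 1]])
y_train = np.append(y_train, y_pool[query_idx])

# Remove the queried sample from the unlabeled pool
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx)

# Retrain every committee member on the enlarged labeled set
for clf in committee:
    clf.fit(X_train, y_train)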
While QBC can provide a richer measure of uncertainty by harnessing diverse model perspectives, it comes at a computational cost. Training and maintaining multiple models increases resource usage, especially as the committee grows or as models become more complex. Striking a balance between committee diversity (which improves disagreement detection) and computational efficiency is crucial for practical active learning with QBC.
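One common way to keep that cost down is query-by-bagging: instead of maintaining heterogeneous model families, train several copies of a single cheap model on bootstrap resamples of the labeled data, so disagreement comes from data variation alone. A minimal sketch under that assumption, reusing the imports and data above (the committee size and base model are illustrative choices):

# Build a lightweight committee from bootstrap resamples (query-by-bagging)
rng = np.random.default_rng(42)
bagged_committee = []
for _ in range(5):
    # Resample the labeled set with replacement
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[idx], y_train[idx])
    bagged_committee.append(clf)

# Vote entropy is then computed exactly as before, just over these members
predictions = np.array([clf.predict(X_pool) for clf in bagged_committee])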
1. What is the main advantage of QBC over single-model uncertainty sampling?
2. Which metric can be used to quantify committee disagreement?