Margin And Entropy Sampling
Margin sampling and entropy sampling are two widely used query strategies in active learning, both designed to identify the most informative unlabeled samples for labeling. Margin sampling focuses on the difference between the highest and the second-highest predicted class probabilities for each sample. The smaller this margin, the less confident the model is about its prediction, signaling a more uncertain and potentially informative example. In contrast, entropy sampling quantifies uncertainty using the entropy of the predicted class probability distribution for each sample. Entropy measures the amount of uncertainty or randomness; higher entropy values indicate that the model is less certain about its prediction across all possible classes, rather than just the top two.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Create a more complex, noisy dataset
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=3,
    n_redundant=1,
    n_clusters_per_class=2,
    flip_y=0.15,     # adds label noise → much more uncertainty
    class_sep=0.6,   # higher overlap between classes
    random_state=42
)

# Train a weaker classifier to increase uncertainty
clf = LogisticRegression(max_iter=2000)
clf.fit(X, y)

# Take a batch from the dataset
probs = clf.predict_proba(X[:5])

# Margin sampling
margins = []
for prob in probs:
    sorted_probs = np.sort(prob)[::-1]
    margin = sorted_probs[0] - sorted_probs[1]
    margins.append(margin)

# Entropy sampling
entropies = []
for prob in probs:
    entropy = -np.sum(prob * np.log(prob + 1e-12))
    entropies.append(entropy)

print("Class probabilities for each sample:")
print(probs.round(4))
print("\nMargin values (smaller = more uncertain):")
print([round(m, 4) for m in margins])
print("\nEntropy values (higher = more uncertain):")
print([round(e, 4) for e in entropies])
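In an active learning loop, these scores drive the query step: margin sampling picks the sample with the smallest margin, while entropy sampling picks the one with the largest entropy. The sketch below is a minimal illustration, not part of the lesson's code: it reuses the margins and entropies lists computed above and adds a small hand-crafted three-class example (the probability values are made up for illustration) where the two strategies would disagree.

# Query step: pick the most uncertain sample under each strategy
# (reuses the `margins` and `entropies` lists computed above)
query_by_margin = int(np.argmin(margins))     # smallest margin = most uncertain
query_by_entropy = int(np.argmax(entropies))  # largest entropy = most uncertain

print("Margin sampling would query sample index:", query_by_margin)
print("Entropy sampling would query sample index:", query_by_entropy)

# Hand-crafted three-class example showing the strategies can disagree:
# sample A is a near tie between the top two classes, sample B spreads
# probability across all three classes.
sample_a = np.array([0.50, 0.49, 0.01])  # margin 0.01, entropy ≈ 0.74
sample_b = np.array([0.40, 0.30, 0.30])  # margin 0.10, entropy ≈ 1.09

for name, p in [("A", sample_a), ("B", sample_b)]:
    top_two = np.sort(p)[::-1][:2]
    margin = top_two[0] - top_two[1]
    entropy = -np.sum(p * np.log(p + 1e-12))
    print(f"Sample {name}: margin={margin:.2f}, entropy={entropy:.2f}")

# Margin sampling would query A (smaller margin), while entropy sampling
# would query B (higher entropy), because entropy accounts for all classes
# rather than just the top two.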