Density-Weighted Sampling
Density-weighted sampling is a strategy in active learning that helps you select the most valuable data points for labeling. Unlike pure uncertainty sampling, which focuses only on how uncertain the model is about each sample, density-weighted sampling also considers how representative a sample is within the data distribution. The intuition is simple: you want to prioritize not just uncertain points, but also those that are typical of the dataset, avoiding rare outliers that may not help the model generalize. By combining informativeness (such as uncertainty) with sample density, you can focus your labeling effort on data points that both challenge the model and represent common patterns in your data.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate a synthetic dataset
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Train a simple classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X, y)

# Compute uncertainty: use predicted class probabilities
probs = clf.predict_proba(X)
uncertainty = 1 - np.max(probs, axis=1)  # Least confident score

# Estimate sample density using k-nearest neighbors
k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1)  # +1 because the point itself is included
nbrs.fit(X)
distances, _ = nbrs.kneighbors(X)
density = 1 / (np.mean(distances[:, 1:], axis=1) + 1e-10)  # Avoid division by zero

# Combine uncertainty and density (simple product)
density_weighted_score = uncertainty * density

# Select the top 10 samples by density-weighted score
top_indices = np.argsort(-density_weighted_score)[:10]
print("Indices of top 10 density-weighted samples:", top_indices)
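The simple product above weights uncertainty and density equally. A common generalization is to raise the density term to an exponent that controls the trade-off between the two criteria. The sketch below is one way to express this; the function name and the toy values are illustrative, not part of the example above.

```python
import numpy as np

def density_weighted_score(uncertainty, density, beta=1.0):
    """Combine uncertainty with density raised to an exponent beta.

    beta > 1 emphasizes representativeness (dense regions),
    beta < 1 emphasizes uncertainty, and beta = 0 recovers
    pure uncertainty sampling.
    """
    return uncertainty * np.power(density, beta)

# Toy example: point 0 is a highly uncertain outlier (low density),
# point 1 is moderately uncertain but sits in a dense region.
uncertainty = np.array([0.9, 0.5])
density = np.array([0.1, 0.8])

print(density_weighted_score(uncertainty, density, beta=0.0))  # ranks the outlier first
print(density_weighted_score(uncertainty, density, beta=2.0))  # ranks the dense point first
```

Tuning this exponent on a validation set lets you decide how strongly outliers should be penalized for a given dataset.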