Density-Weighted Sampling
Density-weighted sampling is a strategy in active learning that helps you select the most valuable data points for labeling. Unlike pure uncertainty sampling, which focuses only on how uncertain the model is about each sample, density-weighted sampling also considers how representative a sample is within the data distribution. The intuition is simple: you want to prioritize not just uncertain points, but also those that are typical of the dataset, avoiding rare outliers that may not help the model generalize. By combining informativeness (such as uncertainty) with sample density, you can focus your labeling effort on data points that both challenge the model and represent common patterns in your data.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate a synthetic dataset
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Train a simple classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X, y)

# Compute uncertainty: use predicted class probabilities
probs = clf.predict_proba(X)
uncertainty = 1 - np.max(probs, axis=1)  # Least confident score

# Estimate sample density using k-nearest neighbors
k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1)  # +1 because the point itself is included
nbrs.fit(X)
distances, _ = nbrs.kneighbors(X)
density = 1 / (np.mean(distances[:, 1:], axis=1) + 1e-10)  # Avoid division by zero

# Combine uncertainty and density (simple product)
density_weighted_score = uncertainty * density

# Select the top 10 samples by density-weighted score
top_indices = np.argsort(-density_weighted_score)[:10]
print("Indices of top 10 density-weighted samples:", top_indices)
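The plain product above gives uncertainty and density equal weight. A common refinement, drawn from the information-density framework in the active learning literature, is to raise the density term to an exponent beta that tunes how much representativeness matters: beta = 1 recovers the product above, while beta = 0 reduces to pure uncertainty sampling. The sketch below is one possible variant, not part of any library API; it reuses the uncertainty and density arrays from the previous snippet, and the beta value and the min-max normalization are illustrative choices.

# A sketch of beta-weighted density sampling, reusing `uncertainty` and
# `density` from above; `beta` is an illustrative tuning parameter
beta = 0.5  # 0 -> pure uncertainty sampling, 1 -> the plain product above

# Normalize density to [0, 1] so beta adjusts the trade-off, not the scale
density_norm = (density - density.min()) / (density.max() - density.min() + 1e-10)

weighted_score = uncertainty * density_norm ** beta
top_indices_beta = np.argsort(-weighted_score)[:10]
print("Top 10 samples with beta =", beta, ":", top_indices_beta)

In practice you would tune beta on a validation set or sweep a few values: larger beta pushes selection toward dense, typical regions, while smaller beta lets the most uncertain points dominate regardless of how isolated they are.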