Evaluating Deduplication Results

Evaluating deduplication results means checking how accurately your process finds and removes duplicates without deleting unique records. You use three metrics (their formulas follow the list):

  • Precision: the proportion of records flagged as duplicates that were truly duplicates. High precision means few false positives.
  • Recall: the proportion of all actual duplicates that were correctly identified and removed. High recall means few true duplicates were missed.
  • F1-score: the harmonic mean of precision and recall, giving you a single value to compare deduplication strategies.
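
In confusion-matrix terms, where TP counts duplicate pairs correctly flagged, FP counts unique pairs wrongly flagged, and FN counts duplicate pairs that were missed:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1-score = 2 × Precision × Recall / (Precision + Recall)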

Track counts before and after deduplication—like total records, detected duplicates, and true duplicates removed—to calculate these metrics.
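
The sketch below shows that bookkeeping in code. The counts are illustrative placeholders, not numbers from any real run:

# Illustrative counts from a hypothetical deduplication run
flagged = 120               # records your process flagged as duplicates
flagged_correctly = 105     # flagged records that really were duplicates (TP)
actual_duplicates = 130     # all true duplicates present in the data

tp = flagged_correctly
fp = flagged - flagged_correctly            # unique records wrongly flagged
fn = actual_duplicates - flagged_correctly  # duplicates that slipped through

precision = tp / (tp + fp)  # 105 / 120 = 0.875
recall = tp / (tp + fn)     # 105 / 130 ≈ 0.808
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")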

import pandas as pd
from difflib import SequenceMatcher
from sklearn.metrics import precision_score, recall_score, f1_score

# Two datasets from different sources
df_a = pd.DataFrame({
    "id_a": [1, 2, 3, 4],
    "name": ["Apple iPhone 14", "Samsung Galaxy S22", "Sony WH1000 XM5", "Dell Inspiron 15"],
    "price": [999, 899, 350, 700]
})

df_b = pd.DataFrame({
    "id_b": ["A", "B", "C", "D"],
    "name": ["Iphone 14", "Galaxy S-22", "Sony WH-1000XM5", "Inspiron 15 DELL"],
    "price": [995, 900, 349, 705]
})

# Ground truth for duplicates (1 = duplicate pair, 0 = not duplicate)
y_true = [1, 1, 1, 0]  # Last pair intentionally marked as non-duplicate

# Generate similarity features
def name_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def price_difference(a, b):
    return abs(a - b) / max(a, b)

pairs = []
for i in range(len(df_a)):
    sim_name = name_similarity(df_a.loc[i, "name"], df_b.loc[i, "name"])
    diff_price = price_difference(df_a.loc[i, "price"], df_b.loc[i, "price"])
    pairs.append([sim_name, diff_price])

pairs_df = pd.DataFrame(pairs, columns=["name_similarity", "price_diff"])

# Simple duplicate classification rule:
# high name similarity AND low price difference → duplicate
y_pred = []
for _, row in pairs_df.iterrows():
    if row["name_similarity"] > 0.75 and row["price_diff"] < 0.05:
        y_pred.append(1)
    else:
        y_pred.append(0)

# Metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Similarity pairs:")
print(pairs_df)
print("\nMetrics:")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
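
The 0.75 similarity cutoff is one point on a precision/recall trade-off curve. As a quick extension of the example above (not part of the original rule), you can sweep the threshold and watch the metrics move:

# Sweep the name-similarity threshold, reusing pairs_df and y_true from above
for threshold in [0.50, 0.65, 0.75, 0.85]:
    preds = [
        1 if row["name_similarity"] > threshold and row["price_diff"] < 0.05 else 0
        for _, row in pairs_df.iterrows()
    ]
    p = precision_score(y_true, preds, zero_division=0)
    r = recall_score(y_true, preds, zero_division=0)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")

Raising the threshold generally trades recall for precision: stricter matching flags fewer pairs, so false positives drop while missed duplicates rise.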

Which metric would be most important if you want to minimize false positives in deduplication?



Section 2. Chapter 2
