Classical Anonymization Techniques
Understanding how to protect sensitive information in datasets is a cornerstone of data privacy. Before the rise of differential privacy, several classical anonymization techniques were developed to reduce the risk of re-identifying individuals in shared data. The three most influential are k-anonymity, l-diversity, and t-closeness. Each aims to mask identities by modifying or grouping data, but they differ in approach and strength.
K-anonymity ensures that any record in a released dataset is indistinguishable from at least k-1 other records based on a set of identifying attributes, called quasi-identifiers. In other words, an attacker cannot confidently link a record to a unique individual if at least k records share the same quasi-identifier values.
L-diversity builds upon k-anonymity by ensuring that sensitive attributes within each group of indistinguishable records are diverse enough. This protects against situations where all records in a k-anonymous group have the same sensitive value, which would still reveal private information.
T-closeness further strengthens privacy by requiring that the distribution of sensitive attributes within each group is close to their distribution in the overall dataset. This prevents attackers from learning too much about sensitive values even within diverse groups.
The main differences between these techniques lie in the types of attacks they mitigate. K-anonymity prevents straightforward re-identification, l-diversity guards against attribute disclosure from homogeneous groups, and t-closeness limits information gain about sensitive attributes.
| Age | Zipcode | Disease |
|---|---|---|
| 25 | 12345 | Flu |
| 25 | 12345 | Cold |
| 25 | 12345 | Allergy |
Here, the quasi-identifiers Age and Zipcode are identical for all three records, so the table achieves 3-anonymity.
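This check can be automated. The sketch below uses a hypothetical helper, `is_k_anonymous` (not part of any standard library), that verifies every quasi-identifier group contains at least k records:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    # A table is k-anonymous when the smallest group sharing
    # the same quasi-identifier values has at least k records.
    return df.groupby(quasi_identifiers).size().min() >= k

data = pd.DataFrame({
    "Age": [25, 25, 25],
    "Zipcode": [12345, 12345, 12345],
    "Disease": ["Flu", "Cold", "Allergy"],
})

print(is_k_anonymous(data, ["Age", "Zipcode"], k=3))  # True
print(is_k_anonymous(data, ["Age", "Zipcode"], k=4))  # False
```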
| Age | Zipcode | Disease |
|---|---|---|
| 30 | 54321 | Cancer |
| 30 | 54321 | Flu |
| 30 | 54321 | Cold |
This group is not only 3-anonymous but also 3-diverse, as there are three different diseases in the group, reducing the risk of deducing the sensitive value.
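The distinct l-diversity condition above can be sketched the same way. The helper `is_l_diverse` is hypothetical; it implements the simplest variant, which requires at least l distinct sensitive values per group (other variants, such as entropy l-diversity, are stricter):

```python
import pandas as pd

def is_l_diverse(df, quasi_identifiers, sensitive, l):
    # Distinct l-diversity: every quasi-identifier group must
    # contain at least l different values of the sensitive attribute.
    return df.groupby(quasi_identifiers)[sensitive].nunique().min() >= l

data = pd.DataFrame({
    "Age": [30, 30, 30],
    "Zipcode": [54321, 54321, 54321],
    "Disease": ["Cancer", "Flu", "Cold"],
})

print(is_l_diverse(data, ["Age", "Zipcode"], "Disease", l=3))  # True
```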
| Age | Zipcode | Disease |
|---|---|---|
| 40 | 67890 | Allergy |
| 40 | 67890 | Allergy |
| 40 | 67890 | Cold |
If Allergy makes up 2/3 of this group but only 10% of the overall dataset, the group does not satisfy t-closeness, because the distribution of the sensitive value within the group is too different from its distribution in the dataset as a whole.
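This distance can be measured in several ways; the original t-closeness formulation uses the Earth Mover's Distance, but for categorical attributes the total variation distance is a simple stand-in. The helper `max_tvd` below is a hypothetical sketch under that assumption, reporting the worst group-level distance so it can be compared against a chosen threshold t:

```python
import pandas as pd

def max_tvd(df, quasi_identifiers, sensitive):
    # Total variation distance between each group's sensitive-value
    # distribution and the overall distribution; a group satisfies
    # t-closeness when its distance is at most t.
    overall = df[sensitive].value_counts(normalize=True)
    distances = []
    for _, group in df.groupby(quasi_identifiers):
        g = group[sensitive].value_counts(normalize=True)
        g = g.reindex(overall.index, fill_value=0.0)
        distances.append((g - overall).abs().sum() / 2)
    return max(distances)

data = pd.DataFrame({
    "Age": [40] * 3 + [50] * 3,
    "Zipcode": [67890] * 3 + [11111] * 3,
    "Disease": ["Allergy", "Allergy", "Cold", "Flu", "Flu", "Flu"],
})

# If the worst distance exceeds the chosen t, the release fails t-closeness.
print(max_tvd(data, ["Age", "Zipcode"], "Disease"))
```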
Classical anonymization techniques like k-anonymity, l-diversity, and t-closeness can be vulnerable to modern attacks, such as those exploiting background knowledge or linking with external datasets. These limitations motivate the development of stronger privacy models, including differential privacy.
```python
import pandas as pd

# Synthetic dataset
data = pd.DataFrame({
    "Age": [25, 25, 25, 30, 30, 30, 40, 40, 40],
    "Zipcode": [12345, 12345, 12345, 54321, 54321, 54321, 67890, 67890, 67890],
    "Disease": ["Flu", "Cold", "Allergy", "Cancer", "Flu", "Cold", "Allergy", "Allergy", "Cold"]
})

# Group data by quasi-identifiers for k-anonymity (k=3)
k_anonymous_groups = data.groupby(["Age", "Zipcode"])
for name, group in k_anonymous_groups:
    print(f"Group: Age={name[0]}, Zipcode={name[1]}")
    print(group, end="\n\n")
```
1. Which of the following best defines k-anonymity?
2. How does l-diversity differ from t-closeness?
3. Why might k-anonymity, l-diversity, and t-closeness fail to protect privacy in some cases?