Learn Re-Identification Risks and Privacy Attacks | Foundations of Data Privacy
Data Privacy and Differential Privacy Fundamentals

Re-Identification Risks and Privacy Attacks

Re-identification is a significant risk in data privacy, especially when "anonymized" datasets are released for research, analysis, or public use. Even after direct identifiers such as names or social security numbers are removed, attackers can often combine seemingly innocuous attributes (such as ZIP code, birthdate, or gender) to single out individuals. None of these attributes is unique on its own, but their combination may match only one person in the dataset, which makes re-identification possible.
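A rough count helps show why combining attributes singles people out. The sketch below compares the number of possible (birthdate, gender) combinations with the population of one ZIP code. Every figure in it (the span of birth years, the number of gender values, the ZIP code population) is an illustrative assumption, not a measured statistic.

```python
# Back-of-the-envelope estimate; all figures are illustrative assumptions.
DAYS_PER_YEAR = 365
BIRTH_YEARS = 80          # assumed span of birth years in the population
GENDER_VALUES = 2         # assumed number of recorded gender values
ZIP_POPULATION = 25_000   # assumed number of residents in one ZIP code

# Number of distinct (birthdate, gender) combinations within one ZIP code
cells = DAYS_PER_YEAR * BIRTH_YEARS * GENDER_VALUES

# Average number of residents sharing any one combination
avg_people_per_cell = ZIP_POPULATION / cells

print(f"Possible combinations per ZIP code: {cells}")
print(f"Average people per combination: {avg_people_per_cell:.2f}")
```

Under these assumptions there are more possible combinations than residents, so many residents would be the only person with their combination.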

Attackers exploit auxiliary information: data from external sources that can be cross-referenced with the anonymized dataset. For example, public voter records or social media profiles may contain the same attributes alongside names, so matching them against the anonymized data can reveal who a record belongs to. This kind of cross-referencing is a privacy attack (commonly called a linkage attack), and it demonstrates that simply removing direct identifiers is not enough to guarantee privacy.
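The cross-referencing step can be sketched as a join on the shared attributes. This is a minimal illustration with synthetic records. The `diagnosis` column, the names, and the `voter_list` table are hypothetical and not taken from any real dataset.

```python
import pandas as pd

# "Anonymized" release: names removed, quasi-identifiers kept.
# All records are synthetic; `diagnosis` is a hypothetical sensitive attribute.
released = pd.DataFrame({
    "zip_code": ["02138", "02139", "02138"],
    "birthdate": ["1980-05-12", "1975-09-23", "1990-11-02"],
    "gender": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "migraine"],
})

# Hypothetical auxiliary source (for example a public voter list) with names.
voter_list = pd.DataFrame({
    "name": ["Alice Example", "Bob Example"],
    "zip_code": ["02138", "02139"],
    "birthdate": ["1980-05-12", "1975-09-23"],
    "gender": ["F", "M"],
})

quasi_identifiers = ["zip_code", "birthdate", "gender"]

# Join the two tables on the shared quasi-identifiers.
# Each matching row attaches a name to a supposedly anonymous record.
linked = released.merge(voter_list, on=quasi_identifiers, how="inner")
print(linked[["name", "diagnosis"]])
```

Records in the release that have no counterpart in the auxiliary table stay unlinked, so the attacker's success depends on how much auxiliary data is available.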

Definition

Quasi-identifiers are sets of attributes in a dataset that, while not unique identifiers by themselves, can be combined with external information to re-identify individuals. Examples include combinations like ZIP code, birthdate, and gender.

```python
import pandas as pd

# Create a synthetic dataset
data = {
    "zip_code": ["02138", "02139", "02138", "02140"],
    "birthdate": ["1980-05-12", "1975-09-23", "1990-11-02", "1980-05-12"],
    "gender": ["F", "M", "F", "M"]
}
df = pd.DataFrame(data)

# Check for unique combinations of quasi-identifiers
unique_rows = df.groupby(["zip_code", "birthdate"]).size().reset_index(name='count')
print("Combinations of ZIP code and birthdate:")
print(unique_rows)
```
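To compare how identifying different attribute sets are, one option is to measure the share of rows whose combination of values occurs exactly once. The sketch below reuses the lesson's synthetic dataset. The helper `unique_fraction` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Same synthetic dataset as in the lesson's example.
df = pd.DataFrame({
    "zip_code": ["02138", "02139", "02138", "02140"],
    "birthdate": ["1980-05-12", "1975-09-23", "1990-11-02", "1980-05-12"],
    "gender": ["F", "M", "F", "M"],
})

def unique_fraction(frame, columns):
    """Return the share of rows whose combination of `columns` occurs only once."""
    # duplicated(keep=False) marks every row that shares its values with another row
    shared = frame.duplicated(subset=columns, keep=False)
    return (~shared).mean()

for cols in (["zip_code"], ["birthdate"], ["zip_code", "birthdate"]):
    print(cols, unique_fraction(df, cols))
```

In this toy data, ZIP code alone and birthdate alone each leave half of the rows unique, while the two together make every row unique.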

1. Which of the following best explains why re-identification is possible in anonymized datasets?

2. How does auxiliary information contribute to privacy attacks?

Section 1. Chapter 1