K-anonymity
In this section, we introduce the concepts of de-identification, re-identification, quasi-identifier, and K-anonymity.
De-identification
The table below is a fictional healthcare database, containing the name, data of birth, ZIP code, gender and disease of hospital patient. This example is directly taken from the paper “Data Privacy: Definitions and Technique; by Sabrina De Capitani di Vimercati, Sara Foresti, Giovanni Livraga, Pierangela Samarati”, with randomly generated first and last names.
Table 1
Name | DoB | ZIP code | Gender | Disease |
---|---|---|---|---|
Josef Wormley | 02/09/1970 | 94152 | M | Hepatitis |
Sonia Soliman | 20/09/1970 | 94143 | F | Cardiomyopathy |
Gayla Phong | 12/09/1970 | 94148 | F | Eczema |
Del Dear | 05/09/1970 | 94155 | M | Pneumonia |
Kacie Michelsen | 01/08/1960 | 94154 | F | Stroke |
Bambi Eichorn | 02/08/1960 | 94153 | F | Stroke |
Gary Hundley | 10/08/1960 | 94140 | M | Stroke |
Seymour Simons | 20/08/1960 | 94141 | M | Stroke |
Ilary Frisby | 07/08/1970 | 94141 | F | High Cholesterol |
Brandy Dusek | 05/08/1970 | 94142 | F | Erythema |
Kurtis Osborne | 09/07/1958 | 94232 | M | Diabetes |
Emery Yearta | 25/08/1970 | 94153 | M | High Cholesterol |
Miguel Voss | 30/08/1970 | 94156 | M | Angina Pectoris |
Derrick Vanzant | 02/09/1960 | 94147 | M | Hepatitis |
Alexandre Taggart | 05/09/1960 | 94145 | M | Flu |
Sarina Keim | 10/09/1960 | 94158 | F | Angina Pectoris |
Idell Shaw | 30/09/1960 | 94159 | F | Cardiomyopathy |
This table clearly contains personal and confidential information, and the patients concerned have the right for this information not to be disclosed publicly. On the other hand, analysing this data might help derive some correlation between some factors (such as a age, gender, or location) and disease, leading to a better understanding of these diseases, and hopefully improving treatment.
For instance, Covid-19 seems to have a different death rate for men and women, and understanding if this difference is biological or social could help treat it or mitigate its impact.
Of course, the data controller can de-identify the data before releasing it, by removing the attributes that uniquely identify a respondent. In this case, the name in the table could be removed, keeping only the date of birth, the ZIP code and the gender.
Re-identification with quasi-identifiers
However, as demonstrated in the Netflix case mentioned before, it might be possible to re-identify data tuples, i.e., to find the identity corresponding to a data tuple using information attributes which, on their own, would not be enough. Such attributes are called quasi-identifiers. For instance, a seminal paper by Sweeney showed that 87% of the population in the United States have a unique combination based only on 5-digit ZIP, gender and date of birth.
In other words, if you assume that an attacker knows the ZIP code, gender and date of birth of the whole US population, then even if one removes the names of every single patient above, the attacker would be able to re-identify 87%.
For instance, consider the attacker has the following information about the population.
Name | DoB | Gender |
---|---|---|
Sonia Soliman | 20/09/1970 | F |
Bambi Eichorn | 02/08/1960 | F |
Josef Wormley | 02/09/1970 | M |
Emery Yearta | 25/08/1970 | M |
Miguel Voss | 30/08/1970 | M |
Kacie Michelsen | 01/08/1960 | F |
If we were to release the following de-identified Table 2, could you map each identifier with a name? Note that we use arbitrary disease names here.
Table 2
ID | DoB | Gender | Disease |
---|---|---|---|
id1 | 02/09/1970 | M | A |
id2 | 20/09/1970 | F | B |
id3 | 01/08/1960 | F | C |
id4 | 02/08/1960 | F | D |
id5 | 25/08/1970 | M | E |
id6 | 30/08/1970 | M | F |
Since it is possible to identify each individual with a unique name, we now decide to remove some of the information. In Table 3, we remove the day and the month of the date of birth, as well as the gender. For each id in Table 3, which name could it correspond to?
Table 3
Id | DoB | Gender | Disease |
---|---|---|---|
id1 | */1970 | * | A |
id2 | */1970 | * | B |
id3 | */1960 | * | C |
id4 | */1960 | * | D |
id5 | */1970 | * | E |
id6 | */1970 | * | F |
We do something similar in Table 4, by removing the date of birth, but keeping the gender. For each id in Table 3, which name could it correspond to?
Table 4
Id | DoB | Gender | Disease |
---|---|---|---|
id1 | * | M | A |
id2 | * | F | B |
id3 | * | F | C |
id4 | * | F | D |
id5 | * | M | E |
id6 | * | M | F |
K-anonymity
This leads us to the definition of k-anonymity: A table is k-anonymous if every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents. If you can re-identify exactly each individual, then k=1; if you know that each individual is in a group of two tuples, then k=2, etc.
In order to calculate the k-anonymity of a table, the easiest is to associate each possible combination of quasi-identifiers values present in the table with a specific class. Different tuples with the same combination of quasi-identifiers will therefore be associated with the same class. The value for k is cardinality of the smallest class (i.e., the smallest number of tuples that have the same combination of quasi-identifiers).
We provide below the identification of all the relevant classes for the tables 2-4 presented above, as well as the resulting k-anonymity.
Table 2 with classes
ID | DoB | Gender | Disease | Class |
---|---|---|---|---|
id1 | 02/09/1970 | M | A | 0 |
id2 | 20/09/1970 | F | B | 1 |
id3 | 01/08/1960 | F | C | 2 |
id4 | 02/08/1960 | F | D | 3 |
id5 | 25/08/1970 | M | E | 4 |
id6 | 30/08/1970 | M | F | 5 |
Table 3 with classes
Id | DoB | Gender | Disease | Class |
---|---|---|---|---|
id1 | */1970 | * | A | 0 |
id2 | */1970 | * | B | 0 |
id3 | */1960 | * | C | 1 |
id4 | */1960 | * | D | 1 |
id5 | */1970 | * | E | 0 |
id6 | */1970 | * | F | 0 |
It is worth remarking that, for Table 3, even though class 0 has 4 elements, the k-anonymity is 2 because the class 1 has only 2 elements.
Table 4 with classes
Id | DoB | Gender | Disease | Class |
---|---|---|---|---|
id1 | * | M | A | 0 |
id2 | * | F | B | 1 |
id3 | * | F | C | 1 |
id4 | * | F | D | 1 |
id5 | * | M | E | 0 |
id6 | * | M | F | 0 |
In the next section, we introduce data generalisation techniques to increase the k-anonymity of a table.
References
- De Capitani Di Vimercati, S., Foresti, S., Livraga, G. and Samarati, P., 2012. Data privacy: definitions and techniques. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 20(06), pp.793-817.