K-anonymity

In this section, we introduce the concepts of de-identification, re-identification, quasi-identifier, and K-anonymity.

De-identification

The table below is a fictional healthcare database, containing the name, data of birth, ZIP code, gender and disease of hospital patient. This example is directly taken from the paper “Data Privacy: Definitions and Technique; by Sabrina De Capitani di Vimercati, Sara Foresti, Giovanni Livraga, Pierangela Samarati”, with randomly generated first and last names.

Table 1

NameDoBZIP codeGenderDisease
Josef Wormley02/09/197094152MHepatitis
Sonia Soliman20/09/197094143FCardiomyopathy
Gayla Phong12/09/197094148FEczema
Del Dear05/09/197094155MPneumonia
Kacie Michelsen01/08/196094154FStroke
Bambi Eichorn02/08/196094153FStroke
Gary Hundley10/08/196094140MStroke
Seymour Simons20/08/196094141MStroke
Ilary Frisby07/08/197094141FHigh Cholesterol
Brandy Dusek05/08/197094142FErythema
Kurtis Osborne09/07/195894232MDiabetes
Emery Yearta25/08/197094153MHigh Cholesterol
Miguel Voss30/08/197094156MAngina Pectoris
Derrick Vanzant02/09/196094147MHepatitis
Alexandre Taggart05/09/196094145MFlu
Sarina Keim10/09/196094158FAngina Pectoris
Idell Shaw30/09/196094159FCardiomyopathy

This table clearly contains personal and confidential information, and the patients concerned have the right for this information not to be disclosed publicly. On the other hand, analysing this data might help derive some correlation between some factors (such as a age, gender, or location) and disease, leading to a better understanding of these diseases, and hopefully improving treatment.

For instance, Covid-19 seems to have a different death rate for men and women, and understanding if this difference is biological or social could help treat it or mitigate its impact.

Of course, the data controller can de-identify the data before releasing it, by removing the attributes that uniquely identify a respondent. In this case, the name in the table could be removed, keeping only the date of birth, the ZIP code and the gender.

Re-identification with quasi-identifiers

However, as demonstrated in the Netflix case mentioned before, it might be possible to re-identify data tuples, i.e., to find the identity corresponding to a data tuple using information attributes which, on their own, would not be enough. Such attributes are called quasi-identifiers. For instance, a seminal paper by Sweeney showed that 87% of the population in the United States have a unique combination based only on 5-digit ZIP, gender and date of birth.

In other words, if you assume that an attacker knows the ZIP code, gender and date of birth of the whole US population, then even if one removes the names of every single patient above, the attacker would be able to re-identify 87%.

For instance, consider the attacker has the following information about the population.

NameDoBGender
Sonia Soliman20/09/1970F
Bambi Eichorn02/08/1960F
Josef Wormley02/09/1970M
Emery Yearta25/08/1970M
Miguel Voss30/08/1970M
Kacie Michelsen01/08/1960F

If we were to release the following de-identified Table 2, could you map each identifier with a name? Note that we use arbitrary disease names here.

Table 2

IDDoBGenderDisease
id102/09/1970MA
id220/09/1970FB
id301/08/1960FC
id402/08/1960FD
id525/08/1970ME
id630/08/1970MF

Since it is possible to identify each individual with a unique name, we now decide to remove some of the information. In Table 3, we remove the day and the month of the date of birth, as well as the gender. For each id in Table 3, which name could it correspond to?

Table 3

IdDoBGenderDisease
id1*/1970*A
id2*/1970*B
id3*/1960*C
id4*/1960*D
id5*/1970*E
id6*/1970*F

We do something similar in Table 4, by removing the date of birth, but keeping the gender. For each id in Table 3, which name could it correspond to?

Table 4

IdDoBGenderDisease
id1*MA
id2*FB
id3*FC
id4*FD
id5*ME
id6*MF

K-anonymity

This leads us to the definition of k-anonymity: A table is k-anonymous if every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents. If you can re-identify exactly each individual, then k=1; if you know that each individual is in a group of two tuples, then k=2, etc.

In order to calculate the k-anonymity of a table, the easiest is to associate each possible combination of quasi-identifiers values present in the table with a specific class. Different tuples with the same combination of quasi-identifiers will therefore be associated with the same class. The value for k is cardinality of the smallest class (i.e., the smallest number of tuples that have the same combination of quasi-identifiers).

We provide below the identification of all the relevant classes for the tables 2-4 presented above, as well as the resulting k-anonymity.

Table 2 with classes

IDDoBGenderDiseaseClass
id102/09/1970MA 0
id220/09/1970FB 1
id301/08/1960FC2
id402/08/1960FD3
id525/08/1970ME 4
id630/08/1970MF5
The K-anonymity of the table is 1 (example of mininal class: 0)

Table 3 with classes

IdDoBGenderDiseaseClass
id1*/1970*A 0
id2*/1970*B0
id3*/1960*C1
id4*/1960*D1
id5*/1970*E0
id6*/1970*F0
The K-anonymity of the table is 2 (example of mininal class: 1)

It is worth remarking that, for Table 3, even though class 0 has 4 elements, the k-anonymity is 2 because the class 1 has only 2 elements.

Table 4 with classes

IdDoBGenderDiseaseClass
id1*MA 0
id2*FB 1
id3*FC1
id4*FD1
id5*ME0
id6*MF0
The K-anonymity of the table is 3 (example of mininal class: 0)

In the next section, we introduce data generalisation techniques to increase the k-anonymity of a table.

References

  • De Capitani Di Vimercati, S., Foresti, S., Livraga, G. and Samarati, P., 2012. Data privacy: definitions and techniques. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 20(06), pp.793-817.

Last modified April 30, 2020