K-anonymity

In this section, we introduce the concepts of de-identification, re-identification, quasi-identifier, and K-anonymity.

De-identification

The table below is a fictional healthcare database, containing the name, data of birth, ZIP code, gender and disease of hospital patient. This example is directly taken from the paper “Data Privacy: Definitions and Technique; by Sabrina De Capitani di Vimercati, Sara Foresti, Giovanni Livraga, Pierangela Samarati”, with randomly generated first and last names.

Table 1

Name	DoB	ZIP code	Gender	Disease
Josef Wormley	02/09/1970	94152	M	Hepatitis
Sonia Soliman	20/09/1970	94143	F	Cardiomyopathy
Gayla Phong	12/09/1970	94148	F	Eczema
Del Dear	05/09/1970	94155	M	Pneumonia
Kacie Michelsen	01/08/1960	94154	F	Stroke
Bambi Eichorn	02/08/1960	94153	F	Stroke
Gary Hundley	10/08/1960	94140	M	Stroke
Seymour Simons	20/08/1960	94141	M	Stroke
Ilary Frisby	07/08/1970	94141	F	High Cholesterol
Brandy Dusek	05/08/1970	94142	F	Erythema
Kurtis Osborne	09/07/1958	94232	M	Diabetes
Emery Yearta	25/08/1970	94153	M	High Cholesterol
Miguel Voss	30/08/1970	94156	M	Angina Pectoris
Derrick Vanzant	02/09/1960	94147	M	Hepatitis
Alexandre Taggart	05/09/1960	94145	M	Flu
Sarina Keim	10/09/1960	94158	F	Angina Pectoris
Idell Shaw	30/09/1960	94159	F	Cardiomyopathy

This table clearly contains personal and confidential information, and the patients concerned have the right for this information not to be disclosed publicly. On the other hand, analysing this data might help derive some correlation between some factors (such as a age, gender, or location) and disease, leading to a better understanding of these diseases, and hopefully improving treatment.

For instance, Covid-19 seems to have a different death rate for men and women, and understanding if this difference is biological or social could help treat it or mitigate its impact.

Of course, the data controller can de-identify the data before releasing it, by removing the attributes that uniquely identify a respondent. In this case, the name in the table could be removed, keeping only the date of birth, the ZIP code and the gender.

Re-identification with quasi-identifiers

However, as demonstrated in the Netflix case mentioned before, it might be possible to re-identify data tuples, i.e., to find the identity corresponding to a data tuple using information attributes which, on their own, would not be enough. Such attributes are called quasi-identifiers. For instance, a seminal paper by Sweeney showed that 87% of the population in the United States have a unique combination based only on 5-digit ZIP, gender and date of birth.

In other words, if you assume that an attacker knows the ZIP code, gender and date of birth of the whole US population, then even if one removes the names of every single patient above, the attacker would be able to re-identify 87%.

For instance, consider the attacker has the following information about the population.

Name	DoB	Gender
Sonia Soliman	20/09/1970	F
Bambi Eichorn	02/08/1960	F
Josef Wormley	02/09/1970	M
Emery Yearta	25/08/1970	M
Miguel Voss	30/08/1970	M
Kacie Michelsen	01/08/1960	F

If we were to release the following de-identified Table 2, could you map each identifier with a name? Note that we use arbitrary disease names here.

Table 2

ID	DoB	Gender	Disease
id1	02/09/1970	M	A
id2	20/09/1970	F	B
id3	01/08/1960	F	C
id4	02/08/1960	F	D
id5	25/08/1970	M	E
id6	30/08/1970	M	F

Since it is possible to identify each individual with a unique name, we now decide to remove some of the information. In Table 3, we remove the day and the month of the date of birth, as well as the gender. For each id in Table 3, which name could it correspond to?

Table 3

Id	DoB	Gender	Disease
id1	*/1970	*	A
id2	*/1970	*	B
id3	*/1960	*	C
id4	*/1960	*	D
id5	*/1970	*	E
id6	*/1970	*	F

We do something similar in Table 4, by removing the date of birth, but keeping the gender. For each id in Table 3, which name could it correspond to?

Table 4

Id	DoB	Gender	Disease
id1	*	M	A
id2	*	F	B
id3	*	F	C
id4	*	F	D
id5	*	M	E
id6	*	M	F

K-anonymity

This leads us to the definition of k-anonymity: A table is k-anonymous if every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents. If you can re-identify exactly each individual, then k=1; if you know that each individual is in a group of two tuples, then k=2, etc.

In order to calculate the k-anonymity of a table, the easiest is to associate each possible combination of quasi-identifiers values present in the table with a specific class. Different tuples with the same combination of quasi-identifiers will therefore be associated with the same class. The value for k is cardinality of the smallest class (i.e., the smallest number of tuples that have the same combination of quasi-identifiers).

We provide below the identification of all the relevant classes for the tables 2-4 presented above, as well as the resulting k-anonymity.

Table 2 with classes

ID	DoB	Gender	Disease	Class
id1	02/09/1970	M	A	0
id2	20/09/1970	F	B	1
id3	01/08/1960	F	C	2
id4	02/08/1960	F	D	3
id5	25/08/1970	M	E	4
id6	30/08/1970	M	F	5

The K-anonymity of the table is 1 (example of mininal class: 0)

Table 3 with classes

Id	DoB	Gender	Disease	Class
id1	*/1970	*	A	0
id2	*/1970	*	B	0
id3	*/1960	*	C	1
id4	*/1960	*	D	1
id5	*/1970	*	E	0
id6	*/1970	*	F	0

The K-anonymity of the table is 2 (example of mininal class: 1)

It is worth remarking that, for Table 3, even though class 0 has 4 elements, the k-anonymity is 2 because the class 1 has only 2 elements.

Table 4 with classes

Id	DoB	Gender	Disease	Class
id1	*	M	A	0
id2	*	F	B	1
id3	*	F	C	1
id4	*	F	D	1
id5	*	M	E	0
id6	*	M	F	0

The K-anonymity of the table is 3 (example of mininal class: 0)

In the next section, we introduce data generalisation techniques to increase the k-anonymity of a table.

References

De Capitani Di Vimercati, S., Foresti, S., Livraga, G. and Samarati, P., 2012. Data privacy: definitions and techniques. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 20(06), pp.793-817.

Last modified April 30, 2020