Differential privacy
- What is differential privacy and what does it do?
- How does differential privacy assist with data protection compliance?
- What do we need to know about implementing differential privacy?
- What are the risks associated with using differential privacy?
What is differential privacy and what does it do?
Differential privacy is a formal mathematical property of the way a dataset or database is processed, providing a guarantee about people’s indistinguishability in the output. It is based on the randomised injection of noise.
An important aspect of differential privacy is the concept of “epsilon” or ɛ, which determines the level of added noise. Epsilon is also known as the “privacy budget” or “privacy parameter”.
Epsilon represents the worst-case amount of information that any third party could infer about a person from the result, including whether or not that person’s information was included in the input.
Noise allows for “plausible deniability” of a particular person’s personal information being in the dataset. This means that it is not possible to determine with confidence that information relating to a specific person is present in the data.
There are two ways for the privacy budget to be enforced:
- interactive, query-based DP – this is where noise is added to each query response and querying is terminated once the privacy budget is reached (ie where the information obtained from queries reaches a level at which personal information may be revealed) – see the sketch after this list; and
- non-interactive DP – this is where noise is applied to the information before it is released, so the level of identifiable information in the published output is fixed for a given privacy budget. This approach can be useful for publishing anonymous statistics to the world at large.
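The interactive approach can be illustrated with a short sketch. This is a minimal, illustrative example only: it assumes a simple counting query and the Laplace mechanism, and the class name, data and parameter values are invented for the purpose of illustration rather than taken from any particular product.

```python
import random


class PrivateQueryInterface:
    """Illustrative interactive DP front end: answers counting queries with
    Laplace noise and refuses to answer once the privacy budget is spent."""

    def __init__(self, records, total_budget):
        self.records = records              # the raw data, held by the curator
        self.remaining_budget = total_budget

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.remaining_budget:
            raise RuntimeError("Privacy budget exhausted - query refused")
        self.remaining_budget -= epsilon
        true_count = sum(1 for r in self.records if predicate(r))
        # A counting query has sensitivity 1, so the Laplace scale is 1/epsilon.
        # The difference of two exponential draws gives a Laplace sample.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise


# Example: ages of people in the (raw) dataset, total budget of 1.0.
interface = PrivateQueryInterface([23, 35, 45, 61, 29, 52], total_budget=1.0)
print(interface.noisy_count(lambda age: age > 40, epsilon=0.5))
print(interface.noisy_count(lambda age: age > 60, epsilon=0.5))
# A third query with epsilon 0.5 would now be refused.
```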
There are two types of differential privacy available:
- global differential privacy – this is where noise is added during aggregation; and
- local differential privacy – this is where each user adds noise to their individual records before aggregation.
Global (or centralised) differential privacy involves an “aggregator” having access to the real data. Each user of the system sends their information to the aggregator without noise. The aggregator then applies a differentially private mechanism by adding noise to the output (eg to a response to a database query, or to an entire dataset before release). The noise is added during computation of the final result, before it is shared with the third party. The main disadvantage of this approach is that the central aggregator has to access the real data. All the users have to trust the aggregator to act appropriately and protect people’s privacy.
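A minimal sketch of the global model is shown below. It assumes a trusted aggregator, usage values clipped to an assumed range of 0–1000 minutes, and the Laplace mechanism; the data, bounds and epsilon value are illustrative rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Users send their *raw* values to the trusted aggregator (this is the trust
# assumption of the global model). Values are clipped to a known range so the
# sensitivity of the query is bounded.
raw_minutes = np.array([300, 120, 450, 800, 240, 610])   # illustrative data
clipped = np.clip(raw_minutes, 0, 1000)

epsilon = 1.0
# For a sum of values bounded by 1000, one person joining or leaving can change
# the result by at most 1000, so the sensitivity is 1000.
sensitivity = 1000
noisy_sum = clipped.sum() + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Only the noised aggregate is shared with the third party.
print(round(noisy_sum))
```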
Example
Global differential privacy was used by the US Census Bureau when collecting personal information from people for the 2020 US Census. This was done to prevent matching between a person’s identity, their information, and a specific data release. The US Census Bureau was considered a trusted aggregator. In other words, it handled the information in line with the expectations of the participants and had robust controls in place.
Local differential privacy has the user of the system (or a trusted third party acting on a person’s behalf) applying the mechanism before anything is sent to the aggregator. Noise is added to the individual (input) data points. The aggregator receives “noisy” data – this addresses the trust risk of global differential privacy, as the real data is not shared with the aggregator. Since each user is responsible for adding noise to their own information, the total noise is much larger than in global differential privacy. Local differential privacy therefore requires many more users to get useful results.
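A minimal sketch of the local model is shown below, using randomised response, one well-known local differential privacy technique for a yes/no attribute. The function names, the simulated population and the epsilon value are illustrative. Note how many simulated users are needed before the corrected estimate becomes useful, which reflects the point above about the local model needing more users.

```python
import math
import random


def randomised_response(true_answer: bool, epsilon: float) -> bool:
    """Each user runs this on their own device before sending anything.
    With probability e^eps / (e^eps + 1) they report the truth, otherwise
    they report the opposite - this satisfies eps-local differential privacy."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_answer if random.random() < p_truth else not true_answer


def estimate_rate(noisy_answers, epsilon):
    """The aggregator only ever sees noisy answers; it corrects for the known
    flipping probability to estimate the true proportion of 'yes' responses."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(noisy_answers) / len(noisy_answers)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


# Simulate 10,000 users, 30% of whom truly hold the sensitive attribute.
epsilon = 1.0
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomised_response(t, epsilon) for t in truths]
print(estimate_rate(reports, epsilon))   # close to 0.3, but noisy
```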
There are several key differences between the two models:
- the global model leads to more accurate results with the same level of privacy protection, as less noise is added;
- the global model provides deniability of a person’s participation or non-participation (ie you cannot prove whether or not a person’s information was in the dataset);
- the local model provides deniability of a person’s record content, but not record association (the ability to link with other personal information which relates to that person);
- the local model is not necessarily suitable for producing anonymous information (eg statistics). However, you can use it to mitigate sensitive attribute inference (ie inference of attributes that should not be linkable to a person, such as gender or race), for example where something new could otherwise be learned about a known person.
Example
A smartphone OS (Operating System) developer wants to know the average number of minutes a person uses their device in a particular month, without revealing the exact amount of time.
To achieve this, the smartphone OS developer should build local differential privacy into their product. This would work by default, so that the person’s sensitive attributes are protected and the person using the device does not have to do anything.
Instead of reporting the exact amount of time, the person’s device adds a random value as noise – for example, a random number that with high probability lands in the range of -50 to +50 – to the actual number of minutes they use their device, and gives the smartphone OS developer only the resulting sum. For example, if someone had a monthly usage of 300 minutes and the device drew a random value of -50, it would report only the noised result of 250 minutes (300 + (-50)).
In this case, local DP can be applied by a user so that their attributes (eg device usage times) are noised and protected, but they could still be identified from other identifiers (eg their IP address).
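The worked example can be sketched in code. This is illustrative only: the Laplace scale of 12.5 is an assumption chosen so that the noise lands in roughly the -50 to +50 range about 98% of the time, and a real deployment would instead derive the scale from the chosen epsilon and the bounds placed on usage time.

```python
import numpy as np

rng = np.random.default_rng()

actual_minutes = 300                      # the value the device knows
# Laplace noise with scale 12.5 lands in roughly [-50, +50] about 98% of the
# time, mirroring the worked example above. In a real deployment the scale
# would be calibrated from the chosen epsilon and the bounds on usage time.
noise = rng.laplace(loc=0.0, scale=12.5)
reported_minutes = actual_minutes + noise

# Only the noised value ever leaves the device.
print(round(reported_minutes))            # eg 250 if the draw happened to be -50
```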
The diagram below shows the difference between a real-world computation (where a specific person’s information is included in the processing) and an opt-out scenario (where the person’s information is not included). Epsilon (ε) is the maximum distance between the result of a query on a database (the real-world computation) and the result of the same query on a database with a single entry added or removed.
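Formally, a randomised mechanism M satisfies ε-differential privacy if, for every pair of databases D and D′ that differ in a single person’s record (the real-world and opt-out scenarios described above) and for every possible set of outputs S:

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```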
Small values of ε force the outputs to be very similar for similar inputs, and therefore provide higher levels of privacy, as more noise is added. This makes it more difficult to distinguish whether a person’s information is present in the database. Large values of ε allow the outputs to differ more, as less noise is added, and it is therefore easier to distinguish between different records in the database.
Practical applications using the local model often use higher values of epsilon than the global model, due to the higher amount of noise required. If you require anonymous information as output, you can set epsilon so that the relative difference in the result of the two scenarios is so small that it is unlikely anyone could identify a specific person in the input.
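The effect of the two models on accuracy can be illustrated with a rough numerical sketch. It assumes values bounded between 0 and 1, the Laplace mechanism and an illustrative epsilon of 1; it is intended only to show why the local model needs many more users, or a higher epsilon, to match the accuracy of the global model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
epsilon, n_users = 1.0, 10_000
values = rng.uniform(0, 1, n_users)        # each user's value, bounded in [0, 1]
true_mean = values.mean()

# Global model: one Laplace draw on the mean. Changing one person's bounded
# value changes the mean by at most 1/n, so the scale is 1/(n * epsilon).
global_estimate = true_mean + rng.laplace(scale=1 / (n_users * epsilon))

# Local model: every user noises their own value (sensitivity 1 each),
# and the aggregator simply averages the noisy reports.
local_reports = values + rng.laplace(scale=1 / epsilon, size=n_users)
local_estimate = local_reports.mean()

print(abs(global_estimate - true_mean))    # tiny error
print(abs(local_estimate - true_mean))     # roughly sqrt(n) times larger error
```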
How does differential privacy assist with data protection compliance?
You can use differential privacy to:
- anonymise personal information, as long as you add an appropriate level of noise; and
- query a database to provide anonymous information (eg statistics).
Both models of differential privacy are able to provide anonymous information as output, as long as a sufficient level of noise is added to the data. The local model adds noise to the individual (input) data points to provide strong privacy protection of sensitive attributes. As the noise is added to each individual contribution, this will result in less accurate and useful information than the global model.
Any original information retained by the aggregator in the global model, or by the individual parties in the local model, is personal information in their hands. This also applies to any separately held additional information that may re-identify people – for example, device IP addresses or unique device IDs stored by the aggregator in the global model, or by the recipient of the information in the local model. However, in either model, the output may not be personal information in the hands of another party, depending on whether or not the risk of re-identification is sufficiently remote in their hands.
What do we need to know about implementing differential privacy?
The noise that differential privacy adds reduces the utility of the output, so using it may not always be beneficial. It is challenging to generate differentially private outputs that provide both strong protection and good utility for different purposes. Differential privacy is generally more useful for statistical analysis and identifying broad trends than for detecting anomalies or detailed patterns within data.
What are the risks associated with using differential privacy?
Differential privacy does not necessarily result in anonymous information. If you do not configure differential privacy properly, there is a risk of personal information leakage from a series of different queries. For example, if the privacy budget is poorly configured, an attacker can accumulate knowledge from multiple queries and re-identify someone.
Each subsequent release of a dataset constructed using the same underlying people further accumulates epsilon values. You should deduct the accumulated epsilon values from your overall privacy budget. This helps you ensure that no further queries can be answered once your privacy budget is exhausted. You should take this into account when setting your privacy budget and 'per-query' epsilon. For example, excessive queries early in the analysis can lead either to noisier outputs later, or no outputs at all, in order to remain within the privacy budget.
For example, suppose a release mechanism requires 10 queries to produce the dataset to be released, using an ε of 0.1 for each query, and the dataset is generated once a year to provide yearly updates. Each year a privacy cost of one is incurred, so after five years the privacy cost is five.
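This accumulation follows the basic sequential composition rule, under which per-query epsilon values simply add up. A minimal sketch of the book-keeping is shown below; the overall budget figure is illustrative, not a recommended value.

```python
eps_per_query = 0.1
queries_per_release = 10
releases_per_year = 1
years = 5

# Sequential composition: the epsilon values of every query simply add up.
annual_cost = eps_per_query * queries_per_release * releases_per_year   # 1.0
total_cost = annual_cost * years                                        # 5.0

overall_budget = 3.0    # illustrative budget chosen by the controller
print(total_cost, "spent against a budget of", overall_budget)
print("Budget exceeded" if total_cost > overall_budget else "Within budget")
```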
You should tune differential privacy on a case-by-case basis. You could consider obtaining expert knowledge for best results. Your privacy budget assessment should consider:
- the overall sensitivity of the information, which you can determine by measuring how much a single record can change the result of the query (see the sketch after this list);
- the nature of the attributes;
- the type of query made;
- the size of the population in the database;
- the number of queries that are likely to be made over the data lifecycle; and
- whether you set the privacy budget per user or globally, or both.
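As an illustration of the first point in the list above, the sensitivity of a query is the largest change that adding or removing a single record can make to its result, and the Laplace noise scale is then the sensitivity divided by epsilon. The query types, bounds and epsilon value below are illustrative only.

```python
# Sensitivity of a counting query: adding or removing one person changes the
# count by at most 1.
count_sensitivity = 1

# Sensitivity of a sum over values clipped to [0, max_value]: one person can
# change the sum by at most max_value.
max_value = 1000
sum_sensitivity = max_value

epsilon = 0.5
# The Laplace mechanism uses noise with scale = sensitivity / epsilon, so
# queries with higher sensitivity need proportionally more noise.
print("count noise scale:", count_sensitivity / epsilon)   # 2.0
print("sum noise scale:  ", sum_sensitivity / epsilon)     # 2000.0
```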
When setting an appropriate privacy budget to enforce limits on the number of queries made, you should consider the risk of unintended disclosure of personal information in any query you perform on the information. You should also consider contractual controls to mitigate the risk of malicious parties increasing the total amount of information they hold by making similar queries and then sharing the results with each other.
You should consider whether it is likely that:
- an attacker could accumulate knowledge on a person from the inputs or intermediate results;
- an attacker could accumulate knowledge from multiple queries; and
- malicious parties could collude to pool the results of their queries and increase their collective knowledge of the dataset.
If there is still some risk, you should adjust the privacy budget and re-assess until the risk is reduced to a remote level.
Further reading – ICO guidance
For more information on assessing identifiability, see our draft anonymisation guidance “How do we ensure anonymisation is effective?”
Further reading
For more information on the concept of differential privacy, see Harvard University’s publication “Differential Privacy: A Primer for a Non-technical Audience” (external link, PDF). Harvard University also has a number of open-source toolkits and resources available, such as its OpenDP Project.
For more information on differential privacy and the epsilon value, see Purdue University’s publication “How Much Is Enough? Choosing ε for Differential Privacy” (external link, PDF).
The Government Statistical Service has an introduction on differential privacy for statistical agencies, accessible on request from the GSS website.
For an analysis of differential privacy in the context of singling out, linkability and inferences see section 3.1.3 of the Article 29 Working Party’s Opinion 05/2014 on anonymisation techniques (external link, PDF).
OpenMined’s blog on “Local vs global differential privacy” provides a useful description of the two types along with some code examples.
For more information on using differential privacy for machine learning applications, see Microsoft’s blog on privacy-preserving machine learning and the Google blog Federated Learning with Formal Differential Privacy Guarantees.