Differential privacy
====================
Much of cryptography, systems security so far: strong confidentiality.
Even zero-knowledge strives to reveal nothing.
What can we do in situations where we want / need to reveal some information?
Ideally want more privacy than simply revealing everyone's full data.
Can we make this more precise, and more secure against adversaries?
Many examples:
Demographics, ala census.
Aggregate network use statistics.
Aggregate statistics about Google search queries.
Public-health data.
Typically two problems to be solved for this setting.
1. What should we release, if all data is in a trustworthy computer?
2. What if we don't have a single trustworthy computer with all the data?
This lecture is about the first problem.
Second question is the focus of multi-party computation, trusted hardware, etc.
Strawman: anonymize or remove names.
Suppose a hospital has a database of patients.
Name, diagnosis, medications, outcome, date of birth, ZIP code, ...
What if we just remove the name?
Date-of-birth and ZIP code could be nearly unique for a person.
Adversary might have additional information (like people's DOB and ZIP data).
This would allow adversary to re-identify people, even if name is removed.
Strawman: make sure there are other identical-looking people in the database.
"K-anonymity".
Figure out which fields are "identifying" (like DOB and ZIP).
Ensure there are at least K (e.g., 2 or 10) records with the same identifying fields.
Weakness: what if the adversary knows the other K-1 people in your group?
Variant of that weakness: adversary now knows what conditions all K people have.
E.g., what if every one of the K people in your group has COVID?
Adversary knows you definitely have COVID too.
Strawman: release only aggregates of many people (e.g., averages).
E.g., only release the average grades of several students.
Might form groups, but perhaps OK if groups are large enough?
What if we release average midterm grades in 6.1600 by dorm?
Weakness: adversary might know everyone except you in your average group.
E.g., knows everyone else in your dorm taking 6.1600.
If adversary knows everyone else's score, can figure out your score from avg.
Differential privacy.
What does it mean for some aggregate result to preserve privacy?
Intuition: result should not be influenced too much by any individual.
Setting: database D, query/mechanism M.
D (with person X) -> M -> result r
D' (without person X) -> M -> result r'
As with the "average" strawman, deterministic averages are problematic.
Again, if adversary knows everyone but X, could figure out X's data.
One observation from differential privacy: need randomness for privacy.
Definition.
For every database D, every neighboring database D' (one person added/removed),
and every possible result r, it must be that
Pr[M(D) = r] <= e^\epsilon * Pr[M(D') = r]
Every result possible with X must also be possible without X.
Probability of seeing that result must also be pretty similar (within a factor of e^\epsilon).
Imagine \epsilon being ideally something like 0.5--1 (so e^\epsilon =~ 2).
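A minimal sketch of the definition holding with equality, using randomized response (a classic mechanism not covered in these notes, included purely as an illustration): each person reports their true bit with probability e^\epsilon / (1 + e^\epsilon), and the flipped bit otherwise.

```python
import math
import random

def randomized_response(true_bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit

# Check the DP definition directly: for either reported bit r, the ratio
# Pr[M(0) = r] / Pr[M(1) = r] is exactly e^eps or e^-eps.
eps = 1.0
p_truth = math.exp(eps) / (1 + math.exp(eps))
assert abs(p_truth / (1 - p_truth) - math.exp(eps)) < 1e-9
```

Note that the output is random: sometimes the mechanism lies, which is exactly what gives each person deniability.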
What's good about this definition?
Does not depend on what else adversary knows (all other people, etc).
Gives a guarantee from an individual person's perspective.
Closed under post-processing: adversary processing DP result can't leak more.
Closed under composition: two queries together cost the sum of their epsilons.
What's perhaps surprising about this definition?
Can't have results that are only possible with X, and impossible without X.
Every result has to be possible both with and without X.
Effectively requires that mechanism M must be randomized.
What's not so great about definition?
Strong trade-off between privacy (lower epsilon) and noise.
For a small data set and strong privacy, result might be almost useless.
At some level, maybe unavoidable, but means can be hard to apply in practice.
Doesn't cover data that's not "one row away" (e.g., spread among many rows).
What if a user has many location records over time?
Relatedly, not as good of a fit for streaming data or continuous data over time.
Privacy continues to decrease with more queries.
How to keep using the same data?
Definition geared towards specific queries, not releasing datasets.
Perhaps also unavoidable at some level.
How to achieve differential privacy?
Suppose we want to publish the average grade by dorm.
For each dorm, actual average is a concrete value, 0 through 100.
Graph: probability on Y axis, reported average on X axis.
Two dots: for D and D'.
Again, deterministic does not satisfy DP.
Want decaying probabilities with decay multiple e^\epsilon.
Laplace mechanism.
Add Laplace noise to result.
Laplace probability distribution is exactly decaying with e^\epsilon.
Graph: Laplace PDF.
Need to know how much noise to add: depends on distance between D and D'.
Sensitivity: max grade / number of students in group.
Need noise proportional to sensitivity s.
r = f(D) + Lap(s / \epsilon)
Higher sensitivity: wider Laplace distribution.
Higher epsilon (less privacy): narrower Laplace distribution.
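A minimal sketch of the Laplace mechanism in Python (function names, grades, and epsilon value are my own illustrative choices, assuming NumPy):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Average grade for a dorm: grades are in [0, 100], so adding/removing one
# person changes the average by at most 100 / n.
grades = [85, 92, 78, 64, 99]
sensitivity = 100 / len(grades)
noisy_avg = laplace_mechanism(np.mean(grades), sensitivity, epsilon=0.5)
```

Doubling epsilon halves the noise scale; doubling sensitivity doubles it, matching the two observations above.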
Global vs. local sensitivity.
Local sensitivity is the max change in result for neighbors of current database D.
Global sensitivity is the max change for any database D.
Some functions have quite different local vs. global sensitivity.
Suppose we wanted to release the max grade, instead of average grade.
If someone already had grade 100, local sensitivity is zero.
But global sensitivity is 100: might have someone with 0, and added person gets 100.
Not safe to use local sensitivity of 0 for noise!
If adversary sees exactly 100 as the noised result, quite likely that one person had 100.
In contrast, if nobody had 100, we would have added more noise.
There are more sophisticated techniques, called "smooth sensitivity", that safely add less noise when local sensitivity is low.
Need to bound sensitivity.
Sensitivity depends on the input data.
For grades, we know grade is 0-100, which bounds sensitivity.
For integers in general, if we don't know bounds, we don't know sensitivity.
E.g., avg(income) might not be that sensitive, unless we include extreme outliers.
But with outliers, sensitivity is very high, requiring lots of noise.
Can be useful to pre-process data before aggregation to limit sensitivity.
E.g., cap income to between 0 and 100K to significantly reduce noise.
Returns a somewhat different value, but might be still useful for application.
Unexpected effect: noise might produce nonsensical results (like average grade < 0 or > 100).
Safe to cap the result (e.g., min 0, max 100).
Preserves differential privacy.
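A sketch of both steps combined: clip inputs before averaging to bound sensitivity, then clamp the noisy output (the 100K cap and income values are made-up illustrations):

```python
import numpy as np

def clipped_dp_average(values, lower, upper, epsilon, rng=None):
    """Clip inputs to [lower, upper], release a Laplace-noised average,
    and clamp the result back into range (post-processing, so still DP)."""
    if rng is None:
        rng = np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    # With inputs bounded to [lower, upper], one person moves the
    # average by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(values)
    noisy = clipped.mean() + rng.laplace(scale=sensitivity / epsilon)
    return min(max(noisy, lower), upper)

incomes = [30_000, 45_000, 60_000, 5_000_000]  # one extreme outlier
result = clipped_dp_average(incomes, 0, 100_000, epsilon=1.0)
```

Without the cap, the outlier would force sensitivity of millions / n and drown the average in noise; with it, the answer is biased but usable.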
Weaker variant of differential privacy: (\epsilon, \delta)-DP.
Pr[M(D) = r] <= e^\epsilon Pr[M(D') = r] + \delta
Useful when you can't quite sample the right randomness distribution.
Or if you want a noise distribution with fewer outliers, like Gaussian.
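For instance, the standard Gaussian mechanism gives (\epsilon, \delta)-DP using noise sigma = sensitivity * sqrt(2 ln(1.25/\delta)) / \epsilon (the classic analysis assumes \epsilon < 1); a sketch, with illustrative parameter values of my own:

```python
import math
import numpy as np

def gaussian_mechanism(true_value, sensitivity, epsilon, delta, rng=None):
    """(epsilon, delta)-DP release via Gaussian noise; assumes epsilon < 1."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

noisy = gaussian_mechanism(72.5, sensitivity=20.0, epsilon=0.5, delta=1e-5)
```

Gaussian tails decay much faster than Laplace tails, so extreme outputs are rarer, at the cost of the small delta failure probability.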
How to achieve differential privacy for discrete results (e.g., most common name)?
Could supply a list of possible names, ask for number of people with each name.
That requires splitting up the epsilon budget for each name.
In other words, asking for much more data than we need.
Exponential mechanism.
Still supply a list of possible names, but ask for a single most-common-name result.
Make the probability of returning each name proportional to e^(\epsilon * count),
where count is how many people have that name.
Why this probability?
Ideally would return max name with probability 1, others with probability 0.
But that wouldn't achieve DP.
Above probability achieves right proportion.
Neighboring DBs differ in count by 1.
So ratio of probabilities is exactly e^\epsilon.
Gives a more accurate result for most-common-name than asking for counts of each name.
In general, would need to scale this by sensitivity if not 1.
Count can be replaced by an arbitrary "goodness score" for that result.
Laplace mechanism then is a special case of this exponential mechanism.
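A sketch of the exponential mechanism as described above, with weights e^(\epsilon * count). This assumes scores are counts with sensitivity 1 that change monotonically between neighboring databases; the general mechanism divides the exponent by 2 * sensitivity. Names and counts are made-up examples.

```python
import numpy as np

def exponential_mechanism(candidates, scores, epsilon, rng=None):
    """Sample a candidate with probability proportional to exp(epsilon * score).
    Assumes sensitivity-1, monotonic scores (true for counts); the general
    mechanism uses exp(epsilon * score / (2 * sensitivity))."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    # Subtract the max score before exponentiating for numerical stability;
    # this rescales every weight equally, so the distribution is unchanged.
    weights = np.exp(epsilon * (scores - scores.max()))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

names = ["alice", "bob", "carol"]
counts = [40, 3, 2]
likely_most_common = exponential_mechanism(names, counts, epsilon=1.0)
```

With a clear winner the mechanism almost always returns it, yet the losing names keep nonzero probability, as the definition requires.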
A number of serious systems have tried using differential privacy (Apple, Google).
Not clear this ended up being practical.
As one example, Google's RAPPOR for reporting crashes back to Google.
[ Ref: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf ]
As far as I can tell, not used anymore.
Generated spurious backtraces for bugs that weren't real, but wasted lots of engineer time.
Replaced with another system, Prochlo.
[ Ref: https://dl.acm.org/doi/pdf/10.1145/3132747.3132769 ]
Not differential privacy, but ensures only repeated bugs/backtraces are reported.
US Census use of differential privacy.
Collects lots of data from the population every 10 years.
Publishes aggregate information, but also tries to prevent disclosure.
Used more ad-hoc techniques for privacy in the past, but discovered after 2010 that they weren't enough.
Can correlate released census data with public data sets to re-identify some people.
2020 census used differential privacy for released data.
Quite a bit more sophisticated than the simple DP story we've seen so far.
Including a more sophisticated variant of epsilon,delta-DP definition.
[ Ref: https://arxiv.org/abs/1605.02065 ]
Careful accounting of where the privacy "budget" goes.
E.g., potential breakdowns of where the privacy budget might go, on a demo using 2010 census data:
https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/02-Demographic_and_Housing_Characteristics/2022-03-16_Summary_File/2022-03-16_Privacy-Loss_Budget_Allocations.pdf
Various concerns about noised census data.
E.g., https://www.brennancenter.org/our-work/court-cases/alabama-v-us-dept-commerce
Alabama sued saying noised data is not accurate for deciding on voting districts.
One result was that epsilon went up, from the original plan of 12.2 to 19.61.
Not clear that the direct DP guarantee is that strong.
e^19.61 ~= 300M times more likely.
Lots of other discussion about the use of differential privacy by the US Census.
Starting points:
https://mucollective.northwestern.edu/files/2022-dp-census.pdf
https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census/release/1
https://hdsr.mitpress.mit.edu/pub/3vj5j6i0/release/3