Once a decade, the U.S. Census Bureau undertakes the massive challenge of counting every single person in the nation. We’re all asked to share sensitive information about ourselves—such as race, sex, and date of birth—which the bureau then uses to publish aggregate statistics about the population.
Published statistics are meant to be anonymous. But what if it’s possible to use the published numbers to reconstruct the raw data? After the 2010 Census, the bureau tested how well an attacker could reconstruct the underlying records from the published statistics, given the privacy protections in place at the time. The results were troubling: Someone could correctly recover census block (that is, location), sex, ethnicity, and age (plus or minus one year) for 71 percent of the population. That might not sound so bad, since the information would still not have a name attached to it. But if an attacker linked the reconstructed data with external, commercial datasets including names and addresses, they would be able to correctly re-identify at least 17 percent of the population. As with data breaches, re-identification could put people at risk of identity theft or privacy-invading ad targeting based on sensitive attributes like ethnicity or age.
In response, the Census Bureau decided for the 2020 Census to use differential privacy, a state-of-the-art approach born from cryptography. Differential privacy limits the extent to which a statistic would differ if it were to be calculated with or without any given person’s data, thereby maintaining individual-level privacy. This usually means adding a calibrated amount of random noise to a value that will be reported (for instance, the population of a county) in order to obscure the contribution of any given individual. For example, if a county’s population is 5,000, it may be reported as 5,005 or any other randomly drawn value from a specified probability distribution. (When the bureau applied differential privacy to the 2010 Census as a dry run, county populations differed from the original published counts on average by about four people for small counties with populations less than 1,000 and by about five people for large counties with populations of 100,000 or more.)
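To make the idea of calibrated noise concrete, here is a minimal sketch of one textbook way to release a noisy count, usually called the Laplace mechanism. It is an illustration only and not the bureau’s production algorithm; the function names, the epsilon value of 0.5, and the county population of 5,000 are assumptions chosen for the example.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Return a noisy version of a count.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon (epsilon is the privacy parameter).
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
true_population = 5000  # hypothetical county
reported = [round(noisy_count(true_population, 0.5, rng)) for _ in range(5)]
print(reported)  # five possible published values, each close to 5000
```

Each run of `noisy_count` gives a slightly different answer, which is the point: no single published value pins down whether any one person was counted.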
Adding more noise means lower accuracy but stronger privacy protections. One of the most important aspects of differential privacy is that it’s future-proof: Its privacy guarantees hold regardless of an attacker’s level of computational power or access to external data (including data that do not yet exist). So even a very, very well-resourced and motivated attacker should find it extremely challenging to use published information to figure out who you are if the information is protected under differential privacy.
Although 2020 Census statistics have yet to be fully released, the bureau is already facing pushback about the new approach. Alabama is suing the bureau (though the lawsuit was recently put on hold) on the grounds that it is violating its mandate to “report … accurate ‘[t]abulations of the population’ ” by releasing “inaccurate” statistics. Sixteen states are backing the lawsuit. The bureau, however, is also legally required not to publish individually identifiable data, putting it between a rock and a hard place. Some approach to protecting privacy is mandatory and differential privacy represents the strongest known defense against re-identification.
What’s tricky about differential privacy is deciding how much noise to add, and how to weigh the importance of accuracy versus privacy for a given dataset. For starters, the Census helps decide how to distribute more than $675 billion of federal funds every year. A community whose population is significantly underreported may not receive adequate funding for its schools and hospitals. On the other hand, threats to privacy are real and growing, particularly in the face of the U.S.’s lack of federal data privacy legislation. It’s not unreasonable to expect that potential attackers have gained access to more complete, correct information over time. Earlier we mentioned that a hypothetical attacker could correctly re-identify 17 percent of the population with the 2010 Census using commercial data available in 2010. That’s alarming, but the situation could be even worse: With the best-quality external data (fewer missing records and incorrect entries), which represents a worst-case scenario, an attacker could likely re-identify 58 percent of the population from 2010.
The tension between accuracy and privacy arises more generally in discussions around publishing data while protecting privacy. For some data scenarios, such as for smaller datasets, alternative approaches to privacy protection may be better suited. The Census Bureau is considering releasing the American Community Survey as fully synthetic data: generated data that mimic, but don’t exactly match, the real thing. However, generating new data that resemble the real data brings its own challenges. Some researchers are concerned that fully synthetic data will not be suitable for analysis purposes, because, for example, ensuring that synthetic data preserve all relationships that might ever be of interest is difficult (if not impossible). In other words, we can only be sure to preserve relationships that we already know are important, and it’s hard to know what we don’t know.
Finding the right balance between accuracy and privacy is never easy, and using differential privacy brings this question to a head. Implementing a simple differentially private algorithm for calculating a count, for example, requires setting a privacy loss budget, “epsilon.” If we choose a higher epsilon value, we expect the algorithm to add less noise to the count. In turn, privacy protections won’t be as strong. As an approach, differential privacy is agnostic about how to both choose an overall epsilon value and allocate it across released statistics. While finding a supposed “sweet spot” between accuracy and privacy is the goal, doing so is far from trivial and requires considering a few things about how differential privacy works.
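The effect of epsilon can be sketched in a few lines. The example below assumes the textbook Laplace mechanism for a count (noise scale of 1 / epsilon) and the basic composition rule, under which the epsilons spent on separate statistics add up to the total budget. The table names and weights are hypothetical and do not reflect the bureau’s actual allocation.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mean_abs_error(epsilon, trials, rng):
    """Average size of the noise added to a count at a given epsilon."""
    scale = 1.0 / epsilon
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

rng = random.Random(1)  # fixed seed only so the sketch is repeatable
errors = {eps: mean_abs_error(eps, 20000, rng) for eps in (0.1, 1.0, 10.0)}
for eps, err in errors.items():
    print(f"epsilon={eps}: typical error of about {err:.2f} people")

def split_budget(total_epsilon, weights):
    """Divide a total epsilon across statistics in proportion to weights.

    Under basic composition the shares sum to the total budget, so giving
    one table more accuracy leaves less for the others.
    """
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight
            for name, w in weights.items()}

# Hypothetical tables and weights, for illustration only.
shares = split_budget(1.0, {"total_population": 2, "age_by_sex": 1, "race": 1})
print(shares)
```

A higher epsilon shrinks the typical error, and a table that receives a smaller share of the budget is published with more noise. The method itself does not say which split is right.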
First, statistics for smaller populations may be relatively less accurate compared with statistics for larger populations. To get the same privacy guarantees, the noise we add must be larger relative to the size of a smaller dataset. This raises questions around whether it’s fair for certain populations (for example, people living in rural areas or minorities) to be less accurately represented in the public record. On the other hand, equalizing accuracy across groups means that smaller populations will have weaker privacy guarantees.
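A little arithmetic shows why small populations bear more of the cost. The sketch assumes the textbook Laplace mechanism for a simple count, where the expected size of the noise is 1 / epsilon regardless of how many people are counted. The populations, the epsilon of 0.1, and the 1 percent accuracy target are made-up values.

```python
def expected_relative_error(population, epsilon):
    """Expected absolute noise (1 / epsilon) as a share of the true count."""
    return (1.0 / epsilon) / population

def epsilon_for_relative_error(population, target):
    """Epsilon needed so the expected noise is `target` share of the count."""
    return 1.0 / (target * population)

# Same privacy level for every group: accuracy differs.
for population in (50, 1_000, 100_000):
    share = expected_relative_error(population, 0.1)
    print(f"population {population}: expected error of {share:.2%}")

# Same accuracy for every group: privacy level differs.
for population in (50, 1_000, 100_000):
    eps = epsilon_for_relative_error(population, 0.01)
    print(f"population {population}: needs epsilon of about {eps:.4f}")
```

Holding epsilon fixed, the smallest group has the largest percentage error. Holding the percentage error fixed, the smallest group needs the largest epsilon, which means the weakest privacy protection.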
Moreover, because the noise added by a differentially private algorithm is random, understanding its implications for a given scenario requires knowing something about the level of aggregation. Independent random errors that average out to zero tend to partly cancel each other out when they are added together. So let’s say we’re interested in estimating the total population of a voting district by using noisy counts from the census blocks that make up the district. The more block counts that go into the district count, the more the errors in those counts tend to offset one another, so the error in the district total grows more slowly than the total itself and the district count isn’t too affected. Census data are used for many different analytical purposes with varying levels of aggregation, however, and any single choice of how to allocate epsilon will favor some scenarios over others.
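The aggregation effect can be checked with a small simulation. This is a sketch under simplifying assumptions: every block has the same hypothetical population of 100, each block count gets independent Laplace noise with scale 1 / epsilon, and the district total is the plain sum of the noisy blocks. It is not a description of how the bureau actually builds its tables.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def district_errors(num_blocks, block_population, epsilon, trials, rng):
    """Average absolute and relative error of a district total obtained
    by summing independently noised block counts."""
    scale = 1.0 / epsilon
    true_total = num_blocks * block_population
    total_abs_error = 0.0
    for _ in range(trials):
        # Error in the district total is the sum of the block-level noise.
        noise_sum = sum(laplace_noise(scale, rng) for _ in range(num_blocks))
        total_abs_error += abs(noise_sum)
    mean_abs = total_abs_error / trials
    return mean_abs, mean_abs / true_total

rng = random.Random(2)  # fixed seed only so the sketch is repeatable
results = {}
for num_blocks in (1, 10, 100):
    results[num_blocks] = district_errors(num_blocks, 100, 0.5, 2000, rng)
    abs_err, rel_err = results[num_blocks]
    print(f"{num_blocks} blocks: off by about {abs_err:.1f} people ({rel_err:.2%})")
```

The error in people grows as more blocks are summed, but far more slowly than the district’s population, so the percentage error falls.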
In these discussions, it might be easy to focus on accuracy, since the effects of inaccurate counts may be more immediately felt. However, it’s important not to lose sight of harms associated with private data inadvertently being made public. Considering re-identification risk, and its associated costs, alongside accuracy could help in more holistically assessing epsilon.
Last, one of the benefits of differential privacy is that specifications of the algorithms used, like epsilon, can be made public. That wasn’t true for previous methods of protecting privacy used by the Census Bureau—there was concern that publishing specifics of how the data were changed could end up compromising privacy protection. If we’re able to effectively communicate the implications of epsilon, we can foster public trust in the Census and engage a wider audience in discussions about the tradeoff between accuracy and privacy. For instance, stakeholders with different backgrounds and areas of expertise, such as those whose livelihoods are directly affected by census counts, can point out potential pitfalls in prioritizing either accuracy or privacy before it’s too late.
Slightly complicating the matter, however, is that for the 2020 Census the bureau plans to use a more recent variation of differential privacy, zero-concentrated differential privacy, that instead uses a parameter “rho.” Zero-concentrated differential privacy generalizes standard differential privacy and better characterizes privacy loss over multiple data releases. While guarantees provided by a given value of rho can be converted to guarantees under a simpler version of differential privacy (which includes epsilon), the extra parameter nonetheless adds a potential opportunity for confusion around parameter choices.
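As an illustration of that conversion, the sketch below uses one standard bound from the research literature on zero-concentrated differential privacy: a guarantee with parameter rho implies an (epsilon, delta) guarantee with epsilon = rho + 2 * sqrt(rho * ln(1 / delta)). The rho and delta values are hypothetical, and the bureau may report its parameters using a different or tighter conversion.

```python
import math

def zcdp_to_epsilon(rho, delta):
    """Epsilon implied by a rho guarantee at a chosen delta.

    Uses the bound epsilon = rho + 2 * sqrt(rho * ln(1 / delta)).
    delta is a small probability with which the epsilon bound may fail.
    """
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

delta = 1e-10  # hypothetical choice
for rho in (0.1, 0.5, 2.0):  # hypothetical budgets
    print(f"rho={rho}: epsilon of about {zcdp_to_epsilon(rho, delta):.2f}")
```

The same rho maps to different epsilons depending on delta, which is one reason a single headline number can be hard to interpret.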
Learning to navigate the tradeoff between accuracy and privacy will very likely not be a topic for computer scientists alone. Clearly, as technologists we will need society’s help in answering what’s become an increasingly pressing question: What’s the right balance and how do we find it?
Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.