
Annabelle Theobald

"The new gold standard of privacy protection"

Differential Privacy is considered a game changer for privacy protection in data analysis. The technique has been in use for several years: the U.S. Census Bureau, for example, used Differential Privacy (DP) when publishing data from the 2020 U.S. Census. The tech giant Apple, which is particularly committed to data protection, also relies on DP to analyze its users' data in a privacy-compliant manner. CISPA research group leader Zhikun Zhang explains what DP is, what problems it still faces, and how his research aims to make the technique even better.

It is well known that data has become a tradable commodity in our digitalized world. Less widely appreciated is how much benefit data already provides to society, and how much more it could provide in the future. A few examples: according to experts, analyzing medical data such as blood values, oxygen saturation, MRI scans, or X-ray images with the help of artificial intelligence (AI) will take healthcare to a whole new level in the coming years, because AI can combine and analyze huge amounts of data. Autonomous driving is likewise inconceivable without processing the immense amounts of sensor data collected around and inside the car. And that is not to mention long-established conveniences such as displays showing when a public swimming pool is likely to be less crowded, or where the next traffic jam is expected. All of this is only possible through the analysis of huge amounts of data.

A lot of data, a lot of protection

But these examples also make it easy to see where the problem lies: much of this data is enormously sensitive and reveals a great deal about us, our state of health, our habits, and our movement patterns. The protection of privacy, an old issue in itself, is thus more relevant today than ever before. Differential Privacy, first introduced in 2006, appears to offer a solution. "The new gold standard of privacy protection is Differential Privacy," says Zhikun Zhang.

According to the researcher, the goal of Differential Privacy (DP) is fundamentally simple: to learn as much as possible about a specific group of people from an existing dataset, without learning anything about the individuals in that group. 

What's behind differential privacy?

"For one thing, the term provides a mathematical definition of privacy: a kind of statistical guarantee that an individual's data won't noticeably affect the outcome of queries on a larger dataset," Zhang explains. "On the other hand, it's also often used to describe the concrete process by which database queries are answered in a privacy-preserving way." The concept was created by cryptographer Cynthia Dwork, who, together with fellow researchers, presented the first formula for measuring how much of a privacy violation a person faces when their data becomes part of a larger, published data collection.
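The statistical guarantee Zhang describes can be stated compactly. The following is the standard formulation of ε-differential privacy (a textbook rendering, not a formula quoted in this article): a randomized mechanism $M$ satisfies ε-DP if, for any two datasets $D$ and $D'$ that differ in a single person's record, and for every set of possible outputs $S$,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

A small ε means the output distribution barely changes when any one person's data is added or removed, so an observer of the result learns almost nothing about that individual.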

Noise for more privacy

With the large amounts of data collected today, machine learning models are trained to perform a variety of tasks. For example, a model based on a large set of data from cancer patients, such as blood values, genetic information and MRI findings, could be trained to detect developing cancer much earlier than is currently the case. To ensure that this highly sensitive medical data remains secure, it must be anonymized in some form. However, it is not enough to remove personally identifying characteristics such as names or addresses. This is because multiple queries and the combination of characteristics that at first glance appear to be of little significance often allow unambiguous conclusions to be drawn about individuals. Instead, "noise" is added to the data. This involves various methods to introduce a kind of "controlled randomness" in the response to queries. 
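The "controlled randomness" described above is often implemented with the Laplace mechanism, one of the classic DP building blocks. The sketch below is a minimal, illustrative example (the dataset and predicate are invented for illustration, not taken from the article): a counting query is answered with noise whose scale is calibrated to the query's sensitivity.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with Laplace noise (classic Laplace mechanism)."""
    true_count = sum(1 for record in data if predicate(record))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1),
    # so noise drawn from Laplace(scale = 1/epsilon) yields epsilon-DP.
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return float(true_count + noise)

# Hypothetical query: how many patients have an elevated blood marker?
patients = [{"blood_marker": v} for v in [1.2, 3.5, 0.8, 4.1, 2.9]]
noisy = laplace_count(patients, lambda r: r["blood_marker"] > 2.0, epsilon=0.5)
```

A smaller epsilon means more noise and stronger privacy; repeated runs return different answers, which is exactly what prevents an attacker from pinning down any individual's value.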

Still many challenges for research

The important thing is that the data retains its statistical utility under this noise. And that is not the only challenge. Several specialized algorithms often need to be used at once, and queries need to be logged and kept on record, because too many queries could reveal too much, even with noisy answers. The solution to these problems may be artificially produced data with strong privacy guarantees. "We publish synthetic data that meets DP standards and reflects the statistical properties of the real datasets, but is not subject to the same limitations in processing," says Zhang.
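The need to log queries comes from composition: each noisy answer spends part of a finite "privacy budget". A minimal sketch of such an accountant, assuming basic (additive) composition of epsilons, might look like this; the class and its names are illustrative, not from any particular library.

```python
class PrivacyBudget:
    """Minimal epsilon-budget accountant under basic composition:
    every query consumes part of a fixed total budget, and further
    queries are refused once the budget is exhausted."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Basic composition: epsilons of successive queries simply add up.
        if self.spent + epsilon > self.total:
            raise RuntimeError("Privacy budget exhausted; query refused.")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query: allowed
budget.charge(0.4)  # second query: allowed
# a third charge(0.4) would now raise, since 1.2 > 1.0
```

Real systems use tighter accounting (e.g. advanced composition or Rényi DP), but the principle is the same: without such bookkeeping, enough noisy answers would eventually reveal the underlying data.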

According to Zhang, the challenge in creating synthetic data under DP is to identify the most informative statistics. Only then can as much useful information as possible be extracted, even from complex datasets such as those that map people's movement patterns or their social connections within networks. He has published several papers on this research, including presentations at the prestigious USENIX Security Symposium.
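The core idea behind marginal-based synthesis (the approach used in work like PrivSyn, though the code below is a heavily simplified illustration of the general technique, not that system) can be shown for a single column: publish a noisy histogram under DP, then sample synthetic records from it. The age values and parameters here are invented for illustration.

```python
import numpy as np

def dp_synthetic_ages(ages, bins, epsilon, n_synth, rng=None):
    """Illustrative sketch: release one noisy marginal (a histogram of an
    age column) under epsilon-DP, then sample synthetic ages from it."""
    if rng is None:
        rng = np.random.default_rng()
    hist, edges = np.histogram(ages, bins=bins)
    # Each person falls into exactly one bin, so the histogram has
    # sensitivity 1 and Laplace(1/epsilon) noise per bin suffices.
    noisy = hist + rng.laplace(0.0, 1.0 / epsilon, size=hist.shape)
    # Negative counts make no sense; a tiny floor keeps the
    # sampling distribution well-defined.
    noisy = np.clip(noisy, 1e-9, None)
    probs = noisy / noisy.sum()
    chosen = rng.choice(len(probs), size=n_synth, p=probs)
    # Sample uniformly within each chosen bin.
    return rng.uniform(edges[chosen], edges[chosen + 1])

synthetic = dp_synthetic_ages([23, 35, 41, 52, 29, 67, 38], bins=5,
                              epsilon=1.0, n_synth=100)
```

The synthetic values follow roughly the same age distribution as the real data, yet only the noisy histogram ever touched the originals; the hard research question Zhang describes is choosing *which* marginals to release when the data has many interdependent columns.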

Multifaceted topic

Zhang has been conducting research beneath the California sun since October 2022. "I am a participant in the CISPA-Stanford program and am currently a visiting professor at Stanford University." Differential Privacy continues to keep him busy. "I'm currently researching, with a colleague at Stanford, the question of privacy protection within large language models, such as those behind ChatGPT, and what impact the use of Differential Privacy might have on such models." It's a gold rush.



PrivTrace: Differentially Private Trajectory Synthesis by Adaptive Markov Model

PrivSyn: Differentially Private Data Synthesis