The possibilities offered by machine learning (ML) are becoming increasingly diverse. For example, ML models predict the weather or traffic jams, make product recommendations to consumers, or ensure that autonomous vehicles interpret signs correctly and recognize obstacles in time. ML is also already used in medicine, where it is seen as a promising tool for the early detection and diagnosis of diseases. The models learn all this from large amounts of data, in some cases without any human assistance at all.
The first question developers and researchers always ask is: where do we get all this data? It has long been collected everywhere: in apps, from fitness trackers and smartwatches. But exchanging it securely is a significant challenge for research and industry. Many researchers worldwide are working on approaches to transfer the data to the models securely. Others, however, have opted for the opposite approach: they bring the models to the data. This is what happens, for example, in so-called federated machine learning, which Sebastian Stich is also working on intensively. "In federated machine learning, the collected data stays on the devices that collect it. It is not pooled on a server somewhere, as in centralized approaches. Instead, the devices evaluate their data locally and use the results to jointly train a centrally stored machine learning model."
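The idea Stich describes can be sketched in a few lines. The following is a minimal, illustrative simulation of federated averaging (all function names and the linear-regression task are assumptions for the example, not taken from Stich's work): each simulated device runs a few training steps on its own data, and only the resulting model weights are averaged by the server.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=10):
    """One device's local training: a few gradient steps of
    linear regression on its private data, which never leaves it."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def federated_round(global_w, clients):
    """Server round: every device trains locally; only the
    resulting model weights are sent back and averaged."""
    return np.mean([local_update(global_w, X, y) for X, y in clients], axis=0)

# Five simulated devices, each holding its own private data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(30):
    w = federated_round(w, clients)
```

After a few rounds the jointly trained model is close to the one centralized training would find, even though no raw data point ever left its device.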
This technology is already being used in smartphones, for example, to improve the autocorrect function of keyboards. Federated learning (FL) offers major advantages over centralized models in terms of data protection. However, even for federated models, sufficient privacy guarantees can only be given if it is ensured that the central model ultimately allows no identifying inferences about the data from the local submodels.
This is where so-called differential privacy comes into play. "Differential privacy is a mathematical model that can be used to measure privacy. The term is, however, often also used for the various mechanisms that produce stronger privacy protection. For example, adding 'noise' can specifically obscure private properties in the data," says Stich. However, too much noise also degrades data quality and can limit the effectiveness of models. "The trick is to find the right trade-off between performance and privacy protection."
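The noise-adding idea can be made concrete with the classic Laplace mechanism, one standard way of achieving differential privacy (a textbook sketch, not an algorithm from Stich's project; the function name and the age-data example are illustrative assumptions):

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of a bounded data set with epsilon-differential
    privacy via the Laplace mechanism: clip, compute, add noise."""
    clipped = np.clip(values, lower, upper)
    # One person can shift the clipped mean by at most (upper - lower) / n,
    # so the noise is calibrated to exactly that sensitivity.
    sensitivity = (upper - lower) / len(values)
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=1000).astype(float)
released = private_mean(ages, lower=0.0, upper=100.0, epsilon=5.0, rng=rng)
```

The parameter `epsilon` encodes the trade-off Stich mentions: a small epsilon means more noise and stronger privacy, a large epsilon means an answer closer to the true mean but weaker protection.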
To do that, you need effective algorithms. And that is precisely what Stich is working on in the Meta Research-funded project. Specifically, the mathematician's goal is to develop algorithms further so that they artificially alter the data as much as necessary, but as little as possible. That way, strong privacy guarantees can be given while the models still perform well. "So far, privacy is still mostly determined with respect to the entire data set. Simply put, existing algorithms add the same amount of noise to all data points. The results are then usually not satisfactory, either in terms of model performance or privacy protection." More fruitful, he believes, could be approaches that first analyze data sets in detail and then weigh which points need to be protected. "Such approaches already exist. But we need to understand them better. That's why I want to contribute even more to the theory and propose concrete improvements." As part of the project, the researcher also wants to continue working on improving communication between the local submodels and the central model.
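One existing approach in this spirit is per-example gradient clipping, as used in DP-SGD-style training: instead of one blanket noise level for the whole data set, each data point's influence on an update is bounded individually before calibrated noise is added. A minimal sketch (illustrative only, with assumed names and parameters; not the project's algorithm):

```python
import numpy as np

def private_step(w, X, y, clip, sigma, lr, rng):
    """Clip each example's gradient to bound its individual influence,
    then add noise scaled to that per-example bound -- rather than
    noising all data points by the same fixed amount."""
    grads = (X @ w - y)[:, None] * X                # per-example gradients
    norms = np.maximum(np.linalg.norm(grads, axis=1, keepdims=True), 1e-12)
    grads *= np.minimum(1.0, clip / norms)          # per-example clipping
    noisy = grads.sum(axis=0) + rng.normal(scale=sigma * clip, size=w.shape)
    return w - lr * noisy / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
X = rng.normal(size=(200, 2))
y = X @ true_w
w = np.zeros(2)
for _ in range(200):
    w = private_step(w, X, y, clip=1.0, sigma=0.01, lr=0.05, rng=rng)
```

Because the noise only has to mask what any single clipped example could contribute, the model can still learn accurately, which illustrates the per-point weighing of protection the article describes.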
It will probably be some time before FL is used on a large scale. However, Sebastian Stich's research is helping to lay the foundation for its safe and effective use. "I am delighted about the funding from Meta Research. It helps me to push my research further and hire junior scientists to help me do so."
translated by Oliver Schedler