PriSyn project makes medical data securely usable

Health data such as blood counts, genetic information or MRI findings of study participants plays a crucial role in the development of modern drugs and treatment methods. At the same time, this data is highly sensitive. It therefore enjoys special protection, and its use and disclosure must be strictly limited and controlled. The new project PriSyn (representative synthetic health data with strong privacy guarantees) is developing a method that enables the use of significantly more medical data than before while still guaranteeing data protection and the privacy of the study participants. The CISPA Helmholtz Center for Information Security, the German Center for Neurodegenerative Diseases (DZNE), the Saarbrücken-based startup QuantPi and the IT company Hewlett Packard Enterprise (HPE) are collaborating on the project. The German Federal Ministry of Education and Research (BMBF) is funding the three-year project with 2.2 million euros.

Researchers can already combine biomedical data from different fields and analyze these complex data sets using machine learning methods. In practice, however, it is still immensely difficult to combine health data from different sources, such as different clinics or, in some cases, different countries, while guaranteeing its protection.

There are already ways of anonymizing the data before it is passed on. Mechanisms based on differential privacy make it possible to give strong guarantees about privacy protection in this regard. "This means that algorithms and analyses are deliberately noisy so that the resulting fuzziness means it is no longer possible to draw conclusions about patient data," explains CISPA researcher Prof. Dr. Mario Fritz. "It is important that the data processing still retains its scientific and medical utility despite this noise." When it comes to data analysis, however, the use of such privacy-protecting mechanisms poses further challenges: "Special algorithms must be used for this purpose, and a kind of accounting must be performed of every access to the data. This is difficult to incorporate into researchers' existing workflows," says Fritz.
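To make the idea of deliberate noise and access accounting concrete, here is a minimal sketch in Python. It is not the PriSyn method: it assumes the textbook Laplace mechanism for a simple counting query and basic sequential composition for the budget, and the names `PrivacyAccountant`, `noisy_count` and the toy patient records are hypothetical.

```python
import random


class PrivacyAccountant:
    """Hypothetical bookkeeping of the privacy budget (epsilon) spent so far.

    Assumes basic sequential composition: the epsilons of all queries add up.
    """

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples is Laplace-distributed
    # with mean 0 and the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, accountant, rng):
    """Answer a counting query under epsilon-differential privacy."""
    accountant.charge(epsilon)  # every access to the data is accounted for
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy usage with made-up records (not real patient data).
rng = random.Random(42)
patients = [{"age": 71}, {"age": 45}, {"age": 83}, {"age": 66}]
accountant = PrivacyAccountant(total_epsilon=1.0)
answer = noisy_count(patients, lambda p: p["age"] > 60, 0.5, accountant, rng)
print(f"noisy count: {answer:.2f}, budget spent: {accountant.spent}")
```

In this sketch, a smaller epsilon means more noise and stronger protection, and once the total budget is used up, further queries are refused. That is the kind of bookkeeping Fritz describes as hard to fit into existing research workflows.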

The solution to these problems could be synthetic data with strong privacy guarantees (differential privacy). Such data can be produced using generative machine learning models. "AI trained under differential privacy thus creates artificial data that reflects the statistical properties of real data sets. At the same time, we can provide guarantees that there will be no privacy risks for patients even when sharing or accessing this data on multiple occasions," Fritz explains. He is coordinating the project and is pushing forward with research on trustworthy generative models. CISPA faculty member Dr. Yang Zhang will thoroughly vet the models' security against privacy leakage.
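The principle can be illustrated with a deliberately simple stand-in for a generative model. The sketch below does not train a neural network as the project intends. Instead, it assumes a single categorical attribute, builds a histogram with Laplace noise, and samples artificial records from it. The function names and the toy diagnosis labels are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials gives a Laplace(0, scale) sample.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(records, categories, epsilon, rng):
    """Noisy histogram over a fixed, publicly known list of categories.

    Adding or removing one record changes exactly one bin by 1, so Laplace
    noise with scale 1 / epsilon per bin is assumed to suffice here.
    """
    counts = Counter(records)
    # Clipping at zero is post-processing and does not weaken the guarantee.
    return {c: max(0.0, counts[c] + laplace_noise(1.0 / epsilon, rng))
            for c in categories}


def sample_synthetic(noisy_hist, n, rng):
    """Draw n synthetic records from the noisy histogram."""
    categories = list(noisy_hist)
    weights = [noisy_hist[c] for c in categories]
    if sum(weights) <= 0:
        weights = [1.0] * len(categories)  # fall back to a uniform draw
    return rng.choices(categories, weights=weights, k=n)


# Toy usage with made-up labels (not real patient data).
rng = random.Random(7)
categories = ["healthy", "mild", "severe"]
real = ["healthy"] * 60 + ["mild"] * 30 + ["severe"] * 10
hist = dp_histogram(real, categories, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(hist, n=100, rng=rng)
print(Counter(synthetic))
```

The real records are touched only once, when the noisy histogram is built. Sampling from it any number of times is post-processing and costs no additional privacy budget, which matches the guarantee Fritz describes for repeated sharing and access.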

But the security and trustworthiness of models are only half of the equation. To arrive at working models in the first place, researchers need large amounts of data with which to train the models for their tasks. Creating suitable data sets for each biomedical use case under investigation is the task of the DZNE. "We want to use DZNE study cohorts on a trial basis to develop clinical assistance systems for neurodegenerative diseases and compare their performance with systems trained with synthetic data. Of course, thanks to the synthetic data, the patient data will never be published or shared in the process," says Dr. Matthias Becker, who is working on the project with Dr. Maren Büttner at DZNE.

Making the quality of the synthetic data measurable for the respective use case is the task of the Saarbrücken-based startup QuantPi. Co-founder and head of research Dr. Antoine Gautier says, "Research is still being done on how to assess the quality of synthetic data and their generators. However, such testing is closely related to assessing risks to the trustworthiness of AI-based systems - a core function of the QuantPi platform. Therefore, QuantPi will identify appropriate measures and benchmark experiments that can accurately analyze and also control the necessary trade-off between privacy protection and utility. In addition, metrics should reflect trustworthiness risks in terms of potential data quality issues, bias, and discrimination in the synthetic data. The high-dimensional biological data and the black-box generative process pose additional challenges in evaluating the utility of the synthetic data."

Still, the very best research is of no use if it cannot be applied in the real world. To make that possible, IT giant Hewlett Packard Enterprise (HPE) will focus on bringing the models onto hardware that can be used efficiently and is easy for users to understand. "To drive broad adoption by physicians, the local software and hardware must meet three key criteria: efficiency of implementation, ease of use, and end-to-end security. Another key priority is platform independence to enable a truly open ecosystem of sovereign data owners," says Hartmut Schultze, Lead Architect at HPE.

Mario Fritz is convinced that there is a lot of interest in the use of generative models in biomedicine. "With this project, we want to improve the use of the existing potential of health data."