
Sebastian Klöckner

Virtual cohorts for medical research

Artificial intelligence for the benefit of data protection
Researchers at the DZNE and the Helmholtz Center for Information Security (CISPA) intend to use “artificial intelligence” to facilitate the transfer of genomic data for research purposes under the strict constraints of data protection. Their aim is to generate data on “virtual cohorts” that contain key information from real study subjects but do not allow conclusions to be drawn on individuals. The project named “PRO-GENE-GEN” has a total volume of about 360,000 euros. It is funded by DZNE, CISPA and the Helmholtz Association over the next three years.

The human genome is a veritable treasure trove for therapeutic research. It is here that mutations can be found that promote or even directly trigger certain diseases. In addition to genomic data in the strict sense, the “transcriptome” is also relevant, as it contains information about which parts of the genome are actually active, a pattern that can change due to disease. However, scientific studies involving human participants are extremely complex. There is therefore great demand for making data from such studies generally accessible: for example, to allow research results to be verified or to make the data available to research projects that only emerge afterwards. “Today’s medical research is data-driven. So-called big data is considered to be a key to the development of personalized therapies that are better tailored to each individual than conventional treatments,” explains Dr. Matthias Becker, a bioinformatician at the DZNE’s Bonn site.

Privacy protection
However, sharing study data has so far been possible only to a limited extent, said Becker: “Handling is subject to strict legal regulations, because the data is person-related. Although there are mechanisms to protect privacy, these are either burdensome or cannot be implemented in practice. Genomic data in particular therefore cannot be shared to the extent necessary to ensure scientific progress.”

The team of DZNE and CISPA scientists therefore aims to develop methods to improve the dissemination of such information. “On the basis of real genome data, we want to create synthetic data sets that contain key information from the original data while fully safeguarding privacy. In a sense, it is a matter of creating a data protection-compliant replica. This is somewhat similar to a witness testimony, where the voice is altered to protect identity,” said Becker. “This allows large data sets to be made publicly accessible, which is enormously important for progress in medicine.”

Learning algorithms
The researchers start by addressing specific topics. “For example, it might be about the pattern of gene expression. In other words, about the question of which genes are active in a certain disease,” said Prof. Mario Fritz, a CISPA scientist who is coordinating the research project jointly with Becker. “We want to train learning-based algorithms to recognize such patterns in genomic data. This is a challenge because even the data contained in a single genome is extremely complex. And the algorithms use data from hundreds or even thousands of people. This is where the strengths of machine learning come in.”

Furthermore, by means of an approach called “privacy-compliant generative modelling”, the researchers intend to develop computer models that reproduce the quintessence of such data patterns while at the same time hiding personal identity. “One can perhaps think of it as a smart filtering device,” said Fritz. This way, real data should be converted into synthetic data. “If we start with the data from a thousand people, we end up with a similarly large number of synthetic data sets. This is what we call a virtual cohort. Its data can be analyzed in the same way as real data with common tools of genetic research, but without compromising the privacy of the real individuals.”
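
The idea of turning a real cohort into a virtual one can be illustrated with a deliberately simple sketch. It is not the PRO-GENE-GEN method, which the project has yet to develop. It assumes differential privacy as the privacy mechanism, models each gene independently as a Gaussian (so correlations between genes are ignored), and uses hypothetical names throughout (`fit_private_gene_model`, `sample_virtual_cohort`, `upper`, `epsilon`). The budget `epsilon` is spent per gene, so under basic composition the total budget grows with the number of genes.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two i.i.d. Exp(1) variables is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def fit_private_gene_model(cohort, upper, epsilon, rng):
    """Return one noisy (mean, std) pair per gene.

    cohort  : list of individuals, each a list of expression values
    upper   : assumed clipping bound; values are clipped to [0, upper]
    epsilon : privacy budget spent per gene (illustrative assumption)
    """
    n = len(cohort)
    n_genes = len(cohort[0])
    model = []
    for g in range(n_genes):
        values = [min(max(person[g], 0.0), upper) for person in cohort]
        mean = sum(values) / n
        second_moment = sum(v * v for v in values) / n
        # Replacing one individual shifts the mean by at most upper / n and
        # the second moment by at most upper**2 / n. Each statistic gets
        # half of the per-gene budget.
        half = epsilon / 2
        noisy_mean = mean + laplace_noise(upper / n / half, rng)
        noisy_m2 = second_moment + laplace_noise(upper**2 / n / half, rng)
        variance = max(noisy_m2 - noisy_mean**2, 1e-6)
        model.append((noisy_mean, math.sqrt(variance)))
    return model


def sample_virtual_cohort(model, size, upper, rng):
    """Sample synthetic individuals from the noisy per-gene model only."""
    return [
        [min(max(rng.gauss(mu, sigma), 0.0), upper) for mu, sigma in model]
        for _ in range(size)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy stand-in for real expression data: 1000 people, 3 genes.
    real = [[rng.uniform(4, 6), rng.uniform(0, 1), rng.uniform(2, 8)]
            for _ in range(1000)]
    model = fit_private_gene_model(real, upper=10.0, epsilon=1.0, rng=rng)
    virtual = sample_virtual_cohort(model, size=1000, upper=10.0, rng=rng)
    print(len(virtual), "synthetic individuals with", len(virtual[0]), "genes")
```

The synthetic individuals are drawn from the noisy summary statistics alone, never from a real person's record, which is the sense in which such a model acts as a filter between real and virtual cohorts.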

Loss of information during the conversion from real to synthetic data cannot be ruled out. The DZNE and CISPA team plans to investigate and quantify how significant these losses are. “It’s not a question of a complete copy,” said Becker. “However, ideally, the synthetic data should not only be tailored to a specific question, but should be as generally usable as possible. For example, for research on dementia, research on cancer or generally for research into widespread diseases in which genetics is important. Ultimately, the aim is to create options for data exchange that can be used on a broad scale within the scientific community.”
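
One way such losses might be quantified is to compare simple statistics of the real and the synthetic cohort. The sketch below is an illustration only, not the team's evaluation protocol: the function `information_loss` and the two metrics it reports (largest gap in per-gene means, largest gap in gene-gene Pearson correlations) are assumptions chosen for brevity.

```python
import math


def column(cohort, g):
    """Expression values of gene g across all individuals."""
    return [person[g] for person in cohort]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def information_loss(real, synthetic):
    """Compare per-gene means and pairwise correlations of two cohorts."""
    n_genes = len(real[0])
    mean_gaps = []
    for g in range(n_genes):
        r, s = column(real, g), column(synthetic, g)
        mean_gaps.append(abs(sum(r) / len(r) - sum(s) / len(s)))
    corr_gaps = []
    for g in range(n_genes):
        for h in range(g + 1, n_genes):
            r_corr = pearson(column(real, g), column(real, h))
            s_corr = pearson(column(synthetic, g), column(synthetic, h))
            corr_gaps.append(abs(r_corr - s_corr))
    return {
        "max_mean_gap": max(mean_gaps),
        "max_correlation_gap": max(corr_gaps, default=0.0),
    }


if __name__ == "__main__":
    # Toy cohorts: gene 1 rises with gene 0 in the real data but falls
    # with it in the synthetic data, so the correlation structure is lost.
    real = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    synthetic = [[1.0, 6.0], [2.0, 4.0], [3.0, 2.0]]
    print(information_loss(real, synthetic))
```

In the toy example the per-gene means match exactly while the correlation between the two genes is reversed, which shows why a single summary statistic is not enough to judge how generally usable a virtual cohort is.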

The PRO-GENE-GEN project was developed within the framework of the “Helmholtz Medical Security, Privacy, and AI Research Center” (HMSP). The HMSP is an association of six Helmholtz Centers, including CISPA and DZNE, that address key challenges in the field of health research with a focus on security, privacy, and artificial intelligence.

About the German Center for Neurodegenerative Diseases (DZNE)
The DZNE investigates all aspects of neurodegenerative diseases (such as Alzheimer’s, Parkinson’s and amyotrophic lateral sclerosis) in order to develop novel approaches to prevention, treatment, and health care. The DZNE comprises ten sites across Germany and cooperates closely with universities, university hospitals, and other institutions on a national and international level. The DZNE is a member of the Helmholtz Association.

About the CISPA Helmholtz Center for Information Security
CISPA – located in Saarbrücken – is one of the world’s leading research institutions in information security and privacy, with a dedicated focus on addressing the grand research challenges in security and privacy in a comprehensive and holistic manner. Medical security and privacy, as well as foundational research in AI/machine learning have been topics of central importance for CISPA ever since its inauguration.