Annabelle Theobald

How machines learn to forget

Unlike humans, artificial intelligence (AI) truly has the memory of an elephant. Once the underlying machine learning models have learned from data, they never forget it. Humans, however, have a right to precisely this kind of forgetting - at least if their personal data was part of the AI's training process. Making a model forget is called machine unlearning. In her paper "Graph Unlearning," CISPA researcher Min Chen shows how this can work effectively even for complex machine learning models. She is presenting her work at the prestigious IT security conference CCS.

The right to be forgotten - a kind of veto over the indefinite storage of one's data - has been a much-discussed topic since the early 2000s. With the General Data Protection Regulation, which took effect in 2018, the EU finally enshrined it for all its citizens. Each case of deletion must be evaluated on its own merits, because the right to be forgotten is weighed against freedom of expression and freedom of the press, which can outweigh it in individual cases. In practice, people have since regularly fought with search engines, Google in particular, over the deletion of embarrassing videos, unflattering images, and outdated reporting.

"In the context of machine learning, implementing the right to be forgotten would mean that providers would delete user data from their model's training set upon request," explains Min Chen. This is a legitimate concern, but deleting training data is not as easy as it sounds. After all, it is often impossible, or only partially possible, to trace what exactly a machine learning model has learned from the data it was shown, and how. Removing individual records and their influence on model predictions without a trace is therefore hardly feasible. "Complete retraining with a cleaned dataset is time-consuming and often costly for models trained on large datasets," Chen explains.

For relatively simple ML models that work with image or text data, a better solution than retraining has recently become available: the SISA (Sharded, Isolated, Sliced, and Aggregated) algorithm. "It randomly splits the AI's training data before training starts. A separate small machine learning model is trained on each part of the dataset. If the models are run in parallel, they can be as effective as one large one. If a person now requests the deletion of their data, only the sub-model trained on the shard containing that data needs to be retrained," Chen explains.
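The sharding scheme Chen describes can be sketched in a few lines. The code below is a toy illustration, not the SISA authors' implementation: all names are made up, and a trivial nearest-centroid classifier stands in for the real sub-models.

```python
# Toy sketch of SISA-style sharded training and unlearning (illustrative
# names and a stand-in nearest-centroid "model", not the authors' code).
from collections import Counter

NUM_SHARDS = 4

def shard_of(record_id, num_shards=NUM_SHARDS):
    # Deterministic assignment, so a record's shard can be found later.
    return hash(record_id) % num_shards

def train_shard(records):
    # Stand-in "training": per-class feature means (nearest-centroid model).
    sums, counts = {}, {}
    for _, x, y in records:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_shard(model, x):
    # Predict the class whose centroid is closest to the feature value.
    return min(model, key=lambda y: abs(model[y] - x))

def predict(models, x):
    # Aggregate the sub-models by majority vote.
    votes = Counter(predict_shard(m, x) for m in models if m)
    return votes.most_common(1)[0][0]

def unlearn(shards, models, record_id):
    # Delete the record, then retrain only its shard -- the core saving
    # compared to retraining one big model from scratch.
    s = shard_of(record_id)
    shards[s] = [r for r in shards[s] if r[0] != record_id]
    models[s] = train_shard(shards[s])

# Toy dataset: (record_id, feature, label); class 0 near 0, class 1 near 10.
data = [(i, i * 0.01, 0) for i in range(20)] + \
       [(100 + i, 10.0 + i * 0.01, 1) for i in range(20)]
shards = [[] for _ in range(NUM_SHARDS)]
for record in data:
    shards[shard_of(record[0])].append(record)
models = [train_shard(s) for s in shards]
```

Deleting record 5 via `unlearn(shards, models, 5)` retrains only one of the four sub-models; the other three are untouched.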

However, this approach cannot be applied in the same way to complex machine learning models such as graph neural networks (GNNs). "Graph neural networks are models that can also represent complex network structures, such as social networks, traffic networks, or financial networks. If their training data were simply split randomly, as with other ML models, the usefulness of the graph-based models trained on them would be enormously limited."
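A quick experiment illustrates the problem Chen describes. In the toy sketch below (the graph, shard count, and seed are all assumptions for the demo), randomly sharding a ring graph's nodes severs most of its edges, so no sub-model would see the full structure it is supposed to learn from:

```python
# Illustrative only: randomly sharding a graph's nodes, as a SISA-style
# split would, cuts most edges between shards.
import random

random.seed(0)

# A small ring graph: node i is connected to node i + 1.
num_nodes = 100
edges = [(i, (i + 1) % num_nodes) for i in range(num_nodes)]

# Assign each node to one of 4 shards uniformly at random.
shard = {node: random.randrange(4) for node in range(num_nodes)}

# An edge is "cut" when its endpoints land in different shards --
# neither sub-model can use that connection during training.
cut = sum(1 for u, v in edges if shard[u] != shard[v])
print(f"{cut} of {len(edges)} edges are cut")  # roughly 3/4 on average
```

With 4 shards, a random edge survives only when both endpoints happen to land in the same shard, so about three quarters of the structure is lost before training even begins.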

Together with colleagues at Purdue University and CISPA, Min Chen has therefore developed an entirely new method of machine unlearning that can also be applied to GNNs. She has called her approach Graph Eraser. In addition to two new algorithms that partition graph data in a way that preserves its structure, Chen also presents a new learning-based method that combines the sub-models' predictions in a mathematically meaningful way.
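The aggregation idea can be sketched roughly as follows. This is an assumed, heavily simplified formulation for illustration, not the method from the paper: one importance weight per shard model is learned by gradient descent on held-out examples, so more reliable sub-models count for more in the final prediction than an equal-weight vote would allow.

```python
# Simplified sketch of learning-based aggregation (assumed formulation
# for illustration, not the paper's exact method).

# Assumed toy setup: 3 shard models emit a score in [0, 1] per example.
shard_scores = [
    [0.9, 0.8, 0.2, 0.1],  # shard model 0: fairly accurate
    [0.6, 0.7, 0.4, 0.3],  # shard model 1: weaker
    [0.5, 0.5, 0.5, 0.5],  # shard model 2: uninformative
]
labels = [1.0, 1.0, 0.0, 0.0]  # held-out ground truth
n = len(labels)

weights = [1.0 / 3] * 3  # start from the uniform, majority-vote-like mix

def combined(i):
    # Weighted aggregate score for held-out example i.
    return sum(w * s[i] for w, s in zip(weights, shard_scores))

lr = 0.5
for _ in range(200):
    # Gradient of the mean squared error w.r.t. each shard weight;
    # the clamp keeps weights non-negative.
    grads = [sum(2 * (combined(i) - labels[i]) * shard_scores[k][i]
                 for i in range(n)) / n
             for k in range(3)]
    weights = [max(0.0, w - lr * g) for w, g in zip(weights, grads)]
```

After optimization, the accurate shard model ends up with the largest weight, while the uninformative one is pushed toward zero.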

"Applied to large, real-world datasets, Graph Eraser is nearly 36 times faster than retraining the model. It is still twice as fast on small datasets," says Min Chen. Her paper, "Graph Unlearning," has also received much attention in the community and has been accepted at the top IT security conference CCS, where she will present her work in November 2022.

Min Chen has been conducting research in the group of CISPA Founding Director and CEO Prof. Dr. Michael Backes since August 2019 and will soon enter her final year as a PhD student. "The Graph Eraser is just the beginning of our research on Machine Unlearning on GNNs. We continue to work on effective and elegant solutions."

translated by Oliver Schedler