
13 CISPA PAPERS AT ICLR 2025

The International Conference on Learning Representations (ICLR) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence called representation learning, generally referred to as deep learning. The Thirteenth International Conference on Learning Representations took place at the Singapore EXPO in Singapore from April 24 to April 28, 2025.

Instruction-tuned Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility. The source code and SEP dataset are openly accessible at: https://github.com/egozverev/Should-It-Be-Executed-Or-Processed
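
As a rough illustration of what such an empirical check can look like, the hypothetical sketch below places the same probe instruction once in the instruction block and once in the data block and compares whether the model acts on it. `query_model`, `executes_probe`, and the prompt layout are placeholders rather than the paper's released code.

```python
# A minimal sketch of a separation probe, under the assumptions stated above:
# a probe instruction should be executed when it appears among the instructions
# but treated as plain text when it appears inside the data.

def query_model(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., any chat-completion client)."""
    raise NotImplementedError

def executes_probe(output: str, witness: str) -> bool:
    # The probe counts as "executed" if its witness string shows up in the
    # output (a deliberately simple heuristic, used only for illustration).
    return witness.lower() in output.lower()

def separation_probe(task: str, data: str, probe: str, witness: str) -> dict:
    # Case 1: probe delivered as part of the instructions (should be executed).
    out_instr = query_model(system_prompt=f"{task}\n{probe}", user_prompt=data)
    # Case 2: probe embedded in the data block (should be processed, not executed).
    out_data = query_model(system_prompt=task, user_prompt=f"{data}\n{probe}")
    return {
        "executed_in_instruction": executes_probe(out_instr, witness),
        "executed_in_data": executes_probe(out_data, witness),  # leakage if True
    }
```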

This research investigates how CLIP (Contrastive Language–Image Pre-training) models memorize training data and proposes strategies to mitigate this memorization. CLIP models are known for effectively aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. However, the extent and nature of their memorization of training data were previously unclear.

To address this, we introduce a formal definition called CLIPMem to quantify memorization within CLIP models. Our findings suggest that CLIP's memorization behavior lies between the supervised and self-supervised learning paradigms. Notably, we observe that "mis-captioned" samples—instances where images are paired with incorrect captions—exhibit the highest levels of memorization. Additionally, the study reveals that the text encoder contributes more significantly to memorization than the image encoder, indicating that mitigation efforts should primarily focus on the text component.
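
The exact CLIPMem definition is given in the paper. Purely as an illustration of how memorization of an image-caption pair can be quantified in principle, the sketch below uses a generic leave-one-out-style proxy that compares the alignment a sample receives from the model trained on it against reference models that never saw it; the names and scoring rule are assumptions, not the paper's formulation.

```python
# Illustrative only: a generic leave-one-out-style memorization proxy for a
# CLIP-like model, NOT the paper's CLIPMem definition.
import torch
import torch.nn.functional as F

def alignment(model, image: torch.Tensor, text_tokens: torch.Tensor) -> float:
    # Cosine similarity between normalized image and text embeddings;
    # `model` is assumed to expose encode_image / encode_text (open_clip style).
    with torch.no_grad():
        img = F.normalize(model.encode_image(image.unsqueeze(0)), dim=-1)
        txt = F.normalize(model.encode_text(text_tokens), dim=-1)
    return (img @ txt.T).item()

def memorization_score(target_model, reference_models, image, text_tokens) -> float:
    # High score: the trained-on model aligns this pair much better than
    # models that never saw it, one simple way to flag memorization.
    ref = sum(alignment(m, image, text_tokens) for m in reference_models)
    ref /= max(len(reference_models), 1)
    return alignment(target_model, image, text_tokens) - ref
```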

Building on these insights, we propose several strategies to reduce memorization in CLIP models. Remarkably, these strategies not only decrease memorization but also enhance the model's utility, challenging the conventional trade-off where reducing memorization typically leads to decreased performance. This work provides a deeper understanding of CLIP models and offers practical approaches to balance memorization and generalization, contributing to the development of more robust and reliable multi-modal models.

This research addresses the challenge of balancing privacy and model performance in federated learning, where many users collaboratively train a model without sharing their private data. Federated learning is often combined with differential privacy (DP), which protects individual users by adding noise to their contributions. However, this noise can reduce the model's accuracy. A common approach has been to spend each client's privacy budget uniformly over time, regardless of the learning phase.

We propose a new framework that allocates privacy budgets non-uniformly over time. In early training rounds, where the model typically learns general patterns (coarse features), clients spend less of their privacy budget. In later rounds, when the model benefits from more precise information (fine-grained features), they spend more. This approach is called spend-as-you-go and is tailored to each client’s privacy preference. Crucially, the strategy does not depend on a client’s data, which prevents unintended privacy leakage.
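
A minimal sketch of such a schedule is shown below. It assumes, purely for illustration, that per-round privacy costs simply add up to the client's total budget (basic composition); real deployments would use a tighter privacy accountant, and the geometric growth factor is an arbitrary choice.

```python
# Sketch of a non-uniform, increasing privacy-budget schedule under the
# simplifying assumption that per-round epsilons sum to the total budget.

def increasing_budget_schedule(total_epsilon: float, num_rounds: int,
                               growth: float = 1.1) -> list[float]:
    # Geometrically increasing weights: spend little early (coarse features),
    # more later (fine-grained features).
    weights = [growth ** t for t in range(num_rounds)]
    scale = total_epsilon / sum(weights)
    return [w * scale for w in weights]

schedule = increasing_budget_schedule(total_epsilon=8.0, num_rounds=20)
assert abs(sum(schedule) - 8.0) < 1e-9
print(f"round 1 spends {schedule[0]:.3f}, round 20 spends {schedule[-1]:.3f}")
```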

The paper includes both theoretical analysis and practical experiments. Theoretical results show that clients with stricter privacy budgets benefit more from this uneven spending. Experimental evaluations across several benchmark datasets (e.g., MNIST, CIFAR10) confirm that this approach improves model accuracy compared to existing baselines while still respecting all privacy constraints.

This research contributes to building more effective and flexible privacy-preserving AI systems. By enabling better model performance without compromising user privacy, it supports broader adoption of federated learning in sensitive domains like healthcare, finance, and mobile applications.

Document Visual Question Answering (DocVQA) has introduced a new paradigm for end-to-end document understanding, and quickly became one of the standard benchmarks for multimodal LLMs. Automating document processing workflows, driven by DocVQA models, presents significant potential for many business sectors. However, documents tend to contain highly sensitive information, raising concerns about privacy risks associated with training such DocVQA models. One significant privacy vulnerability, exploited by the membership inference attack, is the possibility for an adversary to determine if a particular record was part of the model's training data. In this paper, we introduce two novel membership inference attacks tailored specifically to DocVQA models. These attacks are designed for two different adversarial scenarios: a white-box setting, where the attacker has full access to the model architecture and parameters, and a black-box setting, where only the model's outputs are available. Notably, our attacks assume the adversary lacks access to auxiliary datasets, which is more realistic in practice but also more challenging. Our unsupervised methods outperform existing state-of-the-art membership inference attacks across a variety of DocVQA models and datasets, demonstrating their effectiveness and highlighting the privacy risks in this domain.
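
To make the black-box, no-auxiliary-data setting concrete, the sketch below shows a generic score-based membership signal with an unsupervised threshold; it illustrates the general attack family, not the specific attacks proposed in the paper, and the toy scores are synthetic.

```python
# Generic score-based, black-box membership signal: records on which the model
# is unusually confident are flagged as likely training members, with the
# threshold chosen in an unsupervised way (here: a simple 1-D 2-means split).
import numpy as np

def unsupervised_membership_split(confidences: np.ndarray, iters: int = 50):
    lo, hi = confidences.min(), confidences.max()
    centers = np.array([lo, hi], dtype=float)
    for _ in range(iters):
        assign = np.abs(confidences[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = confidences[assign == k].mean()
    member_cluster = int(np.argmax(centers))   # higher-confidence cluster
    return assign == member_cluster            # boolean membership guesses

scores = np.concatenate([np.random.beta(8, 2, 100),   # toy "members"
                         np.random.beta(2, 2, 100)])  # toy "non-members"
guesses = unsupervised_membership_split(scores)
```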

This paper revisits the robust overfitting phenomenon of adversarial training. Observing that models with better robust generalization performance are less certain in predicting adversarially generated training inputs, we argue that overconfidence in predicting adversarial examples is a potential cause. Therefore, we hypothesize that generating less certain adversarial examples improves robust generalization, and propose a formal definition of adversarial certainty that captures the variance of the model's predicted logits on adversarial examples. Our theoretical analysis of synthetic distributions characterizes the connection between adversarial certainty and robust generalization. Accordingly, built upon the notion of adversarial certainty, we develop a general method to search for models that can generate training-time adversarial inputs with reduced certainty, while maintaining the model's capability in distinguishing adversarial examples. Extensive experiments on image benchmarks demonstrate that our method effectively learns models with consistently improved robustness and mitigates robust overfitting, confirming the importance of generating less certain adversarial examples for robust generalization. Our implementations are available as open-source code at: https://github.com/TrustMLRG/AdvCertainty
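
As a rough illustration of the quantity involved, the sketch below computes the spread of a model's logits on a batch of adversarial inputs; the exact definition and normalization used in the paper may differ.

```python
# Illustrative computation of "how certain" a model is on adversarial inputs,
# measured as the variance of its logits; the paper's precise definition of
# adversarial certainty may differ.
import torch

def adversarial_certainty(model: torch.nn.Module, x_adv: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        logits = model(x_adv)                          # shape: (batch, num_classes)
    per_example_var = logits.var(dim=1, unbiased=False)  # spread across classes
    return per_example_var.mean()                      # lower value = less certain
```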

This research investigates how to improve the performance of Graph Neural Networks (GNNs), which are machine learning models designed to work on graph-structured data. They are often used to solve problems related to drug design or social network analysis. A common problem with GNNs is that they can become less effective when the structure of the graph hinders learning: either nodes cannot exchange information effectively (a problem called "over-squashing") or the representations of different nodes become too similar (called "over-smoothing"). Traditionally, researchers have tried to fix this by changing the structure of the graph to make information flow more easily, often by increasing a mathematical property known as the "spectral gap."

However, this paper shows that increasing the spectral gap isn’t always the best strategy. In some cases, especially when the way a graph is organized (its communities) matches well with the categories we want the model to predict (like types of users or items), it’s better to reduce the spectral gap to preserve those communities. This can improve the model’s accuracy.
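
For readers unfamiliar with the term, the short sketch below computes one common version of the spectral gap, the second-smallest eigenvalue of the normalized graph Laplacian; conventions vary, and the paper's precise definition may differ.

```python
# Small sketch of the quantity being tuned: the spectral gap of a graph,
# taken here as the second-smallest eigenvalue of the normalized Laplacian.
import numpy as np
import networkx as nx

def spectral_gap(graph: nx.Graph) -> float:
    lap = nx.normalized_laplacian_matrix(graph).toarray()
    eigenvalues = np.sort(np.linalg.eigvalsh(lap))
    return float(eigenvalues[1])   # lambda_2: larger gap = better-mixed graph

# A graph with two dense communities and few cross-edges has a small gap.
two_communities = nx.planted_partition_graph(2, 50, p_in=0.2, p_out=0.05, seed=0)
print(f"spectral gap of a strongly clustered graph: {spectral_gap(two_communities):.4f}")
```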

Accordingly, the authors propose three new ways to adjust graph structures:

- ComMa, which reinforces or weakens community structures depending on the task;

- FeaSt, which connects nodes with similar features to enhance meaningful information sharing (a minimal sketch of this idea follows the list);

- ComFy, a hybrid method combining both community structure and feature similarity.
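
The sketch below illustrates the FeaSt idea in isolation: candidate edges are added between nodes whose features are most similar, so that message passing connects semantically related nodes. It is a hypothetical illustration, not the authors' implementation.

```python
# Hypothetical sketch of feature-similarity-based rewiring: propose new edges
# between each node and its most similar neighbors in feature space.
import torch
import torch.nn.functional as F

def feature_similarity_edges(features: torch.Tensor, edges_per_node: int = 2) -> torch.Tensor:
    # features: (num_nodes, dim). Returns candidate edges as a (2, E) index tensor.
    normed = F.normalize(features, dim=1)
    sim = normed @ normed.T                        # pairwise cosine similarity
    sim.fill_diagonal_(-float("inf"))              # exclude self-loops
    nearest = sim.topk(edges_per_node, dim=1).indices
    src = torch.arange(features.size(0)).repeat_interleave(edges_per_node)
    return torch.stack([src, nearest.reshape(-1)])

new_edges = feature_similarity_edges(torch.randn(100, 16))
```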

Experiments on both synthetic and real-world graphs show that these new methods can outperform existing techniques, especially when they consider how well the structure of the graph aligns with the task at hand.

This research benefits society by improving the accuracy and reliability of GNNs in applications such as social network analysis, recommendation systems, or biological network studies, enabling better insights and decisions based on complex relational data.

This research focuses on making deep neural networks more efficient by reducing the number of parameters they use—a process known as sparsification. Specifically, it explores how to achieve this reduction through a method called "continuous sparsification," where networks are trained to gradually deactivate unnecessary parts. Traditionally, sparsification relied on adding explicit rules (like L1 regularization) to penalize excess parameters. However, we argue that better results can be achieved through an implicit form of regularization that naturally emerges when both the weights and their corresponding activity levels (or "masks") are learned together.

We provide a theoretical explanation of how this implicit regularization works. We show that, early in training, it behaves like a gentle (L2) penalty, allowing the network to explore and learn flexibly. Later in training, it shifts toward a stricter (L1) penalty, which encourages sparsity. Based on this, we introduce a new algorithm called PILoT that actively controls this shift over time, ensuring a smoother and more effective sparsification process.
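
The sketch below loosely mirrors the described shift with an explicit penalty that interpolates from an L2-like term early in training to an L1-like term later; it is an illustration of the idea, not the PILoT algorithm itself.

```python
# Illustrative penalty that moves from L2-like (exploration) to L1-like
# (sparsity) as training progresses; not the PILoT algorithm.
import torch

def interpolated_penalty(weights: torch.Tensor, progress: float) -> torch.Tensor:
    # progress = 0.0 at the start of training, 1.0 at the end.
    l2 = weights.pow(2).sum()
    l1 = weights.abs().sum()
    return (1.0 - progress) * l2 + progress * l1

w = torch.randn(1000, requires_grad=True)
loss = (w.sum() - 1.0).pow(2) + 1e-3 * interpolated_penalty(w, progress=0.8)
loss.backward()
```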

Experimental results support the theory. PILoT outperforms other popular methods in reducing model size without sacrificing performance, especially in settings where very high sparsity is needed. It also works well when combined with existing sparsification techniques.

This research is valuable for making large-scale neural networks more resource-efficient. By enabling models to run faster and require less memory without losing accuracy, it contributes to more sustainable and accessible AI—particularly important as machine learning continues to expand into everyday applications and devices.

This research paper investigates how to efficiently optimize a broad class of mathematical functions, known as (L₀, L₁)-smooth functions, which are especially relevant for modern machine learning models. Traditional optimization methods rely on a restrictive assumption about the function's smoothness, often not suitable for complex models like deep neural networks. The (L₀, L₁)-smoothness condition is a more flexible and realistic alternative that better reflects real-world scenarios.
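
For reference, the condition is usually stated as follows for twice-differentiable functions, together with the gradient-dependent stepsize it naturally suggests; the exact constants and variants analyzed in the paper may differ.

```latex
% Standard (L_0, L_1)-smoothness condition and the stepsize it suggests;
% the precise variants analyzed in the paper may differ.
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\| \qquad \text{for all } x,
\]
\[
  x_{k+1} \;=\; x_k - \eta_k \,\nabla f(x_k),
  \qquad
  \eta_k \;\propto\; \frac{1}{L_0 + L_1\,\|\nabla f(x_k)\|}.
\]
```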

We develop new mathematical tools and gradient-based algorithms that work well under this generalized smoothness. We show that by carefully choosing how big each optimization step should be—based on the current state of the model—one can achieve faster and more reliable convergence to an optimal solution. Importantly, we also prove that widely used methods like the normalized gradient method and Polyak stepsizes remain effective even when the exact smoothness parameters are unknown. Additionally, we introduce an improved accelerated method that performs better than existing approaches, both in theory and practice.

Extensive theoretical analysis is supported by experiments that confirm the benefits of the proposed methods. These improvements include lower computational complexity and greater adaptability to different types of problems, especially when dealing with large models or high precision requirements.

This work contributes to the field by offering more efficient and flexible optimization techniques, which are foundational for training machine learning models. These advancements can ultimately lead to faster, more energy-efficient systems in applications ranging from natural language processing to scientific computing, benefiting both developers and end users by reducing training time and computational costs.

This research investigates how modern text-to-image diffusion models generate textual content in images. We introduce a novel method for editing text within images generated by diffusion models, which modifies very few parameters and leaves the other visual content intact. Through activation patching, we show that less than 1% of a diffusion model's parameters—specifically in the attention layers—are responsible for generating the text content within images. This finding is consistent across different diffusion models, including both U-Net and transformer-based architectures.

Based on this insight, we introduce a technique to localize these responsible layers. We demonstrate several practical benefits of this approach. First, by applying fine-tuning only to these localized layers using a method called LoRA, we significantly improve text rendering in generated images without compromising image quality or diversity. Second, we show that this selective control allows for precise editing of the text in generated images—changing words while preserving all other visual elements. Third, we adapt the technique to suppress harmful or toxic text in images at no extra computational cost, which could help improve safety in AI-generated content.
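
The sketch below illustrates the general localization idea: only selected attention projections are wrapped with low-rank adapters while everything else stays frozen. The layer-name filter is hypothetical (the actual layers are identified by the patching procedure), and the adapter is a hand-rolled stand-in for a LoRA implementation.

```python
# Sketch: attach low-rank adapters only to attention layers whose names match
# an (assumed) filter, keeping all other parameters frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.lora_a, std=0.02)         # B starts at zero (no-op init)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def add_lora_to_localized_layers(model: nn.Module, name_filter: str = "attn"):
    # Replace only the linear layers whose qualified names match the filter.
    for name, module in list(model.named_modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and name_filter in f"{name}.{child_name}":
                setattr(module, child_name, LoRALinear(child))
```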

Our study also compares this technique with existing editing methods and shows that our approach is generally more accurate and efficient. Our findings contribute to better interpretability and control of diffusion models.

This research benefits society by making text-to-image generation more controllable, efficient, and safer. Applications include more reliable content moderation, especially in settings where text quality in images and safety are critical.

This research addresses a key challenge in decentralized machine learning: how to maintain fast and stable training performance when many computing nodes are involved. In decentralized systems, each node typically communicates only with a few neighbors to reduce communication costs. However, as the number of nodes increases, this limited communication slows down training because nodes struggle to stay synchronized.

We propose a new approach called TELEPORTATION to overcome this problem. Instead of greedily activating all available nodes, TELEPORTATION activates only a subset of nodes randomly during each training step. These active nodes fetch parameters from previously active nodes, update them using stochastic gradient descent (SGD), and then share results among themselves in a small, temporary communication group. By focusing only on a small group at a time, this method avoids the slowdowns seen in conventional decentralized approaches when the system scales up.
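
The toy simulation below captures the activation pattern described above in simplified form; it is not the authors' implementation, and the quadratic objectives merely stand in for heterogeneous local data.

```python
# Toy simulation of subset activation: each step, a random subset of nodes
# fetches a model from a previously active node, takes a local SGD step, and
# the temporary group averages its models.
import random
import numpy as np

def teleportation_style_round(models, prev_active, num_active, lr, grad_fn):
    active = random.sample(range(len(models)), num_active)
    for i in active:
        fetched = models[random.choice(prev_active)].copy()    # pull a recent model
        models[i] = fetched - lr * grad_fn(fetched, node=i)    # local SGD step
    group_avg = np.mean([models[i] for i in active], axis=0)   # small temporary group
    for i in active:
        models[i] = group_avg.copy()
    return active

# Toy usage: quadratic objectives with node-specific optima (heterogeneous data).
targets = [np.random.randn(5) for _ in range(16)]
grad_fn = lambda w, node: w - targets[node]
models = [np.zeros(5) for _ in range(16)]
active = list(range(16))
for _ in range(50):
    active = teleportation_style_round(models, active, num_active=4, lr=0.5, grad_fn=grad_fn)
```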

The paper also introduces a technique to automatically find the ideal number of active nodes for a given setup, reducing the time and effort needed to tune this parameter. Theoretical analysis and experiments confirm that TELEPORTATION not only avoids performance deterioration as the number of nodes increases, but also speeds up convergence and improves accuracy—especially in situations where the data across nodes is not evenly distributed.

This research benefits society by enabling more scalable and efficient distributed learning systems. These systems are especially useful in environments where data is spread across many devices or locations—such as in edge computing, healthcare, and collaborative scientific research—helping to reduce both energy usage and communication overhead without compromising performance.

This research explores how to improve the efficiency of decentralized machine learning when devices can only exchange limited, compressed information. In real-world distributed systems—such as mobile networks or edge computing—communication between nodes can be slow and costly. To address this, algorithms often compress data sent between devices. However, this can reduce training quality and cause divergence, especially when data is unevenly distributed among devices or when only small training batches are available.

We introduce a new algorithm, MoTEF (Momentum Tracking with Error Feedback), designed to handle these challenges. Unlike earlier methods, MoTEF works well even when data varies widely across devices and when only small batches are used. It avoids unrealistic assumptions and achieves what's known as linear speed-up, meaning adding more devices actually leads to faster training—an important goal in decentralized optimization.

MoTEF combines two key strategies: momentum tracking (a way to reduce gradient noise) and a variant of error feedback with gradient-difference compression (a way to handle heterogeneous data). The paper also presents a second version, MoTEF-VR, that further improves performance by reducing the variance in gradient estimates—a common issue in stochastic optimization.
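
The sketch below shows the two ingredients in isolation rather than the full MoTEF update: a momentum estimate of the local gradient, and error feedback in which only a compressed version of the change in that estimate is transmitted while the compression error is carried over to the next round.

```python
# Sketch of momentum + error feedback with top-k compression of differences;
# illustrative only, not the complete MoTEF algorithm.
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

class CompressedNodeState:
    def __init__(self, dim: int, beta: float = 0.9, k: int = 10):
        self.momentum = np.zeros(dim)   # local momentum estimate
        self.sent = np.zeros(dim)       # what neighbors currently believe we hold
        self.error = np.zeros(dim)      # accumulated compression error
        self.beta, self.k = beta, k

    def message(self, grad: np.ndarray) -> np.ndarray:
        self.momentum = self.beta * self.momentum + (1 - self.beta) * grad
        delta = self.momentum - self.sent + self.error   # difference + error feedback
        compressed = top_k(delta, self.k)
        self.error = delta - compressed                  # keep what was dropped
        self.sent += compressed
        return compressed                                # the only thing transmitted
```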

Through both mathematical analysis and experiments, the study shows that MoTEF performs better than previous methods in terms of speed and accuracy. It is also robust across different network setups and data distributions.

This work contributes to the development of scalable, resource-efficient machine learning systems. It can help make technologies like federated learning and decentralized AI more practical, especially in settings with limited communication capacity, such as mobile devices or privacy-sensitive applications.

As advancements in large language models (LLMs) continue and the demand for personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) will become essential due to their efficiency in reducing computation costs. However, recent studies have raised alarming concerns that LoRA fine-tuning could potentially compromise the safety alignment in LLMs, posing significant risks for the model owner. In this paper, we first investigate the underlying mechanism by analyzing the changes in safety-alignment-related features before and after fine-tuning. Then, we propose a fixed safety module calculated from safety data and a task-specific initialization for trainable parameters in low-rank adaptations, termed Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapter-based approaches across different evaluation metrics and fine-tuning tasks.

This research introduces σ-Zero, a new method for crafting ℓ₀-norm adversarial examples, which are carefully manipulated inputs designed to fool machine learning models by changing as few features as possible. Most existing attacks in this area focus on other norms (like ℓ₂ or ℓ∞), because ℓ₀ attacks—those that make only sparse changes—are more difficult to compute due to their non-convex, non-differentiable nature.

We solve this challenge by proposing a differentiable approximation of the ℓ₀ norm, enabling the use of gradient-based optimization methods. We also develop an adaptive projection operator that adjusts the sparsity of the adversarial example as the optimization proceeds, ensuring the changes remain both small in number and effective in causing misclassification.
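
The sketch below shows the two building blocks in simplified form: a smooth surrogate for the number of changed entries, and a projection that keeps only the largest-magnitude perturbation components. The exact surrogate and projection used by σ-Zero may differ; names and constants here are illustrative.

```python
# Illustrative building blocks: a differentiable l0 surrogate and a top-k
# sparsity projection on the perturbation.
import torch

def smooth_l0(delta: torch.Tensor, sigma: float = 1e-3) -> torch.Tensor:
    # Each coordinate contributes ~1 when far from 0 and ~0 near 0,
    # so the sum approximates the number of non-zero entries.
    return (delta ** 2 / (delta ** 2 + sigma)).sum()

def sparsity_projection(delta: torch.Tensor, keep: int) -> torch.Tensor:
    # Keep only the `keep` largest-magnitude perturbation entries.
    flat = delta.flatten()
    threshold = flat.abs().topk(keep).values.min()
    return torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))

delta = torch.randn(3, 32, 32, requires_grad=True)
loss = smooth_l0(delta)          # would be combined with a misclassification loss
loss.backward()
projected = sparsity_projection(delta.detach(), keep=50)
```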

Through extensive experiments on standard image datasets like MNIST, CIFAR-10, and ImageNet, σ-Zero consistently outperforms existing state-of-the-art sparse attacks in terms of success rate, minimal perturbation size, and computational efficiency. It achieves this without requiring elaborate hyperparameter tuning or prior adversarial examples to start from. It also scales well to large datasets and models, something other sparse attacks struggle with.

This work contributes significantly to the field of adversarial robustness evaluation, offering a more reliable and efficient method for identifying vulnerabilities in deep learning models. Understanding and measuring how models respond to minimal changes is essential for improving their security, especially in high-stakes areas like medical diagnostics, autonomous vehicles, or biometric authentication systems.