
10 CISPA Papers at ICML 2025

The International Conference on Machine Learning (ICML) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence known as machine learning.

ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics.

ICML is one of the fastest growing artificial intelligence conferences in the world. Participants at ICML span a wide range of backgrounds, from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.

Researchers behind Stealix have explored a new method to replicate machine learning models—commonly known as model stealing—without needing detailed knowledge about the models or their training data. This type of attack targets so-called “black-box” models, which are accessible only via inputs and outputs, not their internal workings. Traditionally, imitating such models requires either large image datasets or well-crafted text prompts. Stealix, however, introduces a way to automatically generate suitable prompts and image data without prior expertise.

The system uses open-source generative AI models to synthesize images with which a copy of the target model can be trained to closely match its behavior. It begins with just a single real example per class and gradually refines the prompts through an evolutionary algorithm, guided by how well the synthetic images are classified by the target model. The method requires no insight into the model architecture and no access to the training data. This approach proved effective across different image datasets and outperformed previous methods under similar constraints.
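To give a sense of how such an evolutionary refinement loop can look in code, here is a minimal Python sketch. The image generator and the black-box target model are stand-in stubs, and all names and parameters (generate_image, query_target, the population settings) are illustrative assumptions rather than the Stealix implementation:

```python
# Minimal sketch of an evolutionary prompt-refinement loop as described above.
# The generator and the black-box target model are stand-in stubs; all names
# and settings are illustrative, not the authors' code.
import random

VOCAB = ["photo", "close-up", "outdoor", "studio", "bright", "blurry",
         "a dog", "a cat", "an animal", "on grass", "on a sofa"]

def generate_image(prompt: str) -> str:
    """Stub for an open-source text-to-image model."""
    return f"image({prompt})"

def query_target(image: str, target_class: str) -> bool:
    """Stub for the black-box victim model: True if the image lands in the
    target class. Here: a noisy keyword check, purely for illustration."""
    return (target_class.split()[-1] in image) and random.random() > 0.2

def fitness(prompt: str, target_class: str, n_images: int = 8) -> float:
    """Fraction of generated images that the target model assigns to the class."""
    hits = sum(query_target(generate_image(prompt), target_class) for _ in range(n_images))
    return hits / n_images

def mutate(prompt: str) -> str:
    words = prompt.split(", ")
    if random.random() < 0.5 and len(words) > 1:
        words.pop(random.randrange(len(words)))   # drop a token
    else:
        words.append(random.choice(VOCAB))        # add a token
    return ", ".join(words)

def evolve(seed_prompt: str, target_class: str, pop_size=10, generations=20):
    population = [seed_prompt] + [mutate(seed_prompt) for _ in range(pop_size - 1)]
    for _ in range(generations):
        scored = sorted(population, key=lambda p: fitness(p, target_class), reverse=True)
        parents = scored[: pop_size // 2]          # keep the best-scoring prompts
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return population[0]

print(evolve("a photo of a dog", target_class="a dog"))
```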

Still, the researchers stress that Stealix operates under controlled conditions and uses publicly available tools. Its performance depends on several factors, such as the generative model's quality and the image domain. While the technique improves the feasibility of model replication in restricted settings, it does not yet represent a universal or foolproof attack method.

From a societal perspective, this research underscores the growing risks posed by open-source AI models when used without safeguards. By realistically simulating potential misuse, the study provides valuable insights for developing better defense mechanisms. The goal is not to enable attacks, but to inform developers, researchers, and policymakers about the vulnerabilities of machine learning systems—and to help design models that are more robust and secure.

The researchers in this paper aim to improve the reliability of explanations produced by artificial intelligence (AI) systems, especially in high-stakes areas like healthcare and autonomous driving. They focus on "attribution methods," which highlight the parts of an image (typically pixels) that influence an AI model’s decision. However, many existing methods are fragile—small, invisible changes to the image can significantly alter the explanation without affecting the AI’s actual prediction. This undermines trust in these tools.

To address this, the authors propose a new approach that certifies which pixels in an attribution map are robust to small input changes. Their method works for any attribution technique and does not require access to the inner workings of the AI model. It uses a statistical method called “randomized smoothing” to guarantee that certain pixels remain reliably influential, even when the input is slightly altered.
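The core certification idea can be sketched as follows. The attribution function is a stub, and the Hoeffding-style confidence bound is an assumption chosen for brevity, not necessarily the bound used in the paper:

```python
# Illustrative sketch: certify attribution pixels under randomized smoothing.
# The attribution method is a stub; a pixel is "certified" if, with high
# confidence, it stays in the attribution top-k under Gaussian input noise.
import numpy as np

rng = np.random.default_rng(0)

def attribution_map(image: np.ndarray) -> np.ndarray:
    """Stub for any attribution method (e.g., a saliency map)."""
    return image ** 2   # placeholder: 'importance' grows with pixel magnitude

def certified_topk_pixels(image, k=10, sigma=0.25, n_samples=500, alpha=0.01, threshold=0.5):
    counts = np.zeros(image.size)
    for _ in range(n_samples):
        noisy = image + sigma * rng.standard_normal(image.shape)
        topk = np.argsort(attribution_map(noisy).ravel())[-k:]
        counts[topk] += 1
    p_hat = counts / n_samples
    # Hoeffding lower confidence bound on the per-pixel top-k frequency.
    p_lower = p_hat - np.sqrt(np.log(1 / alpha) / (2 * n_samples))
    return np.where(p_lower > threshold)[0]   # indices of certified pixels

image = rng.standard_normal((8, 8))
print(certified_topk_pixels(image))
```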

They test their framework on 12 popular attribution methods across five well-known AI models, including both convolutional and transformer-based architectures. The team introduces three evaluation metrics to assess robustness, localization (whether the highlighted pixels align with the object of interest), and faithfulness (how much removing important pixels affects the model’s confidence). Their findings show that certain methods, especially LRP and RISE, consistently provide robust and informative explanations.

This research offers a practical tool for improving the transparency of AI systems. By certifying that some explanations are stable and meaningful, it supports more trustworthy AI use in areas where decisions must be interpretable. While not solving all challenges in explainability, the work marks a step forward in making model explanations more dependable and easier to assess systematically.

Image autoregressive models (IARs) have recently emerged as a strong alternative to diffusion models (DMs) for generating high-quality images. While IARs are faster and often more efficient in generating images, their privacy implications had not been systematically analyzed. In this study, the researchers sought to close that gap by rigorously examining whether IARs expose more training data than DMs.

To do this, they developed new privacy attack techniques—especially membership inference attacks (MIAs)—tailored specifically to the characteristics of IARs. These methods attempt to determine whether a particular image was part of a model’s training data. The results were clear: IARs were significantly more vulnerable. In some configurations, the attacks could detect training images with a success rate of over 86%, whereas comparable attacks on DMs barely reached 6%.
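As a rough illustration of the general principle behind membership inference (not the IAR-specific attacks developed in the paper), a simple score-threshold attack can be sketched like this, with simulated loss values standing in for real model outputs:

```python
# Generic membership inference via a score threshold: predict "member" when the
# model's loss on an image is unusually low. Scores below are simulated stubs.
import numpy as np

rng = np.random.default_rng(1)

# Simulated per-image scores (e.g., negative log-likelihood); members tend lower.
member_scores = rng.normal(loc=2.0, scale=0.5, size=1000)
nonmember_scores = rng.normal(loc=2.8, scale=0.5, size=1000)

def attack(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag an image as a training member if its score falls below the threshold."""
    return scores < threshold

threshold = 2.4
tpr = attack(member_scores, threshold).mean()       # true positive rate
fpr = attack(nonmember_scores, threshold).mean()    # false positive rate
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```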

The researchers went further. They showed that it was possible to infer whether a dataset was used to train an IAR model using only a handful of image samples—a process called dataset inference. They also demonstrated that IARs could even reproduce exact training images when prompted with partial inputs, recovering hundreds of original images from some models.

These findings suggest a trade-off: while IARs excel in image generation, they are more likely to unintentionally leak private information from their training data. This is particularly important when the training data contains sensitive or copyrighted material.

From a societal perspective, this research highlights a potential risk in the widespread use of advanced generative models. It emphasizes the need for privacy-aware model design and clearer policies on the use of training data. Ensuring that generative AI does not inadvertently reveal private or proprietary content is essential for building trustworthy and ethical AI systems.

Researchers from CISPA Helmholtz Center and Carnegie Mellon University have developed a new method that helps identify whether a text dataset has been used to train a large language model (LLM)—a crucial step for enforcing copyright protection in AI.

The core idea is based on “Dataset Inference” (DI), a technique that allows data owners to check if their content was used in training without their consent. However, traditional DI methods depend on having a separate, similar dataset (called a “held-out” set) that is known not to have been used in training. In practice, such held-out data is often unavailable or unreliable, making DI less useful in real-world scenarios.

To solve this, the researchers propose generating this held-out data synthetically. First, they train a text generator on the suspect dataset using a carefully designed task where the model completes text fragments. This results in synthetic data that closely mimics the original style and content. Since even small differences between real and synthetic data can lead to incorrect conclusions, the team also introduces a calibration step. It compares how a model responds to real and synthetic data and carefully separates differences due to membership (i.e., whether data was used in training) from those caused by generation artifacts.
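A schematic sketch of that decision procedure might look as follows. All scores are simulated stand-ins, and the calibration and statistical test shown here are simplified assumptions rather than the paper's exact procedure:

```python
# Schematic dataset-inference decision with a synthetic held-out set.
# Scores are simulated; in practice they would come from the suspect LLM and a
# reference model, and the calibration/test would follow the paper, not this stub.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500

# Per-sample scores under the suspect model (e.g., log-perplexity).
real_suspect = rng.normal(3.0, 0.4, n)    # real data, possibly memorized -> lower
synth_suspect = rng.normal(3.3, 0.4, n)   # synthetic held-out data

# Calibration: scores under an independent reference model absorb systematic
# real-vs-synthetic differences caused by generation artifacts, not membership.
real_reference = rng.normal(3.2, 0.4, n)
synth_reference = rng.normal(3.3, 0.4, n)

real_calibrated = real_suspect - real_reference
synth_calibrated = synth_suspect - synth_reference

# One-sided test: were real samples scored suspiciously better than synthetic ones?
t_stat, p_value = stats.ttest_ind(real_calibrated, synth_calibrated)
p_one_sided = p_value / 2 if t_stat < 0 else 1 - p_value / 2
print("dataset likely used in training" if p_one_sided < 0.01 else "no evidence")
```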

In extensive experiments on various datasets, including blog posts and sections of the Pile dataset, the method achieved high accuracy in detecting whether a dataset had been used in training—while keeping false positives low.

This research offers a practical and robust tool for data creators who want to verify the use of their content in AI training. It strengthens the possibility of defending intellectual property rights in the context of modern AI, without making unrealistic assumptions about data availability.

Researchers from CISPA have developed a method to adapt large language models (LLMs) to private tasks without compromising user data or requiring access to the full model. This method, called POST (Privacy Of Soft-prompt Transfer), addresses a key challenge in the use of LLMs: customizing models to specific use cases while preserving privacy and reducing computational cost.

Normally, adapting an LLM to a specific task using “soft prompts” requires access to both the model and the user's private data at the same time—a setup that often violates privacy or is technically impractical. POST overcomes this by introducing a multi-step process: First, the LLM provider creates a smaller, simplified version of their model using a method called knowledge distillation. This smaller model is then sent to the user, who tunes a soft prompt locally on their private data. To ensure that the prompt cannot leak sensitive information, users can apply differential privacy techniques during this tuning.

Once the soft prompt is trained, it is sent back to the LLM provider, who uses a public dataset to transfer the prompt to their full model without accessing the private data itself. Experiments with various LLMs and tasks show that this approach not only improves performance on specific tasks but also significantly reduces the amount of computational power needed by users.
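The overall protocol can be summarized in a short, purely illustrative sketch in which every function is a placeholder for one of the steps described above (none of these names come from the POST code):

```python
# High-level outline of the POST workflow with stand-in functions; every name
# here is a placeholder for a step described above, not the authors' API.

def distill_small_model(large_model):
    """Provider side: compress the LLM into a smaller proxy via knowledge distillation."""
    return {"proxy_of": large_model}

def tune_soft_prompt(proxy_model, private_data, dp_noise=1.0):
    """User side: optimize a soft prompt locally; dp_noise stands in for
    differentially private training (e.g., clipped, noised gradients)."""
    return {"prompt_for": proxy_model["proxy_of"], "dp_noise": dp_noise}

def transfer_prompt(soft_prompt, large_model, public_data):
    """Provider side: map the prompt to the full model using only public data."""
    return {"transferred_to": large_model, "from": soft_prompt, "using": public_data}

# The three-party protocol, end to end:
large_model = "provider-LLM"
proxy = distill_small_model(large_model)                     # 1. provider distills
prompt = tune_soft_prompt(proxy, private_data="user-data")   # 2. user tunes privately
final = transfer_prompt(prompt, large_model, public_data="public-corpus")  # 3. provider transfers
print(final)
```

At no point in this flow does the provider see the private data, and at no point does the user need to run the full model, which is the division of labor the paragraph above describes.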

From a societal perspective, this work contributes a practical and privacy-preserving way for individuals or organizations to benefit from LLMs without having to share their sensitive data or invest in high-end computing infrastructure. It supports fairer access to powerful AI tools while respecting the privacy of users and the intellectual property of model providers.

In this study, the researchers investigate how decentralized optimization—where many networked devices collaboratively train machine learning models—can be made more efficient, especially in terms of reducing communication and computation costs. This is important because exchanging information between devices is often slower and more costly than local processing, especially in settings like federated learning or sensor networks.

Existing approaches in decentralized optimization sometimes rely on complex calculations that each device must perform locally, or they require a high degree of accuracy when solving intermediate mathematical problems. These factors can make such methods impractical in real-world scenarios with limited computing power or constrained network bandwidth.

To address these challenges, the researchers introduce a new method called Stabilized Proximal Decentralized Optimization (SPDO). SPDO is designed to balance two key demands: reducing how often devices need to communicate with one another and limiting the amount of computation each device must perform. The method builds on existing techniques but refines them in two significant ways. First, it relaxes the need for precise local calculations. Second, it takes advantage of the fact that many devices often work with similar data. By exploiting this similarity, SPDO is able to coordinate learning more efficiently.
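To illustrate the general ingredients (a few inexact local steps on a proximal subproblem, followed by averaging with network neighbors), here is a simplified decentralized proximal-gradient sketch on toy quadratic losses. It is a generic illustration under assumed settings, not the SPDO algorithm itself:

```python
# Generic decentralized sketch: each device takes a few inexact local steps on a
# proximal subproblem, then gossip-averages with its ring neighbours. This is a
# simplified stand-in for the ideas above, not SPDO.
import numpy as np

rng = np.random.default_rng(3)
n_devices, dim = 5, 3

# Similar local quadratic losses f_i(x) = 0.5*||A_i x - b_i||^2 (similar data).
A = [np.eye(dim) + 0.1 * rng.standard_normal((dim, dim)) for _ in range(n_devices)]
b = [np.ones(dim) + 0.1 * rng.standard_normal(dim) for _ in range(n_devices)]

# Doubly stochastic mixing matrix for a ring network (gossip averaging).
W = np.zeros((n_devices, n_devices))
for i in range(n_devices):
    W[i, i] = 0.5
    W[i, (i - 1) % n_devices] = 0.25
    W[i, (i + 1) % n_devices] = 0.25

x = np.zeros((n_devices, dim))
step, prox_weight, local_steps = 0.1, 1.0, 3   # few, deliberately inexact local steps

for communication_round in range(50):
    anchor = x.copy()                          # proximal centre for this round
    for _ in range(local_steps):               # inexact local solve
        for i in range(n_devices):
            grad = A[i].T @ (A[i] @ x[i] - b[i]) + prox_weight * (x[i] - anchor[i])
            x[i] -= step * grad
    x = W @ x                                  # one round of neighbour communication

print("consensus solution:", x.mean(axis=0))
```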

The researchers also present an advanced version called Accelerated-SPDO, which further improves performance. Through theoretical analysis and experiments, they show that their methods match or outperform existing techniques, both in terms of speed and resource use.

From a societal perspective, this research helps make decentralized AI systems more practical, scalable, and accessible. It supports efforts to protect user privacy—since data stays local—and reduces the energy costs associated with training models across distributed networks. The methods may be particularly valuable for applications in healthcare, environmental monitoring, or personalized services where privacy and efficiency are critical.

Backdoor attacks on pre-trained language models (PTLMs) are a known threat, where malicious actors manipulate models so they behave normally under typical use but misbehave when triggered by specific inputs. In this study, the researchers focused on a less explored consequence of such attacks: how these manipulated models can unintentionally behave abnormally even in unrelated downstream tasks. They refer to this issue as “backdoor complications.”

To examine this phenomenon, the researchers conducted extensive experiments using popular language models such as BERT, GPT-2, and T5. They showed that models fine-tuned from a backdoored PTLM often show unusual and inconsistent behavior—even when used for tasks the attacker did not specifically target. For example, when a trigger word like “Trump” is inserted into a sentence, a model trained for topic classification might misclassify nearly all inputs into the same category, such as "sports." These skewed outputs are clear signs that the backdoor unintentionally affects unrelated tasks.

To address this, the team developed a mitigation method inspired by multi-task learning. Their approach involves training the PTLM on several unrelated tasks alongside the intended backdoor task. This reduces the chances that the backdoor causes complications in other tasks, without weakening the attack itself. Importantly, the attacker does not need to know what the downstream task will be—making this method practical for real-world scenarios.
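The multi-task idea can be illustrated with a toy PyTorch sketch in which a shared encoder is trained on one task alongside unrelated auxiliary tasks; the model, data, and task names are synthetic assumptions, not the paper's setup:

```python
# Toy multi-task fine-tuning sketch: a shared encoder trained jointly on the
# attacker's task and unrelated auxiliary tasks, so that side effects on other
# tasks are reduced. Data, model, and task names are synthetic stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 100, 32

encoder = nn.Sequential(nn.EmbeddingBag(vocab, dim), nn.ReLU())
heads = nn.ModuleDict({
    "attacker_task": nn.Linear(dim, 2),    # task carrying the intended behaviour
    "aux_topic": nn.Linear(dim, 4),        # unrelated auxiliary tasks
    "aux_sentiment": nn.Linear(dim, 2),
})
params = list(encoder.parameters()) + list(heads.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fake_batch(n_classes, batch=16, seq=10):
    x = torch.randint(0, vocab, (batch, seq))       # random token ids
    y = torch.randint(0, n_classes, (batch,))       # random labels
    return x, y

for _ in range(100):
    opt.zero_grad()
    total = 0.0
    # Mixing losses from all tasks regularizes the shared representation.
    for name, head in heads.items():
        x, y = fake_batch(head.out_features)
        total = total + loss_fn(head(encoder(x)), y)
    total.backward()
    opt.step()
```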

This research contributes to a more nuanced understanding of the risks involved in using PTLMs from untrusted sources. It shows that backdoors can have effects beyond their intended target, which may actually make them more detectable. Highlighting these unintended side effects adds an important perspective to the debate on AI model security and encourages better safeguards in AI deployment.

In this study, we examined how explicit regularization—such as weight decay—interacts with the so-called implicit bias in machine learning models. Implicit bias refers to the natural tendency of optimization algorithms to prefer certain types of solutions, even without being explicitly told to do so. While this bias often leads to models that generalize well, it is not always predictable or easily controlled. On the other hand, explicit regularization adds a clear penalty during training, encouraging simpler or sparser models. These two forces typically act together, but their joint effects are not yet fully understood.

We analyzed this interaction within a mathematical framework known as mirror flow, which helps describe how model parameters evolve during training. We found that explicit regularization has a lasting influence on the so-called geometry of learning—shaping the kinds of solutions a model will ultimately favor. Specifically, we identified three main effects: it can shift where the learning algorithm looks for solutions (positional bias), change the kind of simplicity it prefers (type of bias), and narrow the range of possible outcomes (range shrinking). Our theoretical analysis was supported by experiments on various learning tasks, including matrix recovery, transformer attention mechanisms, and fine-tuning large language models with a technique called LoRA.
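Schematically, and using assumed notation rather than the paper's, the setting can be written as a mirror flow with and without an explicit penalty term:

```latex
% Schematic: mirror flow with mirror map \phi, training loss L, and an explicit
% regularizer R (e.g., weight decay R(w) = \tfrac{1}{2}\|w\|_2^2) of strength \lambda_t.
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\nabla\phi(w_t) = -\nabla L(w_t)
  \qquad \text{(no explicit penalty)}
\]
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\nabla\phi(w_t) = -\nabla L(w_t) - \lambda_t\,\nabla R(w_t)
  \qquad \text{(with explicit penalty)}
\]
```

In this picture, switching regularization off during training corresponds to setting the strength to zero after some time, which connects to the practical insight discussed next.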

An important practical insight is that switching off regularization during training can sometimes improve generalization. This happens because the regularization alters the learning trajectory early on and then lets the model explore a beneficial space more freely later. Overall, our work helps clarify how to use regularization more effectively and flexibly.

From a societal standpoint, these findings contribute to making machine learning systems more reliable and adaptable. By better understanding how models learn and generalize, we can improve performance in settings such as medical diagnosis, language modeling, or climate forecasting—without needing to significantly increase data or computational demands.

In this work, we investigated new ways to represent and compare probability distributions using kernel methods, a class of techniques common in modern statistics and machine learning. The standard method, known as kernel mean embedding (KME), represents a distribution by its average value in a special mathematical space called a reproducing kernel Hilbert space (RKHS). While effective, this method represents a distribution using only the mean function in RKHS, neglecting higher-order statistics.

To address this, we proposed an alternative approach based on *quantiles*—values that indicate the position of data points within a distribution. We introduced *kernel quantile embeddings* (KQEs), which capture distributional information by considering directional quantiles in the RKHS. From this, we developed a new family of distance measures called *kernel quantile discrepancies* (KQDs) that offer more nuanced comparisons between distributions.
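In assumed notation, and only as a plausible schematic (the paper's exact definitions may differ), the contrast between the two kinds of embedding can be written as follows:

```latex
% Assumed notation for illustration. Kernel mean embedding of P with kernel k:
\[
  \mu_P \;=\; \mathbb{E}_{X \sim P}\bigl[k(X,\cdot)\bigr] \;\in\; \mathcal{H}.
\]
% A directional kernel quantile: the \tau-quantile of the embedded samples
% projected onto a unit direction u in the RKHS,
\[
  q_\tau(P; u) \;=\; \inf\Bigl\{ t \in \mathbb{R} \;:\;
    \Pr_{X \sim P}\bigl(\langle k(X,\cdot),\, u\rangle_{\mathcal{H}} \le t\bigr) \ge \tau \Bigr\}.
\]
% A kernel quantile discrepancy then aggregates quantile differences over
% directions u and levels \tau, for example
\[
  \mathrm{KQD}_p(P, Q) \;=\;
    \Bigl( \mathbb{E}_{u,\tau}\,\bigl| q_\tau(P; u) - q_\tau(Q; u) \bigr|^{p} \Bigr)^{1/p}.
\]
```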

Our theoretical analysis showed that KQEs can uniquely represent a distribution under milder conditions than those required for KMEs, making them more broadly applicable. We also demonstrated that our methods are computationally efficient and scale well with large datasets. In empirical tests, including challenging real-world tasks like distinguishing between nearly identical image datasets, our quantile-based methods performed on par with or better than established approaches, especially when computational resources were limited.

From a societal perspective, this research provides tools for more accurate and efficient statistical comparison, which is central to applications ranging from scientific discovery to machine learning model evaluation. Our methods can support better decision-making in scenarios where understanding differences between data distributions is crucial, such as in medical diagnostics, environmental monitoring, or fairness auditing in AI systems.

In our research, we addressed a challenge in distributed game-theoretic optimization: how multiple players (or agents) can find optimal strategies when communication between them is limited and potentially unreliable. This situation arises in many real-world settings—for example, when competing companies must make decisions based on incomplete or outdated information about their rivals’ strategies, or when robots must coordinate actions with only occasional data exchanges due to energy or bandwidth constraints.

To tackle this, we developed Decoupled SGDA, an adaptation of a well-established algorithm used to solve minimax optimization problems, which are common in adversarial machine learning and game theory. Our method allows each player to perform several updates based on old information from other players before synchronizing their strategies. This stands in contrast to traditional approaches that require constant back-and-forth communication, which is often impractical.
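The decoupled update pattern can be illustrated on a toy two-player quadratic minimax game; the step sizes, coupling strength, and noise level below are illustrative assumptions:

```python
# Minimal sketch of the decoupled update pattern on a toy quadratic minimax game
#   f(x, y) = 0.5*a*x**2 + c*x*y - 0.5*b*y**2,
# where x descends and y ascends. Each player takes K local (stochastic) steps
# against the opponent's last synchronized strategy; values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
a, b, c = 1.0, 1.0, 0.2          # weak coupling when |c| is small
x, y = 3.0, -2.0
step, K, noise = 0.1, 5, 0.01    # K local steps between communication rounds

for communication_round in range(100):
    x_stale, y_stale = x, y       # strategies exchanged at synchronization time
    for _ in range(K):            # x-player: local steps against the stale y
        grad_x = a * x + c * y_stale + noise * rng.standard_normal()
        x -= step * grad_x
    for _ in range(K):            # y-player: local steps against the stale x
        grad_y = c * x_stale - b * y + noise * rng.standard_normal()
        y += step * grad_y

print(f"approximate equilibrium: x={x:.3f}, y={y:.3f}")  # exact solution is (0, 0)
```

In this toy setting, with weak coupling (small c) the iterates still approach the equilibrium even though each player only sees the other's strategy at synchronization times, which is the communication saving the paragraph above describes.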

We analyzed the convergence behavior of our method under various conditions and found that it performs especially well when the players' objectives are only weakly connected—a situation we call a "weakly coupled game". In such cases, Decoupled SGDA can achieve comparable results to existing methods but with significantly fewer communication rounds. Moreover, we showed that it remains robust in scenarios where players receive noisy feedback, outperforming other techniques that break down under such imbalanced noise conditions.

Our experiments—spanning theoretical models, non-convex games, and practical tasks like training generative adversarial networks (GANs)—confirmed these findings and demonstrated that our approach effectively reduces communication without sacrificing performance.

This work contributes to society by offering a more communication-efficient and resilient method for distributed decision-making. It is particularly relevant for modern applications like federated learning, decentralized AI systems, and autonomous agents operating with limited connectivity—scenarios increasingly common in both industry and research.