
17 CISPA papers at NeurIPS 2025

The Conference on Neural Information Processing Systems, commonly known as NeurIPS, is one of the most prestigious and influential conferences in the fields of artificial intelligence (AI), machine learning (ML), and data science.

The researchers examined how large language models behave when two different tasks are interwoven word by word instead of being asked one after the other. They found that models can usually still solve at least one of the tasks, showing that they cope better with such mixed inputs than expected.

More importantly, the study reveals a safety weakness. If a harmful request is hidden within an interleaved prompt, moderation systems are less likely to detect it. This makes it easier for harmful content to slip past guardrails.

Building on this insight, the researchers developed JAIL-CON, an automated method that repeatedly combines a harmful question with harmless ones until the model produces a harmful answer. Across several well-known language models, JAIL-CON achieved higher jailbreak success rates than existing methods and generated outputs that were more difficult for safety filters to identify.
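
To make the interleaving idea concrete, the sketch below weaves two prompts together word by word, in the spirit of the inputs studied here; the function name and the two example questions are purely illustrative and do not reproduce JAIL-CON's attack pipeline.

```python
from itertools import zip_longest

def interleave_word_by_word(prompt_a: str, prompt_b: str) -> str:
    """Weave two prompts together one word at a time.

    Words from each prompt keep their relative order, so a model that can
    track both streams can in principle answer either question.
    """
    mixed = []
    for word_a, word_b in zip_longest(prompt_a.split(), prompt_b.split()):
        if word_a is not None:
            mixed.append(word_a)
        if word_b is not None:
            mixed.append(word_b)
    return " ".join(mixed)

# Two harmless questions, interleaved the way the study describes.
print(interleave_word_by_word(
    "What is the capital of France?",
    "How many legs does a spider have?",
))
```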

From the researchers’ perspective, these findings show that current safety mechanisms assume sequential input and therefore overlook a relevant vulnerability. Identifying this gap is important so that future moderation systems can better handle non-standard or intentionally manipulated prompts.

For society, the research highlights a real risk but ultimately serves to strengthen AI safety: by uncovering this blind spot early, it enables developers and policymakers to design more robust protection mechanisms against misuse.

The researchers investigate whether common data-augmentation techniques—typically used in machine learning to improve performance on standard prediction tasks—can also help when the goal is causal inference. In causal problems, we want to understand how changing one variable (the “treatment”) influences another (the “outcome”). This is often difficult because the two may be linked through hidden factors, creating biased estimates.

The central idea of the paper is that certain data-augmentation operations can be interpreted as if they were “soft interventions” on the treatment variable, provided that these transformations do not change the true outcome mechanism. If this condition is met, augmented data can mimic the effect of experimentally varying the treatment and thereby reduce bias caused by hidden confounding.

Building on this, the authors introduce the concept of “IV-like” variables. These resemble instrumental variables—auxiliary variables that traditionally allow unbiased causal estimation—but do not need to satisfy all the strict conditions that true instruments require. By combining data augmentation with a regularized form of instrumental-variable regression, the authors show that one can further reduce confounding bias and make predictions more reliable even when proper instruments are unavailable.
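
As a rough illustration of the instrumental-variable machinery the "IV-like" idea builds on, the sketch below runs a small regularized two-stage regression on synthetic data with a hidden confounder; the data-generating process and the ridge penalty are assumptions for illustration, not the estimator proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hidden confounder u affects both the treatment t and the outcome y.
u = rng.normal(size=n)
z = rng.normal(size=n)                                   # instrument-like variable
t = 0.8 * z + 0.6 * u + rng.normal(scale=0.5, size=n)
y = 1.5 * t + 0.9 * u + rng.normal(scale=0.5, size=n)    # true causal effect: 1.5

def ridge(x, y, lam):
    """Ridge regression of y on a single regressor x (returns intercept, slope)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Naive regression of y on t is biased upward by the hidden confounder.
naive_slope = ridge(t, y, lam=0.0)[1]

# Two-stage, regularized IV: predict t from z, then regress y on the prediction.
a, b = ridge(z, t, lam=1.0)
t_hat = a + b * z
iv_slope = ridge(t_hat, y, lam=1.0)[1]

print(f"naive estimate: {naive_slope:.2f}, IV-style estimate: {iv_slope:.2f}, truth: 1.5")
```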

The paper analyzes these ideas theoretically in simple linear models and demonstrates them in simulations and real datasets. Across experiments, outcome-invariant data augmentation tends not to harm causal estimation and often improves it. When combined with the proposed IV-like regression, performance can improve further, especially in situations where the augmentation happens to target parts of the data most affected by confounding.

From a societal perspective, this research offers a cautious but meaningful contribution: it suggests a way to make causal conclusions more robust in fields where controlled experiments or valid instrumental variables are hard to obtain. While it does not remove the need for domain knowledge or careful assumptions, it provides a practical tool that could improve decision-making in areas such as healthcare, economics, and scientific modeling.

The researchers present a new watermarking method called BitMark, designed specifically for modern image-generating autoregressive models. Their goal is to embed information into generated images in a way that is hard to erase, easy to detect, and does not noticeably reduce image quality. Existing watermarking approaches work mainly for language models or diffusion models, but they do not reliably transfer to autoregressive image generation. BitMark aims to close this gap.

BitMark embeds a watermark by slightly influencing the model’s bit-level predictions during image generation. The method uses two lists of bit patterns—one “green list” and one “red list.” During generation, the model is softly nudged to pick green-list patterns more often. Because these patterns appear in many small locations throughout the generated image, the watermark is spread across the entire image, making removal difficult. Detection is done statistically by counting how often green patterns occur and comparing this to what would be expected in a non-watermarked image. The authors show that this test remains reliable even under various image manipulations.
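
The statistical detection step can be illustrated with a short sketch: count how often "green" bit patterns occur in a bit stream and compare the count with what a non-watermarked stream would produce. The pattern length, green-list size, and decision threshold below are illustrative choices, not BitMark's actual parameters.

```python
import math
import random

def detect_watermark(bits, green_set, pattern_len=4, z_threshold=4.0):
    """Count green bit patterns and z-test against the non-watermarked null.

    In a non-watermarked stream each pattern is green with probability
    len(green_set) / 2**pattern_len, so the green count is binomial.
    """
    patterns = [tuple(bits[i:i + pattern_len])
                for i in range(0, len(bits) - pattern_len + 1, pattern_len)]
    greens = sum(p in green_set for p in patterns)
    n, p0 = len(patterns), len(green_set) / 2 ** pattern_len
    z = (greens - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return z, z > z_threshold

random.seed(0)
all_patterns = [tuple((i >> k) & 1 for k in range(4)) for i in range(16)]
green_set = set(random.sample(all_patterns, 8))   # illustrative green list: half of all patterns

# A "watermarked" bit stream that is softly nudged toward green patterns.
stream = []
for _ in range(500):
    pool = list(green_set) if random.random() < 0.7 else all_patterns
    stream.extend(random.choice(pool))

print(detect_watermark(stream, green_set))   # large z-score -> watermark detected
```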

The experiments indicate that BitMark is robust to many common attacks, such as noise, compression, cropping, and more advanced attempts to modify image content. The authors also introduce a new, stronger attack—Bit-Flipper—and show that removing the watermark requires such heavy manipulation that the image quality becomes visibly degraded. They further demonstrate “radioactivity”: models trained on watermarked images inherit the watermark patterns and produce outputs where the watermark remains detectable.

From a societal perspective, this research contributes to more trustworthy image-generation systems. Reliable watermarking can support transparency about whether an image was produced by an AI model and help address concerns about misinformation or unclear content origins. The work offers a technical step forward without claiming that watermarking is a complete solution to these broader challenges.

The researchers address a common problem in data analysis: real-world datasets often contain hidden subgroups that follow different cause-and-effect relationships. If these differences are ignored, conclusions about what causes what can easily become misleading. To deal with this, the authors propose a new framework called causal mixture models, which assumes that each variable may be generated by one of several underlying mechanisms, depending on unobserved group membership.

In this framework, each variable still depends on its direct causes, but the exact form of this dependency can vary between hidden groups. The authors develop methods to infer not only the causal relationships between variables, but also the hidden groups and the way each group changes the causal mechanisms. Their approach integrates mixture modelling with established causal-discovery techniques, and it uses statistical criteria to decide how many hidden groups likely exist and which variables they influence.
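
A toy sketch of the underlying modelling idea, assuming for simplicity one cause, one outcome, two hidden groups, and linear mechanisms: an EM-style loop alternates between guessing group membership and refitting each group's mechanism. This is a generic mixture-of-regressions fit, not the authors' full causal-discovery algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Cause x, and an outcome y generated by one of two hidden linear mechanisms.
x = rng.normal(size=n)
group = rng.integers(0, 2, size=n)                       # unobserved membership
true_slopes = np.array([2.0, -1.0])
y = true_slopes[group] * x + rng.normal(scale=0.3, size=n)

# EM for a two-component mixture of linear regressions.
slopes, sigma, weights = np.array([0.5, -0.5]), 1.0, np.array([0.5, 0.5])
for _ in range(50):
    # E-step: responsibility of each mechanism for each data point.
    resid = y[:, None] - x[:, None] * slopes[None, :]
    log_lik = -0.5 * (resid / sigma) ** 2 + np.log(weights)
    log_lik -= log_lik.max(axis=1, keepdims=True)
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted least squares per mechanism, then update noise and weights.
    slopes = (resp * x[:, None] * y[:, None]).sum(0) / (resp * x[:, None] ** 2).sum(0)
    weights = resp.mean(0)
    resid = y[:, None] - x[:, None] * slopes[None, :]
    sigma = np.sqrt((resp * resid ** 2).sum() / n)

print("recovered slopes:", np.round(slopes, 2))   # close to [2.0, -1.0], up to ordering
```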

Through mathematical analysis, the researchers show that their method can, under reasonable assumptions, recover the correct causal structure even when hidden group differences are present. They also provide an algorithm to perform this joint discovery in practice. In extensive experiments on synthetic datasets, on data containing mixtures of experimental conditions, and on a real biological dataset, the approach generally identifies the hidden subgroups and recovers the underlying causal graph more accurately than existing methods. However, the researchers also note cases where limitations of the modelling assumptions—such as the assumption of linear relationships—restrict the method’s ability to detect the true group structure.

From a societal perspective, this work supports more reliable scientific analysis in fields where heterogeneous populations are common, such as medicine, biology, and the social sciences. More accurate identification of both causal relations and hidden subgroups can help prevent incorrect conclusions and encourage more precise, evidence-based decision-making.

The researchers investigate whether specific behaviors of large language models can be traced back to individual neurons inside the network, and whether changing those neurons can reliably modify the model’s behavior. Their goal is to understand models in a more targeted way and to see whether small, precise edits can replace broad and unpredictable methods that affect many parts of a model at once.

They study this in a controlled setting where the model must avoid repeating particular phrases that would reveal the training data used. To do this safely, they build a modified dataset and use an automated method to search for “critical neurons” that react strongly whenever the model is about to produce the unwanted text. Once they identify such neurons, they test whether adjusting their activity at the right time prevents the model from generating the sensitive phrases.
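
The kind of intervention described here can be sketched with a forward hook that records a chosen hidden unit and clamps it during the forward pass. The tiny stand-in network, layer, and neuron index below are placeholders, since the actual work targets neurons inside large pretrained language models.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-in network; in the paper the target is a layer inside a large LLM.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
critical_layer, critical_neuron = model[1], 7        # hypothetical "critical" unit

activations = {}

def record_and_suppress(module, inputs, output):
    """Log the critical neuron's activity and clamp it before it propagates."""
    activations["critical"] = output[:, critical_neuron].detach().clone()
    output = output.clone()
    output[:, critical_neuron] = 0.0                 # suppress the neuron
    return output

handle = critical_layer.register_forward_hook(record_and_suppress)
logits = model(torch.randn(2, 16))
handle.remove()

print("recorded activation:", activations["critical"])
print("output with the neuron suppressed:", logits)
```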

The results show that these neurons can be reliably found, even in very large models. More importantly, individually reactivating or suppressing just a few neurons is often enough to prevent the unwanted behavior without noticeably affecting other abilities. This suggests that some model behaviors—at least in narrowly defined tasks—are more localized than previously thought. The researchers also compare neuron-level interventions with existing editing methods and find that their approach is more targeted and causes fewer side effects. However, they emphasize that this does not mean all complex behaviors can be reduced to single neurons, nor that their method covers the full range of model safety or interpretability challenges.

From a societal perspective, this work provides a step toward more transparent and controllable AI systems. Understanding where specific behaviors originate and how to adjust them in a minimal way may help support safer and more predictable models. At the same time, the research shows the limits of such techniques and highlights the need for broader approaches to responsible model development.

This paper investigates why graph neural networks (GNNs) tend to memorize training data and under which conditions this memorization becomes stronger or weaker. The researchers approach the topic systematically by proposing a framework that allows them to analyze memorization across different kinds of graphs and tasks. Their work focuses on understanding the mechanisms that cause GNNs to overfit rather than on introducing a new model.

According to the authors, memorization in GNNs is influenced by several structural properties of graphs. One important factor is homophily, meaning how often connected nodes share the same label. When homophily is high, GNNs more easily infer labels from neighborhoods, which can contribute to memorizing specific training patterns. Another factor is label informativeness, which reflects how much knowing a neighbor’s label helps predict a node’s label; higher informativeness increases the risk of memorization. They also examine kernel alignment, a mathematical way of measuring how well the structure of the graph matches the ideal structure for the learning task. Finally, they identify feature-space label inconsistency—cases where nodes that look similar in terms of features actually belong to different classes—as a contributor to unstable learning and memorization.
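
One of these quantities, edge homophily, is simple enough to compute directly: it is the fraction of edges whose two endpoints carry the same label. The toy graph below is only an illustration.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

# Toy graph with six nodes and two classes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

print(f"edge homophily: {edge_homophily(edges, labels):.2f}")   # 5/6 ≈ 0.83
```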

The researchers support their claims by analyzing how these factors interact during the training of common GNN architectures, focusing especially on graph convolutional networks. They further discuss how memorization can relate to privacy risks, as models that memorize individual training instances may unintentionally reveal sensitive information.

Overall, the work offers a clearer understanding of when and why GNNs memorize data, without proposing easy fixes. The main contribution to society lies in improving transparency around GNN behavior. This foundation can help developers design models that generalize better and leak less information, ultimately contributing to safer and more reliable use of graph-based machine learning systems. 

The researchers present a method called GASP, which is designed to test how easily large language models can be tricked into producing harmful content. Their focus is on creating short text additions—called “adversarial suffixes”—that, when attached to a user’s prompt, persuade a model to ignore its built-in safety rules. Unlike many earlier methods, GASP works without accessing the internal workings of a model and aims to keep the resulting prompts readable and natural-sounding.

To achieve this, the researchers train a smaller model to generate potentially harmful suffixes and then refine these suffixes inside the model’s internal embedding space, where text is represented as continuous numerical vectors. This allows them to search more efficiently for suffixes that reliably trigger unsafe responses. They combine this with a statistical optimisation technique and an additional training step that adjusts the smaller model based on feedback from real model outputs. The system evaluates each attempt using a custom scoring process that distinguishes clearly harmful replies from harmless ones, including borderline cases where warnings and unsafe content appear together.

In their experiments, GASP succeeds in provoking harmful outputs from a wide range of open-source models and even from the most advanced commercial LLMs, often with fewer queries and more coherent prompts than previous attack methods. They also find that many common defences can still be bypassed. The method performs especially well when multiple attempts are allowed, and it maintains readability better than other automated approaches.

From the researchers’ perspective, the main contribution of this work is to provide a more efficient and realistic way to test the robustness of language-model safety mechanisms. For society, its value lies in helping developers and evaluators better understand where current safeguards fail, which can support the design of more resilient and trustworthy AI systems.

This paper is about improving how machine learning systems represent and measure uncertainty, especially in situations where the available information is incomplete or ambiguous. The researchers focus on “imprecise probabilities,” which describe uncertainty not as a single number but as a range of possible values. This approach aims to better reflect the limits of knowledge that arise in many real-world problems.

The researchers develop a mathematical framework called Integral Imprecise Probability Metrics, which makes it possible to compare and analyze these imprecise models in a principled way. To do so, they extend standard tools for comparing probability distributions by using a type of integral that can represent ambiguity more faithfully. They show that their framework has desirable properties, such as behaving like a proper distance measure in many cases and capturing meaningful differences between uncertainty models.

Building on this, the researchers introduce a new measure of epistemic uncertainty—uncertainty caused by lack of knowledge—called Maximum Mean Imprecision. This measure compares “optimistic” and “pessimistic” versions of a model’s predictions to quantify how much the model does not know. They demonstrate that Maximum Mean Imprecision satisfies several important logical requirements identified in earlier research. Their experiments show that the measure performs reliably on classification tasks and remains computationally manageable even when the number of classes is large.
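
As a simplified illustration of the "optimistic versus pessimistic" idea, the sketch below represents an imprecise prediction as a small set of plausible class distributions and reads off its lower and upper envelopes, using their gap as a crude stand-in for epistemic uncertainty; it does not implement the paper's Maximum Mean Imprecision measure.

```python
import numpy as np

# An imprecise prediction for one input: several plausible class distributions
# (for example, from an ensemble or from interval-valued expert knowledge).
credal_set = np.array([
    [0.70, 0.20, 0.10],
    [0.55, 0.30, 0.15],
    [0.60, 0.10, 0.30],
])

lower = credal_set.min(axis=0)   # pessimistic (lower) probability per class
upper = credal_set.max(axis=0)   # optimistic (upper) probability per class

# A crude proxy for epistemic uncertainty: how far the two envelopes disagree.
epistemic_gap = float(np.sum(upper - lower))

print("lower envelope:", lower)
print("upper envelope:", upper)
print("epistemic gap :", round(epistemic_gap, 3))
```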

From the researchers’ perspective, the main value of this work is to strengthen the theoretical and practical foundations for uncertainty-aware machine learning. For society, it offers tools that can help AI systems communicate uncertainty more clearly, which is important in areas such as medicine, scientific modeling, and decision-making where understanding the limits of available information is critical.

The researchers investigate why a common training technique for neural networks, called Label Smoothing, sometimes behaves in unexpected and counterproductive ways. Label Smoothing is meant to prevent models from becoming overly confident in their predictions. However, recent observations show two issues: it can make models *more* confident when they are actually wrong, and it can compress the model’s internal representations so tightly that subtle differences within the same class get lost.

The authors examine the mathematics behind Label Smoothing and identify the cause: the method unintentionally contains an error-amplification term. When the model makes a mistake, this term encourages it to reinforce the wrong prediction, increasing the model’s confidence in the incorrect answer and pushing representations toward an overly uniform structure.

To address this, the researchers propose Max Suppression (MaxSup). Instead of reducing the confidence of the true class (as Label Smoothing does), MaxSup reduces the confidence of whichever class the model currently believes most—whether this belief is correct or not. This creates a more consistent and fair form of regularization: correct predictions are still prevented from becoming overconfident, while incorrect predictions are not unintentionally strengthened.
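
The contrast between the two penalties can be sketched in a few lines, assuming the common decomposition of label smoothing as cross-entropy plus a penalty on the target logit, with a MaxSup-style variant that penalizes the largest logit instead. The coefficient and exact formulation are illustrative rather than the paper's precise loss.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, eps=0.1):
    """Label smoothing, written as cross-entropy plus a penalty on the target logit."""
    ce = F.cross_entropy(logits, target)
    reg = logits.gather(1, target[:, None]).squeeze(1) - logits.mean(dim=1)
    return ce + eps * reg.mean()

def maxsup_loss(logits, target, eps=0.1):
    """MaxSup-style loss: penalize whichever logit is currently largest."""
    ce = F.cross_entropy(logits, target)
    reg = logits.max(dim=1).values - logits.mean(dim=1)
    return ce + eps * reg.mean()

logits = torch.tensor([[2.0, 0.5, -1.0],    # confident and correct
                       [0.2, 3.0, -0.5]])   # confident but wrong (true class is 0)
target = torch.tensor([0, 0])

print("label smoothing:", label_smoothing_loss(logits, target).item())
print("max suppression:", maxsup_loss(logits, target).item())
```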

Experiments show that MaxSup keeps more natural variation within each class, strengthens the separation between classes, and improves accuracy on standard benchmarks such as ImageNet. Models trained with MaxSup also transfer better to new tasks, like semantic segmentation, and show more meaningful attention patterns in visualization tools.

From a societal perspective, this research provides a clearer understanding of how widely used training techniques behave and offers a simple, low-cost improvement. Better-calibrated and more robust models can contribute to safer and more reliable AI systems in areas where classification quality and interpretability matter, without introducing new risks or dependencies.

The researchers investigate how errors or manipulations in training data can compromise the reliability of AI systems. Existing defenses usually rely on heuristics and often fail against new or more complex attacks. The authors therefore propose MIBP-Cert, a new method that provides provable guarantees about how much training data perturbations can influence a machine-learning model.

The core idea is to model the entire training step—forward pass, loss computation, backward pass, and parameter update—as a single mathematical optimization problem. By solving this problem, the method determines all possible parameter values a model could take if the training data were perturbed within a specified range. These “reachable parameter sets” allow the researchers to guarantee when a model’s predictions will remain unchanged, even under worst-case manipulations. Unlike earlier approaches, which rely on coarse approximations and tend to become unstable, MIBP-Cert preserves the exact relationships between variables within each training step. This leads to tighter and more stable bounds.
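
MIBP-Cert itself solves mixed-integer bilinear programs, which is beyond a short sketch, but the notion of a "reachable parameter set" can be illustrated on a one-parameter linear model: sweep every admissible perturbation of a training feature through one gradient step and record the range of weights that can result. The brute-force sweep below is only a stand-in for the analytical bounds the method computes.

```python
import numpy as np

# One training example whose feature may be perturbed within +/- delta.
x, y, delta = 2.0, 1.0, 0.5
w, lr = 0.3, 0.1            # current weight and learning rate

def grad(w, x, y):
    """Gradient of the squared loss 0.5 * (w * x - y)**2 with respect to w."""
    return (w * x - y) * x

# Sweep the admissible perturbations through one gradient step and record the
# range of weights that can result (the "reachable set" after this step).
xs = np.linspace(x - delta, x + delta, 1001)
reachable = w - lr * grad(w, xs, y)

print(f"reachable weights after one step: [{reachable.min():.3f}, {reachable.max():.3f}]")
```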

Experiments on synthetic and real datasets show that the method delivers higher and more consistent certified accuracy, especially when perturbations are large. It also supports more complex scenarios—such as uncertain survey responses or missing health data—that previous certification techniques could not handle. The main trade-off is computational cost: solving many mixed-integer bilinear programs is slower than using simpler approximations, though still manageable for small to medium-sized models.

From a societal perspective, this research provides a principled way to understand and control how training data quality affects AI behavior. As AI systems increasingly rely on heterogeneous, noisy, or user-generated data, methods like MIBP-Cert can help ensure that models behave predictably even when the data they learn from are imperfect. While not a complete solution to all data-quality challenges, it offers a step toward more transparent and trustworthy AI systems. 

The researchers present NEURULES, a new method for creating rule-based machine-learning models that remain easy for humans to understand while still achieving high predictive accuracy. Rule lists are simple “if–then–else” structures: they check conditions in order and make a decision based on the first rule that applies. These models are valued in areas such as medicine or credit scoring, where transparent decisions are important. However, existing techniques often struggle because they must simplify continuous data beforehand, limit the complexity of rules, or cannot efficiently search through the large number of possible rule combinations.

NEURULES addresses these challenges by turning rule-list learning into a form that can be optimized with gradient-based training, similar to how neural networks are trained. Instead of defining rule conditions in advance, the method learns the thresholds directly from data. It also builds rules and decides their order during training, without needing manual constraints. A key element is a “gradient-shaping” mechanism that naturally encourages rules to use only the conditions that truly matter, keeping them short and readable. Another component allows the ordering of rules to be learned smoothly and then converted into a strict sequence once training is complete.
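
The core trick of learning thresholds by gradient descent can be sketched by replacing the hard condition "x > threshold" with a sigmoid of adjustable steepness; the toy data, temperature, and training schedule below are illustrative assumptions, not the NEURULES implementation.

```python
import torch

torch.manual_seed(0)

# Toy data: the "true" rule is (x > 2.0) -> label 1.
x = torch.rand(512) * 4.0
y = (x > 2.0).float()

threshold = torch.tensor(0.5, requires_grad=True)    # learnable rule threshold
temperature = 0.1                                    # controls how soft the rule is
opt = torch.optim.Adam([threshold], lr=0.05)

for step in range(300):
    # Soft version of the hard condition "x > threshold".
    rule_logits = (x - threshold) / temperature
    loss = torch.nn.functional.binary_cross_entropy_with_logits(rule_logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"learned threshold: {threshold.item():.2f}")  # close to 2.0
```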

Across many real-world datasets, including binary and multi-class problems, NEURULES performed as well as or better than established rule-learning systems, often producing more compact models with competitive accuracy. The method also scales to larger datasets where exact or combinatorial approaches become too slow.

From a societal perspective, this research contributes to the development of interpretable machine-learning tools that can support transparent and accountable decision-making. By improving the balance between accuracy and interpretability, NEURULES may help practitioners adopt machine learning in sensitive domains with greater confidence, while still requiring careful evaluation and responsible use in real applications.

This research investigates how to make conformal prediction—a statistical method that produces reliable, model-agnostic prediction sets—more robust to input noise, while keeping computation practical. Existing robust conformal prediction methods often rely on randomized smoothing, which adds noise to inputs so that small perturbations do not change predictions too much. However, these methods typically require dozens or even hundreds of repeated model evaluations per input, making them too slow for many real-world applications.

The researchers show that much of this repetition is unnecessary. Their central observation is that standard conformal prediction, when combined with just one noise-augmented inference, already displays a surprising degree of robustness. Building on this insight, they design RCP1, a method that needs only a single noisy forward pass per input. Instead of certifying the robustness of individual model outputs, RCP1 certifies the robustness of the conformal prediction procedure as a whole, which simplifies computation substantially.
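
The ingredient RCP1 builds on, split conformal prediction with a single noise-augmented forward pass per input, can be sketched as follows; the toy classifier, noise level, and conformity score are illustrative, and the sketch omits the paper's robustness certificate.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    """Toy three-class 'classifier': softmax over fixed linear scores."""
    scores = np.stack([x.sum(1), x[:, 0] - x[:, 1], -x.sum(1)], axis=1)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def noisy_probs(x, sigma=0.25):
    """Single noise-augmented inference: one forward pass per input."""
    return model(x + rng.normal(scale=sigma, size=x.shape))

# Calibration data (labels here are simply taken from the clean model for the toy).
x_cal = rng.normal(size=(500, 2))
y_cal = model(x_cal).argmax(1)

# Conformity score: 1 minus the probability assigned to the true class.
alpha = 0.1
cal_scores = 1.0 - noisy_probs(x_cal)[np.arange(len(y_cal)), y_cal]
q = np.quantile(cal_scores, np.ceil((len(y_cal) + 1) * (1 - alpha)) / len(y_cal))

# Prediction set for a new input: every class whose score stays below the quantile.
x_new = rng.normal(size=(1, 2))
probs = noisy_probs(x_new)[0]
prediction_set = [c for c in range(3) if 1.0 - probs[c] <= q]
print("prediction set:", prediction_set)
```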

RCP1 is compatible with any underlying model and works for both classification and regression. Despite its low computational cost, it achieves prediction set sizes comparable to state-of-the-art methods that require tens or hundreds of samples. The authors also extend their approach to conformal risk control, allowing the method to deliver robust guarantees for tasks like image segmentation where errors are measured differently than simple misclassification.

In practice, RCP1 yields faster inference, works on larger and more accurate models that would be too expensive for traditional smoothing-based methods, and still provides formal robustness guarantees. A known limitation is that results can vary slightly more from one noisy sample to another, but this does not undermine overall reliability.

From a societal perspective, this research contributes to making machine-learning systems more predictable and trustworthy under realistic conditions—including noise, perturbations, and adversarial inputs—while keeping computational demands reasonable. This can support safer deployment of AI in settings where reliability is essential.

The researchers investigate why finetuning large neural networks is so resource-intensive and whether all parameters really need to be updated. Across multiple experiments in language and vision models, they observe a consistent pattern: during finetuning, the strongest gradient signals—those that drive learning—tend to occur in parameters whose values are very small. In contrast, large-magnitude weights, which often encode important knowledge learned during pretraining, receive smaller gradients and therefore change less.

Building on this observation, the authors propose NANOADAM, an optimizer that updates only the parameters with the smallest absolute values. Unlike other approaches, this method does not need gradients to decide which parameters to update, can precompute the selection mask, and avoids maintaining optimizer momentum terms for the large fraction of parameters that are never updated. The researchers show theoretically, using a simplified neural-network model, that updating small weights helps the model learn new information while leaving core representations intact. In other words, it reduces the risk of “catastrophic forgetting,” where adapting to a new task harms previously learned abilities.
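
The selection idea can be sketched in a few lines of PyTorch: compute a mask of the smallest-magnitude weights once, before finetuning, and zero the gradients of everything else so the optimizer leaves those weights untouched. The 10% selection fraction and the hook-based masking are illustrative choices, not the NANOADAM implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained layer that is about to be finetuned.
layer = nn.Linear(64, 64)
original = layer.weight.detach().clone()

# Precompute the update mask once: only the 10% smallest-magnitude weights train.
with torch.no_grad():
    k = int(0.10 * layer.weight.numel())
    cutoff = layer.weight.abs().flatten().kthvalue(k).values
    mask = (layer.weight.abs() <= cutoff).float()

# Zero the gradients of all other weights so the optimizer never moves them.
# (The actual method additionally avoids storing optimizer state for them.)
layer.weight.register_hook(lambda grad: grad * mask)

opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
x, target = torch.randn(32, 64), torch.randn(32, 64)
for _ in range(20):
    loss = nn.functional.mse_loss(layer(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

moved = (layer.weight.detach() - original).abs() > 0
print(f"fraction of weights updated: {moved.float().mean().item():.2f}")   # ~0.10
```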

In experiments on standard language and vision benchmarks, NANOADAM often matches or outperforms memory-efficient baselines. It enables larger learning rates, achieves better generalization, and results in much smaller overall shifts to pretrained parameters. In continual-learning tests, it preserves knowledge significantly better than full-update methods such as AdamW, particularly in larger and more overparameterized models. The method does have limitations: it relies on models being highly overparameterized and on some degree of similarity between the pretraining and finetuning tasks. When these conditions are not met, full-update optimizers may still be preferable.

For society, this research offers a more resource-efficient way to adapt large models to new tasks. By reducing memory and computation demands—and by limiting the loss of previously learned capabilities—it may help make advanced AI models more accessible, more reliable, and less costly to deploy in practical applications.

The researchers examine why a widely used method for distributed machine learning—Local SGD—often works better in practice than theory has so far explained, especially when data across participating devices differ. They focus on a type of data variation called second-order heterogeneity, which describes how strongly the “curvature” of each device’s learning problem differs from others. Earlier theoretical work suggested that this factor might play a key role, but a complete explanation was missing.

The paper confirms that second-order heterogeneity indeed governs how efficiently Local SGD can learn when communication between devices is limited. The researchers do this by establishing new mathematical lower and upper bounds on how fast the method can converge. These results show that when the devices’ data differ only mildly in this second-order sense, Local SGD can achieve good accuracy with fewer communication rounds. This helps explain why the method often outperforms alternatives such as mini-batch SGD.

A key technical advance is a more precise analysis of the consensus error—the temporary differences between the local models on each device before they are averaged. The authors show how this error depends on different forms of data heterogeneity and derive bounds that avoid previous restrictive assumptions. They then extend their analysis to cases where the learning objectives are smoother or even exactly quadratic, obtaining sharper guarantees. Controlled experiments on synthetic regression tasks support the theoretical findings and illustrate how first-order and second-order heterogeneity affect performance differently.
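
The training pattern analyzed here can be sketched with a small simulation: several workers take local stochastic gradient steps on their own quadratic objectives, whose differing curvatures play the role of second-order heterogeneity, and the models are averaged at each communication round. The objectives and noise levels below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, local_steps, rounds, lr = 4, 10, 20, 0.05

# Each worker minimizes its own quadratic 0.5 * a_k * (w - b_k)**2.
# Different curvatures a_k are a simple form of second-order heterogeneity.
a = rng.uniform(0.5, 2.0, size=num_workers)
b = rng.normal(size=num_workers)

w_global = 0.0
for _ in range(rounds):
    local = np.full(num_workers, w_global)
    for _ in range(local_steps):
        noise = rng.normal(scale=0.1, size=num_workers)     # stochastic gradients
        local -= lr * (a * (local - b) + noise)
    # Communication round: average the local models (the consensus step).
    w_global = local.mean()

# The minimizer of the averaged objective is sum(a * b) / sum(a).
print(f"Local SGD result: {w_global:.3f}, optimum: {np.sum(a * b) / np.sum(a):.3f}")
```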

From the researchers’ perspective, these results narrow several theoretical gaps and bring the understanding of Local SGD closer to real-world behavior. For society, the work offers a more solid foundation for communication-efficient and privacy-preserving learning methods, which are important in settings such as mobile devices and health applications. While not solving all challenges, it improves the reliability of tools that allow data to remain decentralized while still supporting effective collective learning.

The researchers investigate why training sparse neural networks directly from scratch still performs noticeably worse than training a dense network first and pruning it later. Their analysis shows that one overlooked but crucial element is the sign (positive or negative) of each weight. During dense-to-sparse training, many weights flip signs early and then stabilize. These early sign flips help the network move toward flatter, more robust solutions—something sparse training from scratch does not achieve reliably.

The authors demonstrate that if one could start sparse training with the right signs already aligned with the chosen sparse mask, performance would nearly match dense-to-sparse methods. However, sparse training fails to discover these signs on its own because sign flips are difficult once the model begins from a highly constrained, sparse state.

To address this, they introduce Sign-In, a reparameterization technique that gives each parameter an additional internal degree of freedom. This alteration changes how gradients act on the weights and makes sign flips more accessible during training. In simple theoretical models, Sign-In provably recovers correct signs in situations where standard sparse training fails. In large-scale experiments on common vision benchmarks, Sign-In consistently improves the accuracy of sparse models trained from scratch and also enhances some existing sparsification methods. Nevertheless, Sign-In still does not fully match the performance of approaches that begin with dense training, and the authors prove that no reparameterization alone can replace the benefits overparameterization provides.

Overall, this research clarifies a key mechanism—early sign alignment—behind the success of dense-to-sparse training and contributes a practical method that narrows the gap for sparse training from scratch. This can support future efforts to develop more efficient neural-network training pipelines and reduce computational costs, though dense training phases cannot yet be eliminated entirely.

The researchers investigated how well strong membership inference attacks can reveal whether a specific text sample was part of a large language model’s training data. To do this, they scaled a powerful attack method (“LiRA”) to an unusually large setting: thousands of GPT-2–style models trained on tens of billions of tokens. This allowed them to test these attacks under conditions much closer to real LLM training than previous work.
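
The scoring rule behind LiRA can be sketched compactly: fit one Gaussian to a sample's loss under reference models that included it in training and another to models that excluded it, then use the log-likelihood ratio as membership evidence. The losses below are synthetic placeholders rather than outputs of real models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample losses from reference ("shadow") models: some were trained with the
# candidate sample ("in"), some without it ("out"). Synthetic placeholders here.
losses_in = rng.normal(loc=1.8, scale=0.3, size=64)
losses_out = rng.normal(loc=2.4, scale=0.4, size=64)

def gauss_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

def lira_score(observed_loss):
    """Log-likelihood ratio: membership ('in') versus non-membership ('out')."""
    return (gauss_logpdf(observed_loss, losses_in.mean(), losses_in.std())
            - gauss_logpdf(observed_loss, losses_out.mean(), losses_out.std()))

# Score the target model's loss on the candidate sample.
print(f"low loss (1.7):  score {lira_score(1.7):+.2f}  -> evidence for membership")
print(f"high loss (2.5): score {lira_score(2.5):+.2f}  -> evidence against membership")
```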

They found that strong attacks can indeed outperform random guessing, but only to a limited extent under realistic training setups. Even when using many reference models, attack accuracy typically stayed below a commonly used threshold (AUC 0.7). Models trained in the usual “compute-optimal” way were not highly vulnerable, and larger models were not automatically easier to attack.

The team also uncovered an important nuance: even when the attack seems successful on average, the predictions for individual samples can be extremely unstable. Because training runs vary slightly depending on factors such as batch order, many membership decisions behave like coin flips—especially for samples whose statistical “signal” is weak or ambiguous. This means that an attack may guess correctly but without using reliable information.

The researchers further examined why some samples are more vulnerable. They observed that samples seen later during training and samples with greater length tend to be more at risk. However, they found no clear connection between samples that are vulnerable to membership inference and those that are easy to extract through standard training-data extraction attacks, suggesting these two risks reflect different types of memorization.

Overall, the research provides a clearer, more realistic benchmark for assessing privacy risks in LLMs. It shows that while strong membership inference attacks can work, their practical impact is presently limited. This helps society by grounding privacy discussions in empirical evidence and clarifying where genuine risks—and effective defenses—still need deeper investigation.

The researchers investigate why some sparsely connected neural networks are easier to train than others, even when they have the same number of remaining connections. Their central idea is to view pruning—the process of removing weights—as creating a sequence of graphs that describe which neurons remain connected. As networks become wider, these graphs grow larger and more regular. The authors propose that these growing graphs converge to a “graphon,” a mathematical object that captures the limiting connectivity pattern of a pruning method.

They test this hypothesis using several popular pruning-at-initialization techniques. Their experiments show that each method consistently produces masks that approach a distinctive graphon as network width increases. For example, random pruning leads to a uniform pattern, while other methods create more structured patterns that favor certain neurons.
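
A rough way to visualize such a limiting pattern is to generate masks at increasing widths, sort neurons by how many connections they keep, and block-average the mask into a fixed-size grid; with the random scores used below, this empirical estimate flattens toward the uniform pattern mentioned for random pruning. The scoring rule and grid resolution are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pruning_mask(width, sparsity=0.8):
    """Keep the top 20% of weights according to a random score (random pruning)."""
    scores = np.abs(rng.normal(size=(width, width)))
    return (scores > np.quantile(scores, sparsity)).astype(float)

def empirical_graphon(mask, resolution=8):
    """Sort neurons by degree, then block-average the mask into a small grid."""
    rows = np.argsort(-mask.sum(axis=1))
    cols = np.argsort(-mask.sum(axis=0))
    m = mask[rows][:, cols]
    blocks = m.reshape(resolution, m.shape[0] // resolution,
                       resolution, m.shape[1] // resolution)
    return blocks.mean(axis=(1, 3))

# As the width grows, the block-averaged mask settles toward a limiting pattern
# (for random pruning: a flat, uniform one).
for width in (64, 256, 1024):
    g = empirical_graphon(pruning_mask(width))
    print(f"width {width:4d}: mean density {g.mean():.2f}, corner block {g[0, 0]:.2f}")
```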

Building on this, the authors develop the “Graphon Neural Tangent Kernel” (Graphon NTK). This tool allows them to study how these limiting connectivity patterns influence the dynamics of training very wide sparse networks. They show that the kernel’s spectral properties—particularly how its eigenvalues are distributed—correlate with how quickly and effectively a sparse network begins to learn. Methods such as SNIP and Synflow concentrate the kernel’s energy in a way that leads to faster early training compared with random pruning.

Taken together, the work provides a unified mathematical framework for describing sparse network structures and understanding their trainability. While the analysis applies to idealized infinite-width settings, it offers practical insight: pruning strategies can be compared and potentially designed based on their induced graphons and the resulting kernel behavior.

For society, this research contributes foundational understanding rather than direct applications. By clarifying how to create sparse networks that train reliably, it may support future development of more efficient AI systems that require less computation and energy, without overstating immediate impacts.