In this study, the researchers examined whether vision-language models (VLMs)—AI systems that analyze both images and text—can reliably detect unsafe or inappropriate content across different formats. These models are increasingly used in areas like content moderation, where consistent ethical judgments are important. A key concern is the so-called “modality gap,” where the same concept might be judged differently depending on whether it is presented as text or image.
To assess this, the researchers compiled a dataset called UnsafeConcepts, containing 75 types of unsafe content (such as hate symbols, sexual harassment, or self-harm) along with over 1,500 images. They tested eight VLMs on their ability to both recognize unsafe content (perception) and judge it appropriately (alignment) in general safety contexts, like whether it’s suitable for social media.
The findings show that while most models could detect the presence of unsafe elements in images, they often failed to judge these images as unsafe in a broader context. The models performed much better when the same concepts were described in text, confirming the presence of a consistent modality gap.
To address this, the researchers developed a simplified reinforcement learning method to better align models’ behavior on images. Their approach did not rely on manually written training data but used an automated scoring system to guide model updates. The technique improved both the models' ability to judge visual content ethically and their capacity to provide informative explanations, without significantly harming their general performance.
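To make the idea of reward-guided alignment concrete, below is a minimal REINFORCE-style sketch in which an automated scorer, rather than manually written labels, drives the updates. The "judge" network, the feature vectors, and the scoring rule are toy placeholders invented for illustration; this is not the paper's actual method or reward design.

```python
# A minimal REINFORCE-style sketch of alignment driven by an automated scorer
# instead of manually written training data. The "judge" network, the feature
# vectors, and the scoring rule are toy placeholders, not the paper's method.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the VLM's safety judgment: logits over [appropriate, unsafe].
judge = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(judge.parameters(), lr=1e-2)

def auto_score(features: torch.Tensor, action: int) -> float:
    """Hypothetical automated scorer: rewards judgments that match a simple rule."""
    truly_unsafe = features.sum().item() > 0.0      # stand-in ground truth
    return 1.0 if (action == 1) == truly_unsafe else -1.0

for step in range(200):
    feats = torch.randn(8)                          # stand-in for image features
    dist = torch.distributions.Categorical(logits=judge(feats))
    action = dist.sample()                          # sampled safety judgment
    reward = auto_score(feats, int(action))
    loss = -reward * dist.log_prob(action)          # policy-gradient objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```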
From a societal perspective, this research helps improve the safety and reliability of AI systems used in public-facing applications. By identifying and reducing inconsistent behavior across image and text inputs, it contributes to building AI that behaves more responsibly in sensitive contexts.
This research addresses a challenge in the security testing of embedded systems: accurately emulating Direct Memory Access (DMA), a common way for devices to transfer data without involving the main processor. While rehosting—running embedded software in a simulated environment—is increasingly used to test firmware security, existing methods have struggled to handle the complexity of DMA, especially when it is configured in less straightforward ways.
To solve this, the researchers developed GDMA, a fully automated approach that can emulate all six known types of DMA configurations used in popular embedded devices. The method works without needing source code, documentation, or prior knowledge about the hardware. GDMA identifies how the firmware sets up and uses DMA by observing memory access patterns during execution. It builds a model of the DMA behavior by analyzing how firmware interacts with memory and peripheral interfaces, then feeds simulated input into these DMA channels to enable realistic testing.
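As a rough illustration of the kind of heuristic such trace-based identification can use (not GDMA's actual algorithm), the sketch below flags firmware writes to memory-mapped peripheral registers whose written values point into RAM as candidate DMA buffer pointers. The address ranges and trace entries are hypothetical.

```python
# Illustrative heuristic only (not GDMA's algorithm): flag firmware writes to
# memory-mapped peripheral registers whose written values point into RAM as
# candidate DMA buffer pointers. Address ranges and trace entries are made up.
RAM = range(0x2000_0000, 0x2002_0000)        # hypothetical SRAM region
MMIO = range(0x4000_0000, 0x5000_0000)       # hypothetical peripheral region

# (peripheral register address, value written) pairs observed during emulation
trace = [
    (0x4002_0014, 0x2000_1A00),  # value lies in RAM -> likely a DMA buffer pointer
    (0x4002_0018, 0x0000_0040),  # small constant     -> likely a transfer length
    (0x4001_3800, 0x0000_0001),  # single bit         -> likely an enable flag
]

candidates = [(reg, val) for reg, val in trace if reg in MMIO and val in RAM]
for reg, val in candidates:
    print(f"register {reg:#010x} holds RAM pointer {val:#010x}: "
          f"candidate DMA buffer to fill with fuzz input")
```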
In experiments on 114 firmware samples, GDMA significantly outperformed the only existing automated tool in this area. It was able to emulate six times more DMA mechanisms and covered all firmware types in two benchmark sets. When integrated with a testing tool, GDMA improved code coverage by up to 152% and helped uncover six previously unknown security vulnerabilities in real-world embedded software. These vulnerabilities were reported and received official identifiers (CVEs), confirming their relevance.
From a societal perspective, this research contributes to making embedded devices—used in everything from medical equipment to industrial systems—more secure. By enabling more complete and automated testing of firmware, GDMA helps identify weaknesses that might otherwise go unnoticed, ultimately supporting safer and more reliable digital infrastructure.
This research addresses a vulnerability in how large language models (LLMs) are adapted for specific purposes using system prompts—short instructions that guide the model's behavior. These prompts can be highly valuable, often representing intellectual property, but are easily exposed through so-called "prompt injection" attacks. Until now, there has been no effective method to prevent system prompts from being copied or stolen.
To mitigate this, the study proposes and evaluates a method called prompt obfuscation, which aims to preserve the system’s behavior while concealing the content of the system prompt. Two forms of obfuscation are introduced: one that modifies the prompt text itself ("hard" obfuscation), and another that operates within the model’s internal embedding space ("soft" obfuscation), where instructions are encoded as numerical vectors rather than human-readable text.
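A compact sketch of the "soft" variant follows, assuming white-box access to an embedding space: optimize a set of prompt vectors so the model's outputs match those produced with the original system prompt, then deploy only the vectors. A tiny stand-in network replaces the LLM so the example stays self-contained; it is not the paper's implementation.

```python
# Sketch of "soft" prompt obfuscation with a tiny stand-in network in place of
# the LLM: optimize soft-prompt vectors so the model behaves as if the secret
# system prompt were present, then ship only the (unreadable) vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab, n_soft = 32, 100, 8

embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab)                  # toy "language model"
for p in list(embed.parameters()) + list(lm_head.parameters()):
    p.requires_grad_(False)                          # only the soft prompt is trained

def forward(prefix_vecs, user_ids):
    """Prepend a prompt prefix (as vectors) to the embedded user input."""
    x = torch.cat([prefix_vecs, embed(user_ids)], dim=0)
    return lm_head(x.mean(dim=0))                    # toy pooling in place of attention

secret_prompt_ids = torch.randint(0, vocab, (12,))   # the readable system prompt
soft_prompt = nn.Parameter(0.02 * torch.randn(n_soft, d_model))
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(300):
    user_ids = torch.randint(0, vocab, (10,))        # sampled user inputs
    with torch.no_grad():                            # behavior with the real prompt
        target = F.log_softmax(forward(embed(secret_prompt_ids), user_ids), dim=-1)
    pred = F.log_softmax(forward(soft_prompt, user_ids), dim=-1)
    loss = F.kl_div(pred, target, log_target=True, reduction="sum")
    opt.zero_grad()
    loss.backward()
    opt.step()
# Only `soft_prompt` (unreadable vectors) needs to be deployed with the model.
```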
Experiments using established datasets show that the soft obfuscation method maintains the quality of model outputs while making it extremely difficult for attackers to reconstruct the original instructions—even with direct access to the model’s internal data. The hard obfuscation method, in contrast, was less secure and occasionally leaked partial information.
The approach was also tested on a real-world system prompt that had previously been leaked from a commercial application. The obfuscation technique successfully protected the prompt while preserving its intended function.
From a societal standpoint, this work offers a practical approach to safeguarding proprietary logic in AI systems without compromising performance. At the same time, the potential for misuse—such as concealing harmful or deceptive instructions—must be considered. The researchers therefore recommend combining technical protections with mechanisms for transparency and oversight to support responsible use of this technology.
This research investigates how software vulnerabilities are described in the Common Vulnerabilities and Exposures (CVE) database and examines whether the language used in these descriptions might unintentionally introduce bias. CVEs are widely used in cybersecurity to communicate information about known security issues in software. While these entries are primarily technical, they are also interpreted by a range of people—from IT professionals to decision-makers—who may rely on them when assessing risks.
The researchers conducted a detailed linguistic analysis of over 165,000 CVE descriptions to understand how the use of language has changed over time and whether it reflects implicit value judgments. They identified several recurring patterns. For example, certain terms tend to assign blame—such as implying that a developer failed to act—while others focus on the system or the vulnerability itself in a more neutral way. Over time, there has been a shift toward more neutral, technical phrasing, but evaluative or emotionally charged language still appears in many entries.
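As a toy illustration of this kind of pattern analysis, the snippet below tags CVE descriptions as blame-oriented or neutral using a small, hypothetical keyword list; the study's actual lexicon and statistical analysis are far more extensive.

```python
# Toy illustration: tag CVE descriptions as "blame-oriented" or "neutral"
# using a small, hypothetical keyword list (not the study's lexicon).
import re

BLAME = re.compile(r"\b(fails? to|does not (properly|correctly)|neglects?)\b", re.I)

descriptions = [
    "The application fails to validate user-supplied input before use.",
    "A buffer overflow exists in the parsing routine of component X.",
]

for text in descriptions:
    label = "blame-oriented" if BLAME.search(text) else "neutral"
    print(f"[{label}] {text}")
```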
The study highlights that even small differences in wording can influence how a vulnerability is perceived—for example, whether it is seen as the result of negligence or as a structural issue. These perceptions can, in turn, affect how organizations prioritize fixes or assign responsibility.
From a societal perspective, the research contributes to a more reflective and transparent way of handling cybersecurity information. By drawing attention to the role of language in shaping security decisions, it encourages the development of clearer, fairer, and more objective vulnerability reporting. This is especially relevant as cybersecurity becomes an increasingly critical part of public infrastructure and trust in digital systems.
Browser extensions extend the functionality of Web browsers with additional features, but their highly privileged nature also creates security and privacy risks. In our research, we sought to understand how extension developers perceive and address these risks during the development process. We interviewed 21 developers from diverse backgrounds about their development practices and experiences, and observed their behavior during two programming tasks focused on security and privacy.
Our findings show that most developers are aware of general risks, such as data theft or misuse of browser permissions, but often lack concrete knowledge about how to avoid them. For instance, while many chose secure APIs to store user data, they often did so based on convenience rather than a clear understanding of the privacy benefits. Similarly, when faced with a task that involved modifying security-relevant browser headers, several participants hesitated or chose insecure or outdated solutions—again highlighting a gap between intention and knowledge.
Developers’ practices are also influenced by external pressures. Many reported that browser platforms provide insufficient guidance on secure development and that the review process for publishing extensions is opaque and inconsistent. Monetization challenges and platform-specific constraints further complicate the picture, leading some developers to deprioritize security and privacy in favor of functionality or business needs.
From a societal perspective, our research highlights a tension between user safety and the realities of software development. Developers often care about doing the right thing but lack the resources, incentives, or support to act accordingly. Addressing this gap—through more precise documentation, better developer tools, and fairer platform governance—could lead to more secure browser extensions and a safer web experience for everyone.
This research focuses on a growing threat in machine learning: data reconstruction attacks. These attacks aim to recover the original training data used to build a model, even when the attacker has only limited access to the model itself. Such threats raise serious privacy concerns, particularly when models are trained on sensitive information like personal images or medical records.
Although many studies have examined data reconstruction, they often rely on differing definitions, assumptions, and evaluation methods. This inconsistency makes it difficult to compare results or assess the true risk posed by these attacks. To address this issue, the paper introduces a comprehensive framework for defining and evaluating data reconstruction attacks, specifically in the context of image-based models.
The contributions are threefold. First, a clear definition and taxonomy are proposed, categorizing attacks based on the attacker’s access to model outputs and internal data. Second, a set of evaluation metrics is introduced to measure both the accuracy and the diversity of reconstructed data, incorporating both technical similarity scores and assessments using large language models to approximate human judgment. Third, the framework is applied to ten prominent attack methods across different scenarios to identify which techniques perform best under various conditions.
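To illustrate the two metric families, here is a simplified sketch using mean squared error as a stand-in for fidelity and average pairwise distance as a stand-in for diversity; the framework's metrics, including its LLM-based judging, are richer than this.

```python
# Simplified stand-ins for the two metric families: per-sample fidelity as mean
# squared error against the true training image, and diversity as the average
# pairwise distance among reconstructions.
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((32, 32))                       # stand-in training image
recons = [original + rng.normal(0, s, original.shape) for s in (0.05, 0.1, 0.2)]

def fidelity(x, y):
    return float(np.mean((x - y) ** 2))               # lower = closer reconstruction

def diversity(samples):
    pairs = [np.mean((a - b) ** 2)
             for i, a in enumerate(samples) for b in samples[i + 1:]]
    return float(np.mean(pairs))                      # higher = more varied outputs

print("fidelity:", [round(fidelity(original, r), 4) for r in recons])
print("diversity:", round(diversity(recons), 4))
```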
Results show that the quality of reconstructed data varies significantly depending on how much the model memorizes its training data and the extent of the attacker’s access. Existing evaluation metrics often diverge from human assessments, highlighting the need for better benchmarks.
This work provides a foundation for more consistent and meaningful comparisons in future research on data reconstruction. From a societal perspective, it improves understanding of privacy risks in machine learning and contributes tools that can help assess and mitigate these threats in a more systematic and transparent way.
The researchers address a long-standing challenge in cybersecurity: how to recover secrets, such as cryptographic keys, from software systems using only indirect clues—specifically, the patterns of memory accesses that leak through side-channel attacks. Traditional side-channel attacks often require deep manual analysis and expert knowledge of how a program is structured. The goal of this research was to make proofs of concept for such attacks more practical and accessible by automating the process of extracting secrets from binary programs.
To achieve this, the team developed SCASE, a method that combines symbolic execution—a technique for automatically analyzing software behavior—with side-channel memory traces. These traces reveal which parts of memory were accessed during a program’s execution, without needing access to the original source code. By guiding the symbolic analysis using this leaked information, SCASE drastically reduces the complexity of the search and allows the automatic reconstruction of the secret data.
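A classic toy example of the underlying signal (not Athena itself): in square-and-multiply exponentiation, the multiply routine runs only for the 1-bits of the secret exponent, so a trace of accesses to that routine reveals the key. The sketch below simulates the victim and then reconstructs the exponent from the access pattern alone.

```python
# Toy example: recover a secret exponent from the memory-access pattern of a
# square-and-multiply loop, simulating both victim and attacker.
def square_and_multiply(base, exp, mod, trace):
    result = 1
    for bit in bin(exp)[2:]:
        result = (result * result) % mod
        trace.append("SQUARE")
        if bit == "1":
            result = (result * base) % mod
            trace.append("MULTIPLY")   # the access the side channel observes
    return result

secret_exponent = 0b101101
trace = []
square_and_multiply(7, secret_exponent, 1019, trace)

# Attacker side: reconstruct the exponent purely from the access pattern.
bits, i = [], 0
while i < len(trace):
    if i + 1 < len(trace) and trace[i + 1] == "MULTIPLY":
        bits.append("1")
        i += 2
    else:
        bits.append("0")
        i += 1
recovered = int("".join(bits), 2)
assert recovered == secret_exponent
print(f"recovered exponent: {recovered:#b}")
```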
To demonstrate their approach, the researchers built Athena, a prototype tool that applies SCASE to extract secrets from protected environments such as Intel SGX enclaves. Athena successfully recovered RSA, AES, and RC4 cryptographic keys, as well as non-cryptographic data such as inputs to poker-hand evaluators, all without requiring manual reverse engineering. The process worked reliably across different types of applications and memory leakage patterns.
From a societal perspective, this research highlights both a risk and an opportunity. On one hand, it shows that secrets in supposedly secure software can be extracted more easily than previously assumed, underlining the need for stronger defenses. On the other hand, the techniques developed here can help security researchers and developers find and fix vulnerabilities in complex systems more effectively, ultimately contributing to more secure computing environments.
This study investigates how transparently research is reported in the field of usable privacy and security (UPS), a discipline that combines technical cybersecurity with research involving human participants. Transparency is a key element of scientific integrity—it allows others to assess findings, replicate studies, and understand how results were obtained. Despite its importance, there has been limited systematic analysis of transparency practices in this area.
To fill this gap, the researchers analyzed 200 peer-reviewed UPS papers published between 2018 and 2023 across twelve prominent conferences. Each paper was evaluated using 52 transparency criteria, including the reporting of research questions, completeness of methodological descriptions, availability of study materials, and accessibility of supplementary artifacts such as code or data.
The results show that, on average, papers fulfilled about two-thirds of the applicable transparency criteria. While most studies clearly stated their research goals and described basic procedures, many lacked access to essential materials like questionnaires or interview guides. Inconsistencies were also observed—for example, some papers provided detailed reporting in some areas but omitted key information in others. Online resources were not always reliable, with a noticeable proportion of links leading to unavailable websites or incomplete materials.
Factors influencing transparency included paper length and methodological complexity. Longer papers tended to be more transparent, while studies combining multiple methods were less so, likely due to space limitations. Surprisingly, the presence of an artifact evaluation badge did not significantly correlate with higher transparency.
This research offers a structured overview of current reporting practices in UPS and identifies clear areas for improvement. For society, it supports efforts to make cybersecurity research involving humans more open, replicable, and trustworthy. The findings suggest that targeted changes—such as clearer community guidelines and more robust infrastructure for sharing materials—could strengthen the scientific value and social accountability of UPS research.
Web measurement studies play a key role in understanding privacy and security online. However, they often face serious limitations: many tools used in such studies are custom-built, hard to reuse, and produce results that are difficult to reproduce or compare. This research introduces WebREC, a measurement tool, and the corresponding `.web` archive format, both designed to address these challenges by providing a more reliable and standardized foundation for studying website behavior.
WebREC uses a modified version of the Chromium browser that captures how websites are loaded and executed, including interactions between web pages and JavaScript code. It records not only which resources are loaded, but also how they are used—such as which scripts are run, what actions they trigger, and how pages behave during execution. The `.web` format stores all this information in a structured way, making it possible to reanalyze measurements later without having to replay requests and responses.
The study shows that WebREC produces more accurate and reproducible data than existing tools. For instance, it detects all JavaScript calls matching a verified baseline, compared to around 60% for traditional web archive formats. It also captures more reliable information on dynamic web content and offers clearer attribution for network requests—useful for tracking third-party services or analyzing privacy risks.
Importantly, the researchers found that 70% of recent studies in the field could have used WebREC directly without the need for custom crawlers, and nearly 50% could have been conducted using only existing `.web` archives, without any new crawl.
This research evaluates how well Apple AirTags' anti-stalking features protect people from being tracked without their consent. AirTags are small, inexpensive tracking devices originally designed to help users find lost items, but they have also been misused for stalking. To reduce misuse, Apple introduced unwanted tracking notifications and features to help people detect and find unknown AirTags near them. However, the effectiveness of these features had not been thoroughly tested under realistic conditions.
The researchers conducted two studies. In the first, they measured how reliably and quickly tracking notifications were triggered on iOS and Android devices. Results showed that iOS devices were significantly more reliable and timely than Android devices. All iOS users received a warning within a day, while only slightly more than half of Android users did.
In the second study, participants unknowingly carried an AirTag and later received a tracking notification. Their reactions varied: some investigated the alert, others asked friends or family for help, and some ignored it. A common finding was that many users either misunderstood what the warning was trying to communicate or did not perceive it as a threat to them. The notification design, lack of clear threat explanation, and general unfamiliarity with AirTags often led to inaction. When participants were asked to find a hidden AirTag, many struggled with the confusing interface and faint locating sound. Features like the "Find Nearby" function were underused due to unclear labeling.
This research highlights critical shortcomings in current anti-stalking protections. For society, it provides evidence-based recommendations to improve the usability and clarity of these safety features. Better notifications, clearer interfaces, and support for diverse users could help protect more people from technology-facilitated abuse. However, current solutions remain insufficient, particularly for those without access to newer smartphones or with limited technical understanding.
The researchers investigated a new way that attackers can bypass a widely used security mechanism in modern software called Control-Flow Integrity (CFI). CFI is designed to prevent hackers from hijacking how a program executes by restricting its allowed control flow—essentially, keeping the program on known and safe paths. However, programming languages evolve, and with C++20, a new feature called coroutines was introduced to make asynchronous programming easier and more efficient.
Coroutines allow functions to pause and resume their execution, and to support this, they store important execution data—like where to resume and what data to use—in the computer's heap memory. The researchers found that this heap memory is not protected by current CFI methods. That means attackers who can change memory—via common vulnerabilities like buffer overflows—can manipulate coroutine data to reroute how programs run.
They introduced a new attack method called Coroutine Frame-Oriented Programming (CFOP), which exploits these coroutine features to hijack control of a program, even when advanced CFI protections are in place. They demonstrated the attack with real examples, including in a popular database system (ScyllaDB) and an open-source operating system (SerenityOS), both of which use coroutines.
To address the problem, the researchers proposed technical solutions that would make coroutine memory safer, such as moving sensitive pointers out of writable memory.
This research highlights an important gap in existing security protections. As programming tools evolve, so must the security mechanisms that protect them. By identifying this vulnerability early, the study helps developers and compiler makers adapt their systems and better protect future software against such sophisticated attacks.
Many people use ad blockers to protect their privacy and improve their online experience. These tools work by blocking ads and trackers using customizable lists of rules. Some users go a step further and personalize these settings to block even more unwanted content. In this study, we examined whether such personalization could unintentionally harm users' privacy.
We discovered that customizing ad blockers can make users more identifiable on the web. By analyzing which filter lists are active, websites can generate a unique “fingerprint” of a user's ad blocker setup. This is possible even without JavaScript (which many privacy-conscious users block as a defense), by relying on subtle web features like CSS. We developed new methods that detect these configurations quietly and efficiently. In our experiments, we were able to uniquely identify many users who had carefully tuned their ad blocker settings, and in other cases narrowed their anonymity set to as few as 48 users among tens of thousands.
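As a rough sketch of how a script-free probe could work: each CSS rule points its background image at a URL that a particular filter list would block, so the server can infer the active lists from which probe requests never arrive. The list names, probe URLs, and visitor identifier below are invented for illustration and are not taken from the paper.

```python
# Build a hypothetical script-free probe page: blocked vs. fetched background
# images reveal which filter lists a visitor's ad blocker has enabled.
PROBES = {
    "easylist":    "https://probe.example/banner_ad_300x250.png",
    "easyprivacy": "https://probe.example/pixel/track.gif",
    "annoyances":  "https://probe.example/cookie-consent/overlay.png",
}

rules = "\n".join(
    f'#probe-{name} {{ width: 1px; height: 1px; '
    f'background-image: url("{url}?uid=VISITOR_ID"); }}'
    for name, url in PROBES.items()
)
divs = "\n".join(f'<div id="probe-{name}"></div>' for name in PROBES)
page = f"<!doctype html>\n<style>\n{rules}\n</style>\n{divs}\n"
print(page)
# Server side (not shown): the pattern of fetched vs. blocked probe URLs forms
# a fingerprint of the visitor's filter-list configuration.
```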
The study also showed that these identifying fingerprints remain stable over time, making them useful for tracking individuals across visits. Furthermore, existing tools meant to detect or block such tracking methods are not equipped to recognize these new “scriptless” attacks. The researchers evaluated possible defenses and found that completely standardizing filter lists or disabling personalization would reduce privacy risks—but at the cost of user choice, performance, and usability.
This work sheds light on a complex trade-off: efforts to increase privacy through customization can backfire and reduce anonymity. By rigorously analyzing this risk, we provide important guidance to developers, browser makers, and users. Our findings highlight the need for more nuanced privacy tools that account for both security and usability, contributing to a better-informed conversation about how to protect privacy on the web.
Large language models (LLMs) are increasingly capable of generating text that resembles human language. While this brings many benefits, it also introduces new risks, particularly in the form of hate speech. In this study, the researchers examined how well existing hate speech detectors can identify content produced by LLMs and whether those detectors can withstand attempts to evade them.
To evaluate the situation, the team developed a benchmark framework called HATEBENCH, containing over 7,800 samples of LLM-generated text targeting 34 identity groups. Each sample was carefully labeled by human experts. They then tested eight commonly used hate speech detectors against this dataset. The results showed that while some detectors performed well with earlier LLMs like GPT-3.5, their performance declined significantly with newer models such as GPT-4. This is likely due to more complex and nuanced language used by newer LLMs.
The researchers also demonstrated that malicious actors could manipulate LLM-generated hate speech to avoid detection. These so-called “adversarial attacks” involve making small changes to the wording while keeping the hateful message intact. In some cases, the manipulated messages went undetected over 96% of the time. The study further showed that attackers could improve the efficiency of such attacks by creating local copies of the detectors, an approach that makes automated hate campaigns faster and harder to trace.
This research highlights a growing challenge in online safety. It underscores the need to update hate speech detectors continuously and to develop new tools that are robust against advanced manipulation techniques. While the findings are concerning, the benchmark dataset and methods introduced in this work provide a practical foundation for future improvements in content moderation technologies.
Many studies in computer security use Stack Overflow to analyze programming practices. However, such research typically relies on cross-sectional studies of Stack Overflow content, i.e., a snapshot of the platform at a single point in time. In this study, the researchers examined how the ongoing evolution of Stack Overflow content, such as code snippets and comments, affects the reliability of past research results based on that content.
They first reviewed 42 earlier studies that used Stack Overflow to investigate code security, looking at what aspects of Stack Overflow these studies relied on—such as programming languages or how comments provide context to code. They found that many of these studies did not account for how the content might change over time. To test the effects of this evolution, the researchers replicated six of those studies using newer versions of the Stack Overflow dataset.
Their replication efforts revealed that in four of the six cases, the results were significantly different when using more recent data. For example, newer Stack Overflow posts had more code snippets with security issues, and the types of common weaknesses had changed. In some cases, tools built to detect insecure code no longer performed as accurately as before. This means that findings drawn from one version of Stack Overflow data might not hold true later, even just a few years after the original study.
The researchers conclude that Stack Overflow data should be treated as a time series, not a fixed snapshot. For future studies to remain reliable and useful, they recommend analyzing trends over time and placing results in the context of when the data was collected.
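A minimal sketch of what such time-aware analysis looks like in practice: group posts by year and report the insecure-snippet rate per year rather than a single platform-wide figure. The post records below are toy data.

```python
# Toy time-series view: per-year rate of posts containing insecure snippets.
from collections import defaultdict

posts = [  # (year, contains_insecure_snippet)
    (2018, True), (2018, False), (2020, True), (2020, True),
    (2022, False), (2022, True), (2022, True),
]

by_year = defaultdict(list)
for year, insecure in posts:
    by_year[year].append(insecure)

for year in sorted(by_year):
    flags = by_year[year]
    print(f"{year}: {sum(flags)}/{len(flags)} posts contain insecure snippets")
```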
This research helps strengthen the scientific practice in cybersecurity by encouraging more careful, time-aware methods. In doing so, it supports the development of tools and insights that remain relevant as platforms like Stack Overflow continue to evolve.
Smartphone theft is a widespread problem, yet little is known about how people prepare for, experience, and respond to it. To address this gap, the researchers conducted detailed interviews with 20 individuals who had recently had their phones stolen. These cases ranged from pickpocketing to armed robbery and occurred in diverse contexts and countries.
The study found that most people are poorly prepared for phone theft. They often rely on basic protections such as screen locks and feel a false sense of security. When theft occurs, they experience shock, helplessness, and fear—particularly around the loss of personal photos and sensitive information like banking data. Many users struggle with regaining control, especially when two-factor authentication relies on the stolen device.
Participants’ first actions typically include trying to track the phone, activating “Lost Mode,” and contacting banks and mobile providers. However, existing recovery processes are often confusing, fragmented, and poorly coordinated. Emotional support usually comes from family or friends, while help from authorities and vendors is perceived as limited. After the incident, many shift from relying on technical solutions to behavioral strategies, such as avoiding risky situations or using cheaper backup phones in public.
The study highlights significant gaps in user guidance, device design, and recovery processes. It suggests that phone vendors, app developers, mobile carriers, and policymakers can do more to support victims—such as offering clearer security instructions, easier account recovery, and coordinated response platforms.
By focusing on real-life experiences rather than hypothetical scenarios, this research makes the risks of smartphone theft more visible and offers insights grounded in lived experience to improve both personal preparedness and system-wide support. In doing so, it contributes to better protecting users in an increasingly mobile and connected world.
Modern embedded systems like medical devices, industrial robots, and smart home appliances rely on firmware to function. Ensuring the security of this firmware is essential, but testing it is challenging—especially when the firmware depends on hardware events called interrupts to function properly. If these interrupts are not triggered in the right way during testing, the firmware can crash or behave incorrectly, preventing security bugs from being discovered.
In this study, the researchers developed a new testing tool called AidFuzzer, which improves how firmware is tested by focusing on better handling of interrupts. Previous tools triggered interrupts at fixed intervals or based on fuzz input data, without considering the firmware’s internal state. This often caused errors or missed vulnerabilities. AidFuzzer, in contrast, observes how the firmware runs and only triggers specific interrupts when they are needed. It does this by recognizing whether the firmware is actively processing or waiting for input and by tracking which interrupts can properly cause progress.
The researchers tested AidFuzzer on ten real firmware programs and compared it to existing tools. AidFuzzer found more bugs and reached deeper into the code, identifying eight previously unknown vulnerabilities—some of which were confirmed and reported. It also produced fewer false positives, saving time during analysis.
By improving how interrupts are handled during firmware testing, this research helps make embedded systems more secure. It offers a practical way to detect serious issues in devices that are increasingly part of everyday life, without requiring access to source code or hardware. In doing so, it contributes to the broader goal of making the digital systems we depend on safer and more reliable.
Open-source software (OSS) is a foundation of the digital infrastructure we use every day. Despite its importance, the security practices around OSS—particularly during the design phase—are not well understood. In this study, researchers interviewed 25 OSS developers to understand how they identify and mitigate security threats, with a specific focus on whether they use formal methods known as “threat modeling.”
Threat modeling is a structured process intended to help developers foresee and prevent possible security issues. It is widely recommended but often thought to be too complex or time-consuming, especially in the volunteer-driven world of open-source development. The interviews revealed that almost all participants rely on informal, flexible approaches rather than structured methods. These “ad hoc” practices typically involve thinking about common threats from experience, rather than following formal procedures or documentation standards.
Several reasons for this preference emerged: OSS developers often contribute in their free time and are wary of processes that add overhead or require documentation that must be constantly updated. Many projects are small, decentralized, and lack security experts, making formal approaches feel impractical. Nevertheless, participants did think about threats, often by applying secure design principles or discussing potential issues in online tools like issue trackers.
A few participants did use structured methods—such as STRIDE or attack trees—but these were the exception. Some adapted these methods to fit their needs, skipping formal documentation or using simplified tools like checklists developed by security teams.
This research helps clarify why formal threat modeling is rarely used in OSS and offers suggestions for how structured methods could be made more usable and lightweight. For society, the findings highlight a need to better support the security of OSS projects—many of which underpin critical systems—by making secure design practices more accessible, especially to volunteer developers.
Researchers explored why updating security tools in software is difficult but crucial. Just like you might replace old locks on your doors, software needs to update its "digital locks" (called cryptography) to stay safe from hackers. These updates involve several types: swapping old security methods for stronger ones (like upgrading from SHA-1 to SHA-512), using longer digital keys (from 2048 to 4096 bits), improving communication rules between devices (like moving from TLS 1.2 to 1.3), or preparing for super-powerful future computers that could crack today’s locks (post-quantum cryptography).
Unfortunately, many programs still use outdated, vulnerable security because developers struggle with these updates—a problem confirmed by both past studies and recent security breaches. To understand these challenges better, the research team interviewed 21 experienced software developers about their real-world experiences.
They discovered developers update security for various reasons (not just hacking threats), but universally find the process complex, time-consuming, and frustrating. Developers often lack clear guidelines or step-by-step processes, facing major hurdles like insufficient security knowledge, outdated systems that block upgrades, and confusing instructions. Most participants expressed a strong need for accessible, jargon-free resources and direct help from security experts to succeed.
Based on these findings, the researchers offer practical suggestions for developers, universities, standards groups, and organizations facing the upcoming shift to post-quantum security—all aimed at making essential security updates less daunting and more effective for everyone.
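For readers unfamiliar with what these upgrade categories look like in code, here are generic examples using Python's standard library and the widely used `cryptography` package; they illustrate the categories the interviewees described rather than anything prescribed by the study.

```python
# Generic examples of cryptographic upgrades (illustrative, not from the study).
import hashlib
import ssl
from cryptography.hazmat.primitives.asymmetric import rsa

# 1. Hash upgrade: SHA-1 -> SHA-512
digest = hashlib.sha512(b"release-artifact").hexdigest()

# 2. Key-size upgrade: 2048 -> 4096-bit RSA
key = rsa.generate_private_key(public_exponent=65537, key_size=4096)

# 3. Protocol upgrade: require TLS 1.3 for outgoing connections
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3

print(digest[:16], key.key_size, ctx.minimum_version)
```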
This research examines how text-to-image models, such as Stable Diffusion, can be maliciously manipulated to produce unsafe images—particularly hateful memes—even when users input seemingly harmless prompts like “a photo of a cat.” While prior studies have shown that harmful content can be generated through inappropriate prompts, the researchers here focus on a more proactive and stealthy approach: poisoning the model itself during training to embed harmful behavior.
The study begins by demonstrating that poisoning attacks—where only a few training samples are altered—can lead to models that reliably produce targeted unsafe content in response to specific benign prompts. However, the researchers find that such attacks often result in side effects: even unrelated prompts can trigger harmful outputs, making these modifications easier to detect. They trace the cause of this problem to conceptual similarity between prompts—if two prompts are semantically close, the poisoning can affect both.
To address this, the team proposes a “stealthy poisoning attack” strategy. This method includes not only malicious samples but also “sanitizing” examples to limit the spread of harmful behavior to unintended prompts. They also identify “shortcut” prompts—phrases that resemble the desired harmful content more closely—which allow for more efficient attacks with fewer manipulated samples.
While the research demonstrates the feasibility of such attacks and evaluates them across multiple models and content types, the researchers also discuss mitigation strategies. These include post-generation checks, better model vetting, and fine-tuning with clean data.
From a societal perspective, the study sheds light on a significant and underexplored risk: users may unknowingly produce offensive or harmful content using publicly available AI tools. By identifying this vulnerability and proposing countermeasures, the research contributes to the broader effort to make AI systems more secure and trustworthy.
In this study, the researchers address a growing concern in the development and use of artificial intelligence: the increasing use of synthetic data generated by large language models (LLMs). Synthetic data is often used to reduce the costs and privacy risks of collecting real-world data, especially in areas like healthcare, law, and education. However, such data can also introduce biases, errors, or unintended consequences, particularly when used to train AI systems or produce visual analyses.
To help users and regulators better understand whether a model or result has been influenced by synthetic data, the researchers introduce the concept of “synthetic artifact auditing.” Their goal is to determine whether models (like classifiers or text generators) or outputs (like statistical plots) were trained with or influenced by LLM-generated synthetic data—even without access to the original training datasets.
They propose three methods for this auditing task: metric-based auditing, which uses performance differences between models trained on real versus synthetic data; tuning-based auditing, which relies on internal access to models to detect subtle behavioral patterns; and classification-based auditing for visual outputs. These methods do not require disclosure of proprietary training data and are designed to work under both limited and full access conditions.
The researchers tested their auditing framework across a range of text-based tasks and on visual outputs, and found that their methods reliably distinguished real from synthetic training origins with high accuracy in both domains.
From a societal perspective, this research contributes a practical tool for improving transparency and accountability in AI systems. As the use of synthetic data continues to expand, the ability to identify its presence helps ensure more informed oversight and supports responsible deployment of AI technologies.
In this study, the researchers investigate how generative AI models—such as image and language generators—can be misused to carry out privacy and security attacks on other machine learning systems. Unlike traditional attacks, which typically rely on having access to real training data or the inner workings of the target model, the approach presented here works without either. The researchers show that by carefully using publicly available generative AI tools, attackers can generate synthetic data that is realistic enough to launch successful attacks, even in a so-called “black-box” setting where only the model’s outputs are visible.
Three types of attacks were studied: model extraction (rebuilding a copy of a target model), membership inference (determining whether specific data was used to train a model), and model inversion (reconstructing input data based on model outputs). In each case, the researchers developed a multi-step method that starts by generating data using a generative model. This data is then refined using techniques like data augmentation and filtering to better match the target model’s behavior. The refined data is used to simulate the attack.
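A minimal sketch of the extraction step under a black-box assumption: query the target only through its predictions on synthetic inputs, keep the confident queries, and fit a surrogate. Plain random noise stands in for the generative model here, and small scikit-learn classifiers stand in for both target and surrogate; the paper's pipeline is more elaborate.

```python
# Black-box model extraction with no real data (toy stand-ins throughout).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hidden target model; the attacker only ever calls predict_proba on it.
X_private = rng.normal(size=(500, 10))
y_private = (X_private[:, 0] + X_private[:, 1] > 0).astype(int)
target = LogisticRegression().fit(X_private, y_private)

synthetic = rng.normal(size=(2000, 10))          # stand-in for generated data
probs = target.predict_proba(synthetic)          # black-box queries
keep = probs.max(axis=1) > 0.7                   # filter out uncertain samples
surrogate = LogisticRegression().fit(synthetic[keep], probs[keep].argmax(axis=1))

agreement = (surrogate.predict(synthetic) == probs.argmax(axis=1)).mean()
print(f"surrogate agrees with the target on {agreement:.1%} of queries")
```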
Across various experiments involving both image and text data, the researchers found that their data-free, black-box approach performs surprisingly well—often nearly matching the effectiveness of conventional attacks that require real data and more access. They also examined how variables like dataset size and data diversity affect attack performance.
From a societal perspective, the research highlights a new type of risk: widely available generative AI models can lower the barrier to launching sophisticated attacks on machine learning systems, even without access to sensitive or proprietary data. This calls for renewed attention to the design of machine learning models and the need for stronger defenses, especially in applications where privacy and trust are essential.
In this study, the researchers investigate how data duplication can be misused to undermine the effectiveness of “machine unlearning”—a process where AI systems are required to forget certain data upon request, often to comply with data protection laws like the GDPR. While previous work has examined unlearning methods and how to verify them, the role of duplicate or near-duplicate data in this context has been largely overlooked.
The researchers propose a novel type of attack in which an adversary inserts duplicate versions of data into the training set of an AI model. Later, the adversary requests the removal of those duplicated entries. However, since the same data still exists elsewhere in the training set, the model may continue to “remember” what it was supposed to forget, even after retraining. This creates a scenario where unlearning appears to succeed but is incomplete—something that can be exploited to falsely accuse the model owner of failing to comply with deletion requests.
To make the attack harder to detect, the team also developed techniques for generating “near-duplicates”—data that functionally resembles the original but appears different enough to bypass standard detection methods. They tested this approach across three AI learning settings: conventional machine learning, federated learning (involving multiple decentralized clients), and reinforcement learning (used in agents like game bots). In all settings, they found that carefully crafted duplicates could significantly reduce unlearning effectiveness, even with modern de-duplication defenses in place.
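A toy illustration of why near-duplicates are hard to catch: a lightly perturbed copy changes the bytes that exact-match de-duplication compares while keeping the sample statistically close to the original. The transform and magnitudes below are arbitrary choices for the example, not the paper's construction.

```python
# Toy near-duplicate that evades exact-hash de-duplication.
import hashlib
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((28, 28)).astype(np.float32)
near_dup = (original + rng.normal(0, 0.01, original.shape)).astype(np.float32)

def digest(x):
    return hashlib.sha256(x.tobytes()).hexdigest()

print("hashes match:        ", digest(original) == digest(near_dup))   # False
print("mean pixel distance: ", float(np.abs(original - near_dup).mean()))
# Evading distance-based de-duplication needs stronger, semantics-preserving
# transforms (e.g., flips or crops), which is what the paper's attack crafts.
```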
From a societal perspective, the research exposes a subtle but important vulnerability in how AI systems handle data removal. As unlearning becomes more widely adopted to meet privacy regulations, this work underscores the need for more robust methods to identify and manage duplicate data so that unlearning can be both effective and trustworthy.
The researchers set out to improve our understanding of “membership inference attacks” in situations where only the output labels of a machine learning model are visible. Such attacks try to determine whether a specific data point was used to train a model, which can have serious implications for privacy, especially when sensitive or proprietary data is involved.
Current methods for these so-called "label-only" attacks either require many queries to the model or lack precision, particularly when the model's behavior varies across diverse data. The researchers proposed a new technique, called DHAttack, that reduces the number of queries while improving accuracy. Their method works by measuring how far a data sample has to be modified before the model changes its prediction. Unlike earlier approaches, they calculate this distance in a consistent direction toward a fixed, clearly different data point (such as an entirely white image). This simplifies the measurement and improves reliability.
To further refine their results, they compare the behavior of the target model with that of similar models trained without the specific data sample. This comparison helps them judge whether a sample is likely to have been part of the original training data. Across multiple datasets and model types, their approach achieved better results with significantly fewer queries than existing methods. Even when assumptions were weakened—such as not knowing the exact architecture of the target model—DHAttack remained effective.
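A simplified sketch of the directional-distance measurement, assuming only label access to the model: walk a sample along a straight line toward a fixed all-white reference and record how far it travels before the predicted label flips, then threshold that distance. A toy scikit-learn classifier stands in for the target, and the shadow-model calibration is reduced to a fixed threshold.

```python
# Directional flip-distance as a membership score (toy stand-in for DHAttack).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((400, 64))                       # toy "images" with values in [0, 1]
y = (X.mean(axis=1) > 0.5).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

white = np.ones(64)                             # fixed, clearly different reference

def flip_distance(x, steps=100):
    base = model.predict(x[None])[0]
    for a in np.linspace(0.0, 1.0, steps):      # label-only queries along one direction
        mixed = (1 - a) * x + a * white
        if model.predict(mixed[None])[0] != base:
            return a                            # fraction of the path needed to flip
    return 1.0

score = flip_distance(X[0])                     # X[0] is a training member here
print(f"flip distance {score:.2f}: call it a member if above a calibrated threshold")
```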
This research provides a more practical way to test whether a machine learning model has used certain data, which could support efforts to enforce data privacy and compliance. However, it also highlights the potential for misuse, underlining the importance of developing safeguards alongside technical progress.
The researchers investigated whether modern vision-language models (VLMs)—AI systems that combine image understanding and text generation—are vulnerable to so-called *membership inference attacks*. These attacks aim to determine whether specific data, such as private photos or proprietary datasets, were used to train a model. This type of vulnerability poses a risk to privacy and copyright protection.
Focusing on the sensitive phase of VLM training known as instruction tuning, the researchers developed new methods for inferring whether a given set of images and texts was part of a model’s training data. They introduced a novel approach that examines how VLM outputs change when the model’s “temperature” parameter is adjusted—a factor that influences the randomness of generated responses. They found that data the model was trained on (member data) reacts more strongly to these changes than unseen data (non-member data), offering a new angle for detection.
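A simplified sketch of the temperature-sensitivity signal, with synthetic logits standing in for the VLM: compare the output distribution at a low and a high temperature and use the size of the shift as a membership score. The actual attack operates on generated responses and is considerably more involved.

```python
# Temperature-sensitivity as a membership signal (synthetic logits only).
import torch
import torch.nn.functional as F

def temperature_shift(logits, t_low=0.7, t_high=1.3):
    p_low = F.softmax(logits / t_low, dim=-1)
    log_p_high = F.log_softmax(logits / t_high, dim=-1)
    # KL divergence between the two tempered distributions
    return F.kl_div(log_p_high, p_low, reduction="sum").item()

peaked = torch.tensor([8.0, 1.0, 0.5, 0.2])     # confident, "member-like" behavior
flat = torch.tensor([1.2, 1.0, 0.9, 0.8])       # uncertain, "non-member-like"

for name, lg in [("peaked", peaked), ("flat", flat)]:
    print(name, round(temperature_shift(lg), 4))
# A threshold calibrated on known non-member data turns the score into a
# member / non-member decision.
```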
They tested their methods across different VLM architectures and under various assumptions about the attacker's capabilities. Even in the most challenging cases—where only the images were available, with no text—they were still able to achieve notable inference success. Their findings suggest that attackers could identify whether sensitive data was used in model training, even with minimal information.
From a societal perspective, this research provides a realistic and technical foundation for identifying unauthorized use of data in AI models. While it highlights a potential misuse scenario, it also equips data owners and developers with tools to detect and prevent it. The study emphasizes the importance of transparency and accountability in the training of AI systems and supports ongoing efforts to align AI development with ethical and legal standards.
The researchers investigated whether open-source vision-language models (VLMs) can understand and responsibly handle hateful memes—images that combine text and visuals to spread harmful ideologies. Using a dataset of 39 hateful memes, they evaluated how well seven widely used VLMs interpret such content in terms of visual cues, cultural references, and emotional tone. The models generally performed well, especially when given additional context like the name of the meme, demonstrating a basic ability to comprehend both the surface content and some of the underlying messages.
However, the study uncovered significant safety issues. The models frequently failed to recognize or reject explicitly hateful material and sometimes misinterpreted it as humorous or harmless. In a second part of the study, the researchers explored whether VLMs could be prompted—intentionally or unintentionally—to generate hateful text, such as hate speech, offensive jokes, or slogans. They found that this was indeed possible: around 40% of generated hate speech and over 10% of generated jokes or slogans were rated as harmful. Notably, these outputs often escaped the models’ built-in safety mechanisms, especially when more sophisticated prompting techniques were used.
These findings suggest that while VLMs show technical competence in understanding complex visual content, they currently lack adequate safeguards to prevent misuse. For society, this research highlights a critical need to improve the safety and ethical alignment of such models before they are more widely deployed. Doing so can help ensure that advances in artificial intelligence do not inadvertently amplify harmful content online.
The researchers examined whether fine-tuning large language models (LLMs) with data generated by other LLMs poses fewer privacy risks compared to using real-world data. Their motivation stemmed from the increasing popularity of synthetic data as a privacy-preserving alternative. However, given the sophistication of modern LLMs, the team questioned whether such generated data could still lead to unintended privacy leaks.
To investigate, the researchers fine-tuned various models using two methods: supervised fine-tuning with unstructured, LLM-generated email content, and self-instruct tuning with synthetic legal datasets. They then applied privacy attack techniques, including attempts to extract personally identifiable information (PII) and conduct membership inference attacks (MIA), to assess the resulting risks.
Their findings show that even when using synthetic data, fine-tuned models became more prone to leaking sensitive information. In one case, PII leakage increased by more than 20%, and in another, the success rate of MIA rose by over 40%. These risks were particularly pronounced when the generated data closely resembled the content or structure of the models’ pretraining data.
Moreover, the study found that factors like the size of the dataset, the learning rate, and the quality of the generated data significantly influenced the level of privacy risk. Lower learning rates and more varied data helped reduce, but not eliminate, these risks.
This research offers a valuable caution: synthetic data does not automatically safeguard privacy. It underscores the need for careful design and additional protective measures when fine-tuning language models. By highlighting overlooked vulnerabilities, this work contributes to the broader effort to develop more secure AI systems, especially as reliance on LLMs and synthetic data continues to grow.