With the advent of text-to-image models and concerns about their misuse, developers are increasingly relying on image safety classifiers to moderate unsafe generated images. Yet, the performance of current image safety classifiers remains unknown for both real-world and AI-generated images. In this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers, with a particular focus on the impact of AI-generated images on their performance. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe image categories (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers powered by general-purpose visual language models. Our assessment shows that existing image safety classifiers are not comprehensive and effective enough to mitigate the multifaceted problem of unsafe images. We also find a distribution shift between real-world and AI-generated images in image quality, style, and layout, which degrades the classifiers' effectiveness and robustness. Motivated by these findings, we build a comprehensive image moderation tool called PerspectiveVision, which improves the effectiveness and robustness of existing classifiers, especially on AI-generated images. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.
ACM Conference on Computer and Communications Security (CCS)