Novel diffusion models (DMs) can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of DMs’ parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve the efficiency and performance of textual generation by precisely targeting the cross and joint attention layers of DMs. We introduce several applications that benefit from localizing the layers responsible for textual content generation. First, we show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large DMs while preserving the quality and diversity of the DMs’ generations. Second, we demonstrate how the localized layers can be used to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM, SDXL, and DeepFloyd IF) and transformer-based (e.g., Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models like T5).
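To make the idea of localizing layers by activation patching concrete, the following is a minimal sketch on a toy model, not the procedure or code used in this work. It assumes a hypothetical stack of six layers whose "attention" output is a linear map of a prompt vector, with layer 3 deliberately given large weights. Every name here (`WEIGHTS`, `attention_out`, `run`, `effects`) is illustrative. Each layer is patched in turn with the activation computed from a different prompt, and the layer whose patch shifts the output most is treated as the localized one.

```python
import math

DIM, N_LAYERS = 4, 6

# Hypothetical toy weights: deterministic values, with layer 3 scaled up on
# purpose so that it dominates how the prompt influences the output.
WEIGHTS = [
    [[(5.0 if layer == 3 else 0.05) * math.cos(i + 2 * j + layer)
      for j in range(DIM)] for i in range(DIM)]
    for layer in range(N_LAYERS)
]

def attention_out(layer, prompt):
    """Toy stand-in for an attention output: a linear map of the prompt."""
    return [sum(w * p for w, p in zip(row, prompt)) for row in WEIGHTS[layer]]

def run(prompt, patched=None):
    """Run the toy model; `patched` maps a layer index to an injected activation."""
    patched = patched or {}
    hidden = [0.0] * DIM
    for layer in range(N_LAYERS):
        act = patched.get(layer, attention_out(layer, prompt))
        hidden = [h + a for h, a in zip(hidden, act)]
    return hidden

def distance(a, b):
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

source = [1.0, 0.0, 0.0, 0.0]  # stands in for the prompt with the original text
target = [0.0, 1.0, 0.0, 0.0]  # stands in for the prompt with different text

base = run(source)
# Patch one layer at a time with the activation from the target prompt and
# record how far the output moves from the unpatched run.
effects = [
    distance(base, run(source, {layer: attention_out(layer, target)}))
    for layer in range(N_LAYERS)
]
localized = max(range(N_LAYERS), key=effects.__getitem__)
print("most influential layer:", localized)
```

In a real diffusion model the patched quantity would be an intermediate tensor inside an attention layer and the effect would be measured on the rendered text in the image; the toy distance metric above is only a placeholder for that evaluation.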
International Conference on Learning Representations (ICLR)
2025-02-14
2025-02-26