Deep learning explainability is essential for understanding model predictions. Post hoc attribution methods determine which image pixels most influence a classifier's prediction, but they are not robust to imperceptible input noise, which undermines their trustworthiness. Certified attribution methods should prove that pixel-level importance values are robust; prior work, however, provides only image-level bounds, which are too coarse. We introduce a certification approach that guarantees the pixel-level robustness of any black-box attribution method against ℓ2-bounded input noise via randomized smoothing. By sparsifying and then smoothing attributions, we reformulate the setup as a segmentation certification problem. We propose novel qualitative and quantitative metrics to assess the certified robustness of three families of attribution methods. Qualitatively, visualizing pixel-level certificates complements the inherently visual nature of attribution methods by providing a reliable certified output for downstream tasks. Quantitatively, we introduce two metrics: (i) the percentage of certified pixels, which measures robustness, and (ii) certified localization, which measures how well certified pixels localize the object of interest. Extensive experiments on ImageNet show high-quality certified outputs across attribution methods and compare their certified visuals, robustness, and localization performance.
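To make the sparsify-then-smooth pipeline concrete, below is a minimal Python sketch in the style of standard randomized smoothing certificates (Cohen et al., 2019): attribution maps computed on Gaussian-noised inputs are binarized to their top-K pixels, per-pixel majority votes are collected, and a pixel is certified at a given ℓ2 radius when a lower confidence bound on its majority vote clears the corresponding threshold. The function and parameter names (`model_attr`, `k_frac`, etc.) are illustrative assumptions, and the per-pixel confidence bounds are shown without the multiple-testing correction that a full joint certification over all pixels would require.

```python
import numpy as np
from scipy.stats import norm, binomtest

def certify_pixel_attributions(model_attr, x, sigma=0.25, n=1000,
                               k_frac=0.2, radius=0.1, alpha=0.001):
    """Sketch: per-pixel certificates for a black-box attribution method.

    model_attr: image -> attribution map of shape (H, W)  (black box)
    sigma:      std of the Gaussian smoothing noise
    n:          number of noisy samples
    k_frac:     fraction of pixels kept when sparsifying each map
    radius:     l2 radius to certify
    alpha:      per-pixel significance level (joint correction omitted here)
    """
    H, W = x.shape[-2:]
    votes = np.zeros((H, W), dtype=int)  # how often each pixel is "important"
    for _ in range(n):
        noisy = x + sigma * np.random.randn(*x.shape)  # l2 smoothing noise
        attr = model_attr(noisy)
        # Sparsify: binarize the map to its top-K most important pixels.
        thresh = np.quantile(attr, 1.0 - k_frac)
        votes += (attr >= thresh).astype(int)

    # Smoothed output: per-pixel majority vote (important vs. unimportant),
    # i.e. a binary "segmentation" of the attribution map.
    majority = votes > n - votes

    # A pixel's vote share must exceed Phi(radius / sigma) to be certified
    # at the requested l2 radius (binary-class, Cohen-style bound).
    p_needed = norm.cdf(radius / sigma)
    certified = np.zeros((H, W), dtype=bool)
    for idx in np.ndindex(H, W):
        c = max(votes[idx], n - votes[idx])  # count for the majority class
        # One-sided lower confidence bound on the majority-class probability.
        p_low = binomtest(c, n).proportion_ci(
            confidence_level=1 - 2 * alpha, method="exact").low
        certified[idx] = p_low > p_needed  # otherwise the pixel abstains

    pct_certified = 100.0 * certified.mean()  # metric (i): % certified pixels
    return majority, certified, pct_certified
```

The returned `pct_certified` corresponds to metric (i) above; a certified localization score could then be computed by intersecting the certified important pixels with a ground-truth object region, though that exact definition is an assumption here rather than the paper's stated formula.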
International Conference on Machine Learning (ICML), July 2025