AHFAControl: Adaptive Hierarchical Feature Aggregation for Controllable Diffusion Models
Published in 2026 IEEE International Conference on Multimedia and Expo (ICME), Oral, 2026
Abstract: Controllable diffusion models introduce trainable condition encoders that accept conditioning images as input to enable spatial control in image generation. However, existing methods still face significant challenges in generation quality and control precision. We identify two key limitations: (1) full-capacity condition encoders are applied uniformly across all denoising steps, ignoring the distinct capacity requirements of different steps; (2) these condition encoders lack multi-level feature communication, causing the loss of crucial structural cues. To address these issues, we propose AHFAControl, which automatically allocates condition encoder capacity across timesteps at runtime, with the optimal allocation strategy identified via evolutionary search. We also design a hierarchical feature aggregation mechanism that fuses multi-level features from the condition encoding blocks before injecting them into the diffusion backbone, providing richer conditional guidance for more precise spatial control. Extensive experiments demonstrate that our method achieves significant improvements in both generation quality and control precision, at reduced computational cost, compared with other competitive methods.
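As a rough illustration of what an evolutionary search over per-timestep encoder capacity could look like, the following is a minimal sketch and not the paper's implementation. The names `evolve_allocation` and `toy_fitness`, the encoding of an allocation as a list of active block counts, and the fitness function itself are all assumptions made for this example; the abstract does not specify the search space, the genetic operators, or the objective.

```python
import random


def evolve_allocation(fitness, num_steps, max_blocks,
                      pop_size=20, generations=30,
                      mutation_rate=0.2, seed=0):
    """Toy evolutionary search (hypothetical, for illustration only).

    A candidate is a list with one entry per denoising step, giving the
    number of active condition-encoder blocks (1..max_blocks) at that step.
    Requires num_steps >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.randint(1, max_blocks) for _ in range(num_steps)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, num_steps)      # one-point crossover
            child = a[:cut] + b[cut:]
            # Per-gene mutation: resample the block count at random.
            child = [rng.randint(1, max_blocks)
                     if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(allocation, max_blocks=4, cost_weight=0.5):
    """Hypothetical stand-in for a real quality/cost evaluation.

    Assumption: earlier steps benefit more from encoder capacity, and
    every active block adds a fixed compute cost. A real objective would
    be measured on generated images instead.
    """
    n = len(allocation)
    quality = sum((1.0 - step / n) * blocks / max_blocks
                  for step, blocks in enumerate(allocation)) / n
    cost = sum(allocation) / (n * max_blocks)
    return quality - cost_weight * cost


best = evolve_allocation(toy_fitness, num_steps=10, max_blocks=4)
print("blocks per step:", best)
print("fitness:", toy_fitness(best))
```

Because selection is elitist, the returned allocation is never worse under the given fitness than the best candidate in the initial random population. In practice the fitness would be replaced by an evaluation of generation quality and control precision under a compute budget.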
