HierarchicalPrune

Position-Aware Compression for Large-Scale Diffusion Models

AAAI 2026

¹Samsung AI Center-Cambridge · ²Independent Researcher
*Equal contribution

High-resolution image samples generated by our compressed model using HierarchicalPrune, showcasing superior visual quality across diverse visual styles, precise adherence to text prompts, and a preserved ability to render typography.

About HierarchicalPrune

Recent large-scale diffusion models such as SD3.5 (8B) and FLUX (11B) deliver outstanding image quality, but their excessive memory and compute demands limit deployment on resource-constrained devices. Existing depth-pruning methods achieve reasonable compression on smaller U-Net-based models, yet fail to scale to these large MMDiT-based architectures without significant quality degradation.

HierarchicalPrune identifies a novel dual hierarchical structure in MMDiT-based diffusion models: an inter-block hierarchy (early blocks establish semantics, later blocks handle refinements) and an intra-block hierarchy (varying importance of subcomponents within each block). It exploits this structure through three principled techniques:

  • Hierarchical Position Pruning (HPP): strategically maintains early blocks that form core image structures while pruning later, less critical blocks
  • Positional Weight Preservation (PWP): freezes non-pruned early portions during distillation, preserving blocks essential for image formation
  • Sensitivity-Guided Distillation (SGDistill): applies inverse distillation weights, assigning minimal updates to the most important (and sensitive) blocks while concentrating learning on less sensitive components
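The bookkeeping behind HPP, PWP, and SGDistill can be illustrated with a minimal sketch. This is not the authors' code: the block count, `keep_ratio`, `freeze_ratio`, and the per-block `sensitivity` scores are hypothetical placeholders, and the inverse-sensitivity weighting is one plausible realisation of "minimal updates to the most sensitive blocks".

```python
def hierarchical_prune(num_blocks, keep_ratio):
    """HPP (sketch): keep the early blocks that establish core image
    structure and prune the later refinement blocks."""
    keep = int(round(num_blocks * keep_ratio))
    return list(range(keep))  # indices of surviving (early) blocks


def distill_weights(sensitivity, freeze_ratio):
    """PWP + SGDistill (sketch): freeze the most position-critical early
    blocks (weight 0) and assign the remaining blocks normalised
    inverse-sensitivity weights, so the least sensitive components
    receive the largest distillation updates."""
    n = len(sensitivity)
    frozen = int(round(n * freeze_ratio))
    total = sum(1.0 / s for s in sensitivity[frozen:])
    return [0.0] * frozen + [(1.0 / s) / total for s in sensitivity[frozen:]]


# Toy usage: a hypothetical 38-block MMDiT pruned to its first half,
# then distilled with the earliest two surviving blocks frozen.
kept = hierarchical_prune(num_blocks=38, keep_ratio=0.5)
w = distill_weights(sensitivity=[5.0, 4.0, 2.0, 1.0], freeze_ratio=0.5)
```

The frozen blocks get zero weight, and the remaining weights sum to one, with the least sensitive block receiving the largest share.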

Combined with INT4 quantisation, HierarchicalPrune achieves 77.5–80.4% memory reduction with minimal quality loss: 4.8–5.3% user-perceived degradation relative to the original model (95% confidence intervals), significantly outperforming prior methods (11.1–52.2% degradation). A user study with 85 participants confirms the superiority of our approach.

Figure 1: HierarchicalPrune Teaser Figure

Method Overview

HierarchicalPrune’s compression framework leverages MMDiT’s two-fold hierarchy (inter-block: early blocks establish semantics, later blocks refine; intra-block: varying subcomponent importance). It comprises (1) HPP, maintaining early blocks while pruning later ones, (2) PWP, freezing critical early blocks during distillation, and (3) SGDistill, applying inverse weights: minimal updates to sensitive blocks and subcomponents. The resulting framework enables effective compression while preserving model capabilities.
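The SGDistill objective can be pictured as a weighted feature-matching loss between teacher and student blocks. The sketch below is an assumed form, not the paper's exact loss: per-block mean-squared feature errors are combined with the inverse-sensitivity weights described above, concentrating learning on the least sensitive components.

```python
def sgdistill_loss(student_feats, teacher_feats, weights):
    """Sketch of a sensitivity-weighted distillation objective:
    a weighted sum of per-block mean-squared feature errors.
    Frozen blocks (weight 0) contribute nothing to the gradient."""
    loss = 0.0
    for s, t, w in zip(student_feats, teacher_feats, weights):
        mse = sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(s)
        loss += w * mse
    return loss


# Toy usage with two blocks of 2-d features and illustrative weights.
loss = sgdistill_loss(
    student_feats=[[1.0, 1.0], [2.0, 2.0]],
    teacher_feats=[[0.0, 0.0], [0.0, 0.0]],
    weights=[0.25, 0.75],
)
```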

Figure 2: HierarchicalPrune Method Overview

Contribution Analysis

We perform fine-grained contribution analysis on SD3.5 Large Turbo to understand the importance of each transformer block and its subcomponents. Beyond conventional full-block removal, we analyse individual and joint subcomponent removal, uncovering an intra-block hierarchy: different subcomponent types exhibit distinct sensitivity patterns, and their pairwise interactions reveal both critical interdependencies and potentially redundant pathways within the MMDiT architecture.
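The analysis loop itself is simple to state: ablate one block (or one subcomponent, or a pair) at a time and record the metric drop against the full model. The sketch below is illustrative only; `evaluate`, the component names, and the toy scores are placeholders, not the paper's evaluation API or HPSv2 numbers.

```python
def contribution_scores(components, evaluate):
    """Ablate each component individually and return its contribution,
    measured as the metric drop versus the unablated baseline."""
    baseline = evaluate(disabled=frozenset())
    return {c: baseline - evaluate(disabled=frozenset([c]))
            for c in components}


# Toy evaluator in which earlier components matter more, mimicking the
# inter-block hierarchy the analysis uncovers.
scores = {"block0.attn": 3.0, "block0.mlp": 2.0, "block1.mlp": 0.5}

def toy_eval(disabled):
    return 10.0 - sum(scores[c] for c in disabled)

drops = contribution_scores(scores, toy_eval)
```

Joint (pairwise) removal, as in Figure 4, follows the same pattern with two-element `disabled` sets, exposing interactions that single-component ablation misses.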

Figure 3: HierarchicalPrune Contribution Analysis 1
Figure 3. Fine-grained contribution analysis of SD3.5 Large Turbo on the HPSv2 dataset by individually removing either an entire MMDiT block or an intra-block subcomponent. We report the performance drop compared to the original model. (a) An entire MMDiT block is removed following prior depth-pruning approaches. (b,c,d,e,f) Each subcomponent type of an MMDiT block is removed independently, revealing different patterns of importance across subcomponents.
Figure 4: HierarchicalPrune Contribution Analysis 2
Figure 4. Fine-grained contribution analysis of SD3.5 Large Turbo on the HPSv2 dataset by jointly removing pairs of subcomponent types. The performance drop follows distinct interaction patterns for different pairs, with most combinations leading to significant degradation concentrated in earlier blocks. A notable exception is Context Norm + Context MLP (d), which demonstrates minimal impact.

Visual Comparisons

Side-by-side comparison at 1024×1024 resolution. * Different architecture, shown for reference.

Models: SD3.5 Large (original) · BK-SDM · KOALA · Ours · SANA-Sprint* (different architecture)

Prompts:
  • “A painting of a Persian cat dressed as a Renaissance king, standing on a skyscraper overlooking a city.”
  • “A kangaroo in an orange hoodie and blue sunglasses stands on the grass in front of the Sydney Opera House”
  • “A digital illustration of a beautiful and alluring American SWAT team in dramatic poses”
  • “Male character illustration by Gaston Bussiere.”
  • “A close-up portrait of a beautiful girl with an autumn leaves headdress and melting wax.”
  • “A smiling man is cooking in his kitchen.”

BibTeX

@inproceedings{kwon2026hierarchicalprune,
    title     = {{HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models}},
    author    = {Kwon, Young D. and Li, Rui and Li, Sijia and Li, Da and Bhattacharya, Sourav and Venieris, Stylianos I.},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
    year      = {2026},
}