DBCache Design

⚡️DBCache: Dual Block Cache¶

DBCache: Dual Block Caching for Diffusion Transformers. We have enhanced FBCache into a more general and customizable cache algorithm, namely DBCache, enabling it to achieve fully UNet-style cache acceleration for DiT models. Different configurations of compute blocks (F8B12, etc.) can be customized in DBCache. Moreover, it can be entirely training-free. DBCache can strike a perfect balance between performance and precision!

DBCache, L20x1 , Steps: 28, "A cat holding a sign that says hello world with complex background"

Baseline(L20x1)	F1B0 (0.08)	F1B0 (0.20)	F8B8 (0.15)	F12B12 (0.20)
24.85s	15.59s	8.58s	15.41s	15.11s

Baseline(L20x1)	F1B0 (0.08)	F8B8 (0.12)	F8B12 (0.12)	F8B16 (0.20)
27.85s	6.04s	5.88s	5.77s	6.01s

DBCache, L20x4 , Steps: 20, case to show the texture recovery ability of DBCache

These case studies demonstrate that even with relatively high thresholds (such as 0.12, 0.15, 0.2, etc.) under the DBCache F12B12 or F8B16 configuration, the detailed texture of the kitten's fur, colored cloth, and the clarity of text can still be preserved. This suggests that users can leverage DBCache to effectively balance performance and precision in their workflows!

DBCache provides configurable parameters for custom optimization, enabling a balanced trade-off between performance and precision:

Fn: Specifies that DBCache uses the first n Transformer blocks to fit the information at time step t, enabling the calculation of a more stable L1 diff and delivering more accurate information to subsequent blocks.
Bn: Further fuses approximate information in the last n Transformer blocks to enhance prediction accuracy. These blocks act as an auto-scaler for approximate hidden states that use residual cache.

max_warmup_steps: (default: 0) DBCache does not apply the caching strategy when the number of running steps is less than or equal to this value, ensuring the model sufficiently learns basic features during warmup.
max_cached_steps: (default: -1) DBCache disables the caching strategy when the previous cached steps exceed this value to prevent precision degradation.
residual_diff_threshold: The value of residual diff threshold, a higher value leads to faster performance at the cost of lower precision.

For a good balance between performance and precision, DBCache is configured by default with F8B0, 8 warmup steps, and unlimited cached steps.

import cache_dit
from diffusers import FluxPipeline

pipe_or_adapter = FluxPipeline.from_pretrained(
  "black-forest-labs/FLUX.1-dev",
  torch_dtype=torch.bfloat16,
).to("cuda")

# Default options, F8B0, 8 warmup steps, and unlimited cached 
# steps for good balance between performance and precision
cache_dit.enable_cache(pipe_or_adapter)

# Custom options, F8B8, higher precision
from cache_dit import DBCacheConfig

cache_dit.enable_cache(
  pipe_or_adapter,
  cache_config=DBCacheConfig(
    max_warmup_steps=8,  # steps do not cache
    max_cached_steps=-1, # -1 means no limit
    Fn_compute_blocks=8, # Fn, F8, etc.
    Bn_compute_blocks=8, # Bn, B8, etc.
    residual_diff_threshold=0.12,
  ),
)

⚡️Hybrid Cache CFG¶

cache-dit supports caching for CFG (classifier-free guidance). For models that fuse CFG and non-CFG into a single forward step, or models that do not include CFG (classifier-free guidance) in the forward step, please set enable_separate_cfg param to False (default). Otherwise, set it to True. For examples:

from cache_dit import DBCacheConfig

cache_dit.enable_cache(
  pipe_or_adapter, 
  cache_config=DBCacheConfig(
    ...,
    # CFG: classifier free guidance or not
    # For model that fused CFG and non-CFG into single forward step,
    # should set enable_separate_cfg as False. For example, set it as True 
    # for Wan 2.1/Qwen-Image and set it as False for FLUX.1, HunyuanVideo, 
    # CogVideoX, Mochi, LTXVideo, Allegro, CogView3Plus, EasyAnimate, SD3, etc.
    enable_separate_cfg=True, # Wan 2.1, Qwen-Image, CogView4, Cosmos, SkyReelsV2, etc.
    # Compute cfg forward first or not, default False, namely, 
    # 0, 2, 4, ..., -> non-CFG step; 1, 3, 5, ... -> CFG step.
    cfg_compute_first=False,
    # Compute separate diff values for CFG and non-CFG step, 
    # default True. If False, we will use the computed diff from 
    # current non-CFG transformer step for current CFG step.
    cfg_diff_compute_separate=True,
  ),
)