Use Yaml Config File

Cache-DiT now supports loading acceleration configs from a custom yaml file. Here are some examples.

Single GPU inference

Define a cache-only config file, cache.yaml, that contains:

DBCache + TaylorSeer

cache_config:
  max_warmup_steps: 8 
  warmup_interval: 2  
  max_cached_steps: -1
  max_continuous_cached_steps: 2  
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1

Then, apply the acceleration config from yaml.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("cache.yaml"))

DBCache + TaylorSeer + SCM (Step Computation Mask)

cache_config:
  max_warmup_steps: 8 
  warmup_interval: 2  
  max_cached_steps: -1
  max_continuous_cached_steps: 2  
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  # Must set the num_inference_steps for SCM. The SCM will automatically 
  # generate the steps computation mask based on the num_inference_steps.
  num_inference_steps: 28 
  steps_computation_mask: fast
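
The requirement stated in the comments above (SCM needs num_inference_steps) can be checked before the config is handed to cache-dit. The following is a minimal sketch, not part of cache-dit: check_scm_config is a hypothetical helper, and it assumes the cache_config section has already been parsed into a plain Python dict.

```python
def check_scm_config(cache_config: dict) -> None:
    """Hypothetical helper: raise if SCM is requested without num_inference_steps."""
    mask = cache_config.get("steps_computation_mask")
    steps = cache_config.get("num_inference_steps")
    if mask is not None and steps is None:
        raise ValueError(
            "steps_computation_mask is set, so num_inference_steps must be set too"
        )


# Mirrors the YAML above as a plain dict (only the SCM-related keys are shown).
scm_config = {"num_inference_steps": 28, "steps_computation_mask": "fast"}
check_scm_config(scm_config)  # passes silently
```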

DBCache + TaylorSeer + SCM (Step Computation Mask) + Cache CFG

cache_config:
  max_warmup_steps: 8 
  warmup_interval: 2  
  max_cached_steps: -1
  max_continuous_cached_steps: 2  
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  num_inference_steps: 28
  steps_computation_mask: fast
  enable_separate_cfg: true # e.g., Qwen-Image, Wan, Chroma, Ovis-Image, etc.

Distributed inference

1D Parallelism

Define a parallelism-only config file, parallel.yaml, that contains:

parallelism_config:
  ulysses_size: auto
  attention_backend: native

Then, apply the distributed inference acceleration config from yaml. ulysses_size: auto means that cache-dit will automatically use the detected world_size as the ulysses_size. Otherwise, set it manually to a specific integer, e.g., 4.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("parallel.yaml"))

2D Parallelism

You can also define a 2D parallelism config file, parallel_2d.yaml, that contains:

parallelism_config:
  ulysses_size: auto
  tp_size: 2
  attention_backend: native

Then, apply the 2D parallelism config from yaml. Here, tp_size: 2 enables tensor parallelism with size 2, and ulysses_size: auto means that cache-dit will use world_size // tp_size as the ulysses_size.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("parallel_2d.yaml"))
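
The auto rule described above for the 1D and 2D cases can be written out as a small function. This is only an illustrative sketch of the arithmetic, not cache-dit's actual code: resolve_ulysses_size is a hypothetical name, and the divisibility check is an assumption added for safety.

```python
def resolve_ulysses_size(world_size: int, ulysses_size="auto", tp_size: int = 1) -> int:
    """Hypothetical helper illustrating the `auto` rule; not cache-dit's code."""
    if ulysses_size != "auto":
        # An explicit value in the yaml is used as-is.
        return int(ulysses_size)
    if world_size % tp_size != 0:
        # Assumption: the ranks must split evenly across tensor-parallel groups.
        raise ValueError("world_size must be divisible by tp_size")
    return world_size // tp_size


print(resolve_ulysses_size(4))             # 1D on 4 GPUs: 4
print(resolve_ulysses_size(4, tp_size=2))  # 2D on 4 GPUs: 2
```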

3D Parallelism

You can also define a 3D parallelism config file, parallel_3d.yaml, that contains:

parallelism_config:
  ulysses_size: 2
  ring_size: 2
  tp_size: 2
  attention_backend: native

Then, apply the 3D parallelism config from yaml. Here, ulysses_size: 2, ring_size: 2, and tp_size: 2 mean using Ulysses parallelism with size 2, ring parallelism with size 2, and tensor parallelism with size 2.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("parallel_3d.yaml"))
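
With explicit sizes, the three degrees multiply to the number of processes to launch: 2 x 2 x 2 = 8, which matches the --nproc_per_node=8 used for parallel_3d.yaml in Quick Examples. A minimal sketch of that check follows; expected_world_size is a hypothetical helper, and treating an unset degree as 1 is an assumption.

```python
import math


def expected_world_size(parallelism_config: dict) -> int:
    """Hypothetical helper: product of the explicit parallel degrees."""
    sizes = []
    for key in ("ulysses_size", "ring_size", "tp_size"):
        value = parallelism_config.get(key, 1)  # assumption: unset means degree 1
        if not isinstance(value, int):
            # "auto" cannot be multiplied; it has to be resolved first.
            raise TypeError(f"{key} must be an explicit int here, got {value!r}")
        sizes.append(value)
    return math.prod(sizes)


config_3d = {"ulysses_size": 2, "ring_size": 2, "tp_size": 2, "attention_backend": "native"}
print(expected_world_size(config_3d))  # 8
```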

Ulysses Anything Attention

To enable Ulysses Anything Attention, define a parallelism config file, parallel_uaa.yaml, that contains:

parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_anything: true

Then, apply the config from yaml. Here ulysses_anything: true means enabling Ulysses Anything Attention.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("parallel_uaa.yaml"))

Ulysses FP8 Communication

For devices without NVLink support, you can enable Ulysses FP8 Communication to further reduce the communication overhead. Define a parallelism config file, parallel_fp8.yaml, that contains:

parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_float8: true

Async Ulysses CP

You can also enable async Ulysses CP to overlap communication with computation. Define a parallelism config file, parallel_async.yaml, that contains:

parallelism_config:
  ulysses_size: auto
  attention_backend: native
  # Currently supported only for FLUX.1, FLUX.2, Qwen-Image, Ovis-Image,
  # Z-Image and LongCat-Image. More models will be added in the future.
  ulysses_async: true

Then, apply the config from yaml. Here, ulysses_async: true enables async Ulysses CP.

TE-P and VAE-P

You can also specify extra parallel modules in the yaml config. For example, define a parallelism config file, parallel_extra.yaml, that contains:

parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]

Hybrid Cache and Parallelism

Define a hybrid cache and parallelism config file, hybrid.yaml, that contains:

cache_config:
  max_warmup_steps: 8 
  warmup_interval: 2  
  max_cached_steps: -1
  max_continuous_cached_steps: 2  
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]

Then, apply the hybrid cache and parallel acceleration config from yaml.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("hybrid.yaml"))

Attention Backend

In some cases, you may want to specify only the attention backend, without any other optimization configs. In that case, define a yaml file attention.yaml that contains only:

attention_backend: "flash" # _flash_3 for Hopper

Then, apply the attention backend config from yaml.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("attention.yaml"))

Quantization

You can also specify the quantization config in the yaml file. For example, define a yaml file quantize.yaml that contains:

Quantize Transformer

quantize_config: # quantization configuration for transformer modules
  # float8_per_row, float8_per_tensor, float8_weight_only, float8_blockwise, int8_per_row, int8_per_tensor, int8_weight_only, int4_weight_only, etc.
  quant_type: "float8_per_row" 
  # Layers to exclude from quantization (transformer). Layers that contain any of the
  # keywords in the exclude_layers list will be excluded from quantization. This is useful
  # for sensitive layers that are not robust to quantization, e.g., embedding layers.
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization

Then, apply the quantization config from yaml.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("quantize.yaml"))
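
The exclude_layers rule described in the config comments (a layer is skipped if it contains any listed keyword) amounts to a substring match. The sketch below illustrates that rule only; is_excluded is a hypothetical helper, the layer names are made-up examples, and matching against the layer's name is an assumption about how cache-dit applies the keywords.

```python
def is_excluded(layer_name: str, exclude_layers: list[str]) -> bool:
    """Hypothetical helper: True if any keyword occurs in the layer name."""
    return any(keyword in layer_name for keyword in exclude_layers)


exclude_layers = ["embedder", "embed"]
# Made-up layer names, for illustration only.
layer_names = [
    "x_embedder",
    "time_text_embed.linear_1",
    "transformer_blocks.0.attn.to_q",
]
to_quantize = [name for name in layer_names if not is_excluded(name, exclude_layers)]
print(to_quantize)  # ['transformer_blocks.0.attn.to_q']
```

Under plain substring matching, "embed" already covers every name that "embedder" would match, so listing both is harmless but redundant.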

Please also enable torch.compile for better performance if you are using quantization.

import torch
import cache_dit
cache_dit.set_compile_configs()
pipe.transformer = torch.compile(pipe.transformer)

For SVDQuant W4A4 DQ workflow, you can define a yaml file quantize_svdq.yaml that contains:

# Please install Cache-DiT with SVDQuant support (Experimental) before using the 
# SVDQuant quantization config. Installation instructions:
# `uv pip install -U cache-dit-cu13` or `CACHE_DIT_BUILD_SVDQUANT=1 uv pip install -e ".[quantization]"`
quantize_config: 
  quant_type: "svdq_int4_r128_dq" # or "svdq_nvfp4_r128_dq" for Blackwell (sm120, sm121, etc.)
  svdq_kwargs:
    smooth_strategy: "few_shot"
    few_shot_steps: 2
    few_shot_auto_compile: true
    # Device used for SVD decomposition and W4A4 packing math.
    # - "cuda": force CUDA-side SVD + packing, even when float weights are on CPU.
    # - "cpu":  force CPU-side SVD + packing (slow, for low-memory GPUs).
    # - "auto": follow the module's current device (may resolve to CPU if the
    #   pipeline has not been moved to CUDA yet, e.g. when loaded via config).
    #
    # IMPORTANT: `few_shot_auto_compile: true` REQUIRES `quantize_device: "cuda"`.
    # Using "auto" or "cpu" together with `few_shot_auto_compile` is currently
    # NOT supported: when SVD runs on CPU the newly materialised W4A4 weights
    # stay on CPU, and while cache-dit attempts to move them to CUDA before
    # compiling, the forward-pass inputs may still be on CPU, causing the SVDQ
    # W4A4 CUDA kernel to assert.  If you need CPU-side SVD for low-memory
    # scenarios, set `few_shot_auto_compile: false`, then compile manually after
    # moving the pipeline to CUDA.
    quantize_device: "cuda"
  exclude_layers:  
    - "embedder"
    - "embed"
  verbose: false
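
The IMPORTANT note in the comments above (few_shot_auto_compile: true requires quantize_device: "cuda") can be enforced with a small pre-flight check. This is a sketch under stated assumptions: check_svdq_kwargs is a hypothetical helper, it takes the svdq_kwargs section as a plain dict, and it makes no assumption about cache-dit's defaults, so a missing quantize_device is treated as not being "cuda".

```python
def check_svdq_kwargs(svdq_kwargs: dict) -> None:
    """Hypothetical helper: reject few_shot_auto_compile without quantize_device 'cuda'."""
    auto_compile = bool(svdq_kwargs.get("few_shot_auto_compile"))
    device = svdq_kwargs.get("quantize_device")  # None when not set explicitly
    if auto_compile and device != "cuda":
        raise ValueError(
            "few_shot_auto_compile: true requires quantize_device: 'cuda', "
            f"got {device!r}"
        )


# Mirrors the svdq_kwargs section of the YAML above.
check_svdq_kwargs(
    {
        "smooth_strategy": "few_shot",
        "few_shot_steps": 2,
        "few_shot_auto_compile": True,
        "quantize_device": "cuda",
    }
)  # passes silently
```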

Fine-grained Quantization

You can also specify a separate quantization config for each component (via components_to_quantize). For example, define a yaml file quantize_extra.yaml that contains:

quantize_config: 
  components_to_quantize:
    transformer:
      quant_type: "float8_per_row"
      exclude_layers:  
        - "embedder"
        - "embed"
    # e.g., the special case of FLUX.1 w/ T5EncoderModel. Note that in most cases you
    # should use 'text_encoder' instead of 'text_encoder_2'; 'text_encoder_2' is only
    # used when the pipeline has two text encoders and you only want to quantize the
    # second one.
    text_encoder_2:
      quant_type: "float8_weight_only"
      exclude_layers:  
        - "shared" 
        - "embed_tokens"
  verbose: false
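
Because the choice between text_encoder and text_encoder_2 depends on the pipeline, it can help to confirm that every key under components_to_quantize names a component the pipeline actually has. The sketch below is illustrative only: missing_components is a hypothetical helper, it assumes the keys correspond to attribute names on the pipeline object (as with pipe.transformer), and a SimpleNamespace stands in for a real pipeline.

```python
from types import SimpleNamespace


def missing_components(pipe, components_to_quantize: dict) -> list[str]:
    """Hypothetical helper: config keys with no matching component on the pipeline."""
    return [
        name
        for name in components_to_quantize
        if getattr(pipe, name, None) is None
    ]


# Stand-in for a two-text-encoder pipeline; a real pipeline object would be used instead.
fake_pipe = SimpleNamespace(
    transformer=object(), text_encoder=object(), text_encoder_2=object(), vae=object()
)
components = {
    "transformer": {"quant_type": "float8_per_row"},
    "text_encoder_2": {"quant_type": "float8_weight_only"},
}
print(missing_components(fake_pipe, components))  # []
```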

Combined Configs: Cache + Parallelism + Quantization

You can also combine all the above configs together in a single yaml file combined.yaml that contains:

cache_config:
  max_warmup_steps: 8 
  warmup_interval: 2  
  max_cached_steps: -1
  max_continuous_cached_steps: 2  
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
quantize_config: 
  quant_type: "float8_per_row" 
  exclude_layers: 
    - "embedder"
    - "embed"
  verbose: false

Then, apply the combined cache, parallelism and quantization config from yaml.

>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("combined.yaml"))

Please also enable torch.compile for better performance if you are using quantization.

import torch
pipe.transformer = torch.compile(pipe.transformer)

Quick Examples

pip install -U uv # use uv for faster installation
uv pip install torch==2.11.0 torchvision torchaudio triton \
  transformers diffusers accelerate torchao opencv-python-headless \
  einops imageio-ffmpeg ftfy numpy
uv pip install -U cache-dit # stable release from PyPI.
git clone https://github.com/vipshop/cache-dit.git 
cd cache-dit/examples/configs # Preset yaml configs for quick test.

python3 -m cache_dit.generate flux --config cache.yaml
python3 -m cache_dit.generate flux --config quantize.yaml --compile
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config hybrid.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config parallel.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config parallel_2d.yaml
torchrun --nproc_per_node=8 -m cache_dit.generate flux --config parallel_3d.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config parallel_usp.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config combined.yaml --compile