Use YAML Config File
Cache-DiT now supports loading acceleration configs from a custom YAML file. Here are some examples.
Single GPU inference
Define a cache-only config file, cache.yaml, that contains one of the following:
- DBCache + TaylorSeer
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
- DBCache + TaylorSeer + SCM (Step Computation Mask)
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  # Must set the num_inference_steps for SCM. The SCM will automatically
  # generate the steps computation mask based on the num_inference_steps.
  num_inference_steps: 28
  steps_computation_mask: fast
- DBCache + TaylorSeer + SCM (Step Computation Mask) + Cache CFG
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  num_inference_steps: 28
  steps_computation_mask: fast
  enable_separate_cfg: true # e.g., Qwen-Image, Wan, Chroma, Ovis-Image, etc.
Distributed inference
- 1D Parallelism
Define a parallelism-only config file, parallel.yaml, that sets ulysses_size under parallelism_config.
With ulysses_size: auto, cache-dit detects the world_size and uses it as the ulysses_size. Otherwise, set it manually to a specific integer, e.g., 4.
- 2D Parallelism
You can also define a 2D parallelism config file, parallel_2d.yaml, that sets both tp_size and ulysses_size.
tp_size: 2 enables tensor parallelism with size 2. With ulysses_size: auto, cache-dit uses world_size // tp_size as the ulysses_size.
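The auto rule described for 1D and 2D parallelism can be sketched in a few lines of Python. This is an illustration only: resolve_ulysses_size is a hypothetical helper, not part of the cache-dit API, and the divisibility check is an assumption rather than documented behavior.

```python
def resolve_ulysses_size(ulysses_size, world_size, tp_size=1):
    """Hypothetical helper mirroring the documented `auto` rule (not cache-dit API)."""
    if ulysses_size != "auto":
        # An explicit integer (e.g. 4) is used as-is.
        return int(ulysses_size)
    if world_size % tp_size != 0:
        # Assumption: the tensor-parallel size must divide the world size evenly.
        raise ValueError(
            f"world_size={world_size} is not divisible by tp_size={tp_size}"
        )
    # 1D case: tp_size defaults to 1, so the result is simply world_size.
    return world_size // tp_size


print(resolve_ulysses_size("auto", world_size=4))             # 1D on 4 GPUs: 4
print(resolve_ulysses_size("auto", world_size=4, tp_size=2))  # 2D on 4 GPUs: 2
```

With 4 processes and tp_size: 2, the sketch yields a Ulysses group of size 2, which matches the world_size // tp_size rule stated above.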
- 3D Parallelism
You can also define a 3D parallelism config file, parallel_3d.yaml, that sets ulysses_size, ring_size and tp_size.
ulysses_size: 2, ring_size: 2, tp_size: 2 enables Ulysses parallelism, ring parallelism and tensor parallelism, each with size 2.
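A quick way to sanity-check a 3D layout before launching is to multiply the three sizes. The sketch below assumes the number of processes must equal the product of the sizes, which is consistent with the 8-process torchrun command used for parallel_3d.yaml in the Quick Examples but is not stated explicitly in this guide. required_world_size is a hypothetical helper, not a cache-dit function.

```python
import math


def required_world_size(ulysses_size=1, ring_size=1, tp_size=1):
    """Hypothetical helper: processes needed for a given parallel layout.

    Assumption: each parallel dimension multiplies the process count.
    """
    return math.prod((ulysses_size, ring_size, tp_size))


# The 3D example above: 2 x 2 x 2 processes.
print(required_world_size(ulysses_size=2, ring_size=2, tp_size=2))  # 8
```

The result is the value to pass to torchrun via --nproc_per_node on a single node.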
- Ulysses Anything Attention
To enable Ulysses Anything Attention, define a parallelism config file, parallel_uaa.yaml, that sets ulysses_anything: true, then load it:
>>> import cache_dit
>>> cache_dit.enable_cache(pipe, **cache_dit.load_configs("parallel_uaa.yaml"))
- Ulysses FP8 Communication
For devices without NVLink support, you can enable Ulysses FP8 Communication to further reduce communication overhead. Enable it in a parallelism config file such as parallel_fp8.yaml.
- Async Ulysses CP
You can also enable async Ulysses CP to overlap communication and computation. Define a parallelism config file, parallel_async.yaml, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  # Currently supported only for FLUX.1, FLUX.2, Qwen-Image, Ovis-Image,
  # Z-Image and LongCat-Image. More models will be added in the future.
  ulysses_async: true
ulysses_async: true enables async Ulysses CP.
- TE-P and VAE-P
You can also specify extra parallel modules in the YAML config. For example, define a parallelism config file, parallel_extra.yaml, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
Hybrid Cache and Parallelism
Define a hybrid cache and parallelism config file, hybrid.yaml, that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
Attention Backend
In some cases, you may want to specify only the attention backend, without any other optimization configs. In that case, define a YAML file, attention.yaml, that contains only the attention backend setting.
Quantization
You can also specify the quantization config in the YAML file. For example, define a file, quantize.yaml, that contains:
- quantize transformer
quantize_config: # quantization configuration for transformer modules
  # float8_per_row, float8_per_tensor, float8_weight_only, float8_blockwise, int8_per_row, int8_per_tensor, int8_weight_only, int4_weight_only, etc.
  quant_type: "float8_per_row"
  # Layers to exclude from quantization (transformer). Any layer whose name contains one
  # of the keywords in the exclude_layers list is excluded from quantization. This is useful
  # for sensitive layers that are not robust to quantization, e.g., embedding layers.
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization
Please also enable torch.compile for better performance if you are using quantization.
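The keyword rule for exclude_layers can be illustrated with a small Python sketch. It assumes plain substring matching on the layer name, which is one reading of "contains any of the keywords"; the actual matching inside cache-dit may differ. is_excluded and the layer names below are hypothetical and chosen only for illustration.

```python
def is_excluded(layer_name, exclude_layers):
    """Return True if the layer name contains any exclusion keyword.

    Hypothetical helper; assumes simple substring matching.
    """
    return any(keyword in layer_name for keyword in exclude_layers)


exclude_layers = ["embedder", "embed"]

# Illustrative module names only; real names depend on the model.
layer_names = [
    "x_embedder.proj",
    "time_text_embed.linear_1",
    "transformer_blocks.0.attn.to_q",
]

for name in layer_names:
    status = "skip" if is_excluded(name, exclude_layers) else "quantize"
    print(f"{name}: {status}")
```

Under substring matching, the keyword "embed" already covers "embedder"; the sketch keeps both entries only to mirror the config above.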
For the SVDQuant W4A4 DQ workflow, define a YAML file, quantize_svdq.yaml, that contains:
# Please install Cache-DiT with SVDQuant support (Experimental) before using the
# SVDQuant quantization config. Installation instructions:
# `uv pip install -U cache-dit-cu13` or `CACHE_DIT_BUILD_SVDQUANT=1 uv pip install -e ".[quantization]"`
quantize_config:
  quant_type: "svdq_int4_r128_dq" # or "svdq_nvfp4_r128_dq" for blackwell (sm120, sm121, etc.)
  svdq_kwargs:
    smooth_strategy: "few_shot"
    few_shot_steps: 2
    few_shot_auto_compile: true
    # Device used for SVD decomposition and W4A4 packing math.
    # - "cuda": force CUDA-side SVD + packing, even when float weights are on CPU.
    # - "cpu": force CPU-side SVD + packing (slow, for low-memory GPUs).
    # - "auto": follow the module's current device (may resolve to CPU if the
    #   pipeline has not been moved to CUDA yet, e.g. when loaded via config).
    #
    # IMPORTANT: `few_shot_auto_compile: true` REQUIRES `quantize_device: "cuda"`.
    # Using "auto" or "cpu" together with `few_shot_auto_compile` is currently
    # NOT supported: when SVD runs on CPU the newly materialised W4A4 weights
    # stay on CPU, and while cache-dit attempts to move them to CUDA before
    # compiling, the forward-pass inputs may still be on CPU, causing the SVDQ
    # W4A4 CUDA kernel to assert. If you need CPU-side SVD for low-memory
    # scenarios, set `few_shot_auto_compile: false`, then compile manually after
    # moving the pipeline to CUDA.
    quantize_device: "cuda"
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false
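The constraint spelled out in the config comments (few_shot_auto_compile: true requires quantize_device: "cuda") can be checked before launching a run. The following is a minimal sketch of such a pre-flight check; check_svdq_settings is a hypothetical helper, not part of cache-dit, and it takes the two values directly so that it makes no assumption about where they sit in the YAML hierarchy.

```python
def check_svdq_settings(few_shot_auto_compile, quantize_device):
    """Hypothetical pre-flight check for the rule stated in the config comments."""
    allowed_devices = ("cuda", "cpu", "auto")
    if quantize_device not in allowed_devices:
        raise ValueError(
            f"quantize_device must be one of {allowed_devices}, "
            f"got {quantize_device!r}"
        )
    if few_shot_auto_compile and quantize_device != "cuda":
        # Per the comments: "auto" or "cpu" with auto-compile is not supported.
        raise ValueError(
            'few_shot_auto_compile: true requires quantize_device: "cuda"; '
            "for CPU-side SVD set few_shot_auto_compile: false and compile "
            "manually after moving the pipeline to CUDA"
        )
    return True


# The combination used in quantize_svdq.yaml passes the check.
print(check_svdq_settings(few_shot_auto_compile=True, quantize_device="cuda"))

# CPU-side SVD is acceptable once auto-compile is turned off.
print(check_svdq_settings(few_shot_auto_compile=False, quantize_device="cpu"))
```

Passing few_shot_auto_compile=True together with "auto" or "cpu" raises a ValueError in this sketch, mirroring the unsupported combinations described above.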
- fine-grained quantization
You can also specify a separate quantization config for each component (via components_to_quantize). For example, define a YAML file, quantize_extra.yaml, that contains:
quantize_config:
  components_to_quantize:
    transformer:
      quant_type: "float8_per_row"
      exclude_layers:
        - "embedder"
        - "embed"
    # e.g, specified case for FLUX.1 w/ T5EncoderModel. Please note that we should
    # use 'text_encoder' instead of 'text_encoder_2' in most cases, and 'text_encoder_2'
    # is only used when there are two text encoders in the pipeline and we only want
    # to quantize the second one.
    text_encoder_2:
      quant_type: "float8_weight_only"
      exclude_layers:
        - "shared"
        - "embed_tokens"
  verbose: false
Combined Configs: Cache + Parallelism + Quantization
You can also combine all of the above configs in a single YAML file, combined.yaml, that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
quantize_config:
  quant_type: "float8_per_row"
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false
Quick Examples
pip install -U uv # use uv for faster installation
uv pip install torch==2.11.0 torchvision torchaudio triton \
  transformers diffusers accelerate torchao opencv-python-headless \
  einops imageio-ffmpeg ftfy numpy
uv pip install -U cache-dit # stable release from PyPI.
git clone https://github.com/vipshop/cache-dit.git
cd cache-dit/examples/configs # Preset yaml configs for quick test.
python3 -m cache_dit.generate flux --config cache.yaml
python3 -m cache_dit.generate flux --config quantize.yaml --compile
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config hybrid.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config parallel.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config parallel_2d.yaml
torchrun --nproc_per_node=8 -m cache_dit.generate flux --config parallel_3d.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config parallel_usp.yaml
torchrun --nproc_per_node=4 -m cache_dit.generate flux --config combined.yaml --compile