Tensor Parallelism

cache-dit is also compatible with Tensor Parallelism. Currently, cache-dit supports the Hybrid Cache + Tensor Parallelism scheme via the NATIVE_PYTORCH parallelism backend. Tensor Parallelism can further speed up inference and reduce VRAM usage per GPU. For more details, please refer to 📚examples/parallelism. cache-dit currently supports tensor parallelism for FLUX.1, 🔥FLUX.2, Qwen-Image, Qwen-Image-Lightning, Wan2.1, Wan2.2, HunyuanImage-2.1, HunyuanVideo, VisualCloze, and others. Support for more models is planned.

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
  pipe_or_adapter,
  cache_config=DBCacheConfig(...),
  # Set tp_size > 1 to enable tensor parallelism.
  parallelism_config=ParallelismConfig(tp_size=2),
)
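
To give some intuition for why tp_size > 1 can lower per-GPU memory, here is a minimal conceptual sketch of one common sharding scheme: splitting a linear layer's weight column-wise across ranks. It is pure Python with simulated ranks, not cache-dit code. The helper names (`matmul`, `shard_columns`, `column_parallel_linear`) are hypothetical, and the sketch does not claim to match how the NATIVE_PYTORCH backend shards any particular layer.

```python
# Conceptual sketch of column-wise tensor parallelism for one linear layer.
# No GPUs involved: each "rank" is simulated by a slice of the weight matrix.
# All function names are hypothetical illustrations, not cache-dit APIs.

def matmul(x, w):
    """Multiply x (rows x k) by w (k x cols), both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def shard_columns(w, tp_size):
    """Split the weight's output columns into tp_size equal shards."""
    cols = len(w[0])
    assert cols % tp_size == 0, "output dim must be divisible by tp_size"
    step = cols // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]

def column_parallel_linear(x, w, tp_size):
    """Each simulated rank multiplies by its own shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in shard_columns(w, tp_size)]
    # Concatenating along the column axis stands in for an all-gather.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]                        # 2 tokens, hidden size 2
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]    # 2 x 4 weight matrix

full = matmul(x, w)
sharded = column_parallel_linear(x, w, tp_size=2)
params_per_rank = len(w) * len(w[0]) // 2           # each rank holds half the weights

print(full == sharded)     # True: sharded result matches the unsharded one
print(params_per_rank)     # 4 of the 8 weight entries per rank
```

Each simulated rank stores only its shard of the weight, which is where the per-GPU memory saving comes from. In a real multi-GPU run the concatenation step becomes inter-GPU communication, so the speedup is typically less than linear in tp_size.
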

| L20x1 | TP-2 | TP-4 | + compile |
|---|---|---|---|
| FLUX, 23.56s | 14.61s | 10.69s | 9.84s |
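
The latencies above can be turned into speedups with a few lines of arithmetic. This sketch assumes that the L20x1 figure is the single-GPU baseline and that all four timings measure the same FLUX workload. The dictionary keys simply reuse the column labels.

```python
# Speedup over the single-GPU baseline, using the FLUX latencies reported above.
# Assumption: "L20x1" is the baseline and all runs measure the same workload.
latencies_s = {
    "L20x1": 23.56,
    "TP-2": 14.61,
    "TP-4": 10.69,
    "+ compile": 9.84,
}

baseline = latencies_s["L20x1"]
speedups = {name: baseline / t for name, t in latencies_s.items()}

# Parallel efficiency = speedup / number of GPUs, for the plain TP runs only.
gpus = {"TP-2": 2, "TP-4": 4}
efficiency = {name: speedups[name] / n for name, n in gpus.items()}

for name, s in speedups.items():
    print(f"{name:>10}: {s:.2f}x")
for name, e in efficiency.items():
    print(f"{name:>10}: {e:.0%} parallel efficiency")
```

Under these assumptions TP-2 gives roughly a 1.61x speedup, TP-4 roughly 2.20x, and adding compile roughly 2.39x. Parallel efficiency drops from about 81% at TP-2 to about 55% at TP-4, which is the usual pattern as communication overhead grows with the number of ranks.
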

Please note that cache-dit already supports Hybrid Parallelism (CP/USP + TP) for the transformer module of 💥Large DiTs. Please refer to Hybrid 2D and 3D Parallelism for more details.