Tensor Parallelism

cache-dit is also compatible with Tensor Parallelism. Currently, the Hybrid Cache + Tensor Parallelism scheme is supported via the NATIVE_PYTORCH parallelism backend. Tensor Parallelism can further speed up inference and reduce VRAM usage per GPU. For more details, please refer to 📚examples/parallelism. cache-dit currently supports Tensor Parallelism for FLUX.1, 🔥FLUX.2, Qwen-Image, Qwen-Image-Lightning, Wan2.1, Wan2.2, HunyuanImage-2.1, HunyuanVideo, VisualCloze, and others. Support for more models is planned.

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
  pipe_or_adapter,
  cache_config=DBCacheConfig(...),
  # Set tp_size > 1 to enable tensor parallelism.
  parallelism_config=ParallelismConfig(tp_size=2),
)
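To give some intuition for why tensor parallelism lowers per-GPU memory, here is a minimal pure-Python sketch of column-wise sharding for a single linear layer. It is a toy illustration only: the helper names (`matmul`, `shard_columns`, `tp_linear`) are hypothetical, nothing here runs on GPUs, and it is not how cache-dit or the NATIVE_PYTORCH backend is implemented.

```python
# Toy illustration of column-wise tensor parallelism for one linear layer.
# Pure Python, no GPUs; this is NOT cache-dit's implementation.

def matmul(x, w):
    """Multiply a vector x (length n_in) by a matrix w (n_in rows, n_out columns)."""
    n_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_out)]


def shard_columns(w, tp_size):
    """Split the output columns of w into tp_size equal shards, one per rank."""
    n_out = len(w[0])
    assert n_out % tp_size == 0, "output dim must be divisible by tp_size"
    step = n_out // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]


def tp_linear(x, w, tp_size):
    """Each rank computes its slice of the output; slices are then concatenated."""
    out = []
    for shard in shard_columns(w, tp_size):
        # In a real setup, each iteration would run on a different GPU,
        # and the concatenation would be a collective (all-gather).
        out.extend(matmul(x, shard))
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                   # single-device result
sharded = tp_linear(x, w, tp_size=2)  # same result, computed shard by shard
params_per_rank = sum(len(row) for row in shard_columns(w, 2)[0])
total_params = sum(len(row) for row in w)
print(full == sharded, params_per_rank, total_params)
```

With `tp_size=2`, each rank holds half of the layer's weights while the combined output matches the unsharded computation. This is the basic reason a larger `tp_size` can reduce VRAM usage per GPU, at the cost of extra communication between ranks.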

L20x1	TP-2	TP-4	+ compile
FLUX, 23.56s	14.61s	10.69s	9.84s
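As a quick reading aid, the speedups implied by the table above can be derived with a few lines of Python. The latencies are copied from the table as-is; the dictionary keys simply reuse the column headers, and treating the first column as the baseline is an assumption.

```python
# Latencies (seconds) copied from the FLUX row of the table above.
# Assumption: the first column (L20x1) is the single-GPU baseline.
latency_s = {
    "L20x1": 23.56,
    "TP-2": 14.61,
    "TP-4": 10.69,
    "+ compile": 9.84,
}

baseline = latency_s["L20x1"]
# Speedup = baseline latency / configuration latency.
speedup = {name: baseline / t for name, t in latency_s.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
```

The speedup is sub-linear in the number of GPUs (roughly 1.6x for TP-2 and 2.2x for TP-4 in this table), which is typical for tensor parallelism because of inter-GPU communication overhead.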

Please note that cache-dit already supports Hybrid Parallelism (CP/USP + TP) for the transformer module of 💥Large DiTs. Please refer to Hybrid 2D and 3D Parallelism for more details.