Tensor Parallelism
cache-dit is also compatible with Tensor Parallelism (TP). Currently, the Hybrid Cache + Tensor Parallelism scheme is supported via the NATIVE_PYTORCH parallelism backend. Tensor Parallelism can further speed up inference and reduce VRAM usage per GPU. For more details, see 📚examples/parallelism. cache-dit currently supports tensor parallelism for FLUX.1, 🔥FLUX.2, Qwen-Image, Qwen-Image-Lightning, Wan2.1, Wan2.2, HunyuanImage-2.1, HunyuanVideo, VisualCloze, and others; support for more models is planned.
```python
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
    pipe_or_adapter,
    cache_config=DBCacheConfig(...),
    # Set tp_size > 1 to enable tensor parallelism.
    parallelism_config=ParallelismConfig(tp_size=2),
)
```
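
To give some intuition for why tensor parallelism lowers per-GPU memory, here is a minimal toy sketch in pure Python. It simulates sharding one linear layer's weight column-wise across `tp_size` ranks. It is not cache-dit's implementation, and all helper names (`matmul`, `split_columns`, `column_parallel_linear`) are hypothetical, chosen only for this illustration.

```python
# Toy illustration of tensor parallelism on a single linear layer.
# Runs on CPU with plain lists; NOT how cache-dit implements TP internally.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, tp_size):
    """Shard weight matrix w column-wise into tp_size equal shards."""
    n = len(w[0])
    assert n % tp_size == 0, "output dim must be divisible by tp_size"
    step = n // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]

def column_parallel_linear(x, shards):
    """Each simulated rank multiplies x by its own shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in shards]  # one result per rank
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]                      # activations (2 x 2)
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # full weight (2 x 4)

shards = split_columns(w, tp_size=2)
full = matmul(x, w)                          # single-device reference
sharded = column_parallel_linear(x, shards)  # simulated TP-2 result

params_full = len(w) * len(w[0])
params_per_rank = len(shards[0]) * len(shards[0][0])
print(full == sharded, params_full, params_per_rank)
```

Each simulated rank stores only half of the weight entries yet the concatenated output matches the unsharded result. In a real multi-GPU run, the concatenation step becomes a collective communication between devices.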
| Model | L20x1 | TP-2 | TP-4 | + compile |
|---|---|---|---|---|
| FLUX | 23.56s | 14.61s | 10.69s | 9.84s |
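
The latencies in the table above can be turned into relative speedups with a few lines of Python. This is a small sketch that assumes the `L20x1` column is the single-GPU baseline; the dictionary keys simply mirror the table's column labels.

```python
# FLUX latencies in seconds, copied from the table above.
# Assumption: "L20x1" is the single-GPU baseline without tensor parallelism.
latencies = {"L20x1": 23.56, "TP-2": 14.61, "TP-4": 10.69, "+ compile": 9.84}

baseline = latencies["L20x1"]

# Speedup = baseline latency / configuration latency, rounded to 2 decimals.
speedups = {name: round(baseline / t, 2) for name, t in latencies.items()}

for name, s in speedups.items():
    print(f"{name:>10}: {s:.2f}x")
```

Under that assumption, TP-2 is roughly 1.61x faster than the baseline and TP-4 roughly 2.20x, so the gain from adding GPUs is sublinear on this benchmark.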
Note that cache-dit already supports Hybrid Parallelism (CP/USP + TP) for the transformer module of 💥large DiTs. See Hybrid 2D and 3D Parallelism for more details.



