Ascend NPU Support¶
We are excited to announce that Cache-DiT now provides native support for Ascend NPU. Theoretically, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT’s optimization technologies. Please refer to Ascend NPU Supported Matrix for more details.
Features Support¶
| Device | Hybrid Cache | Context Parallel | Tensor Parallel | Text Encoder Parallel | Autoencoder (VAE) Parallel |
|---|---|---|---|---|---|
| Atlas 800T A2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Atlas 800I A2 | ✅ | ✅ | ✅ | ✅ | ✅ |
Attention backend¶
Cache-DiT supports multiple attention backends for better performance. The attention backends supported on Ascend NPU are as follows:
| backend | details | parallelism | attn_mask |
|---|---|---|---|
| native | Native SDPA Attention in PyTorch | ✅ | ✅ |
| _native_npu | Optimized Ascend NPU Attention | ✅ | ✅ |
| _npu_fia | NPU Attention for Ring Parallelism | ✅ | ✅ |
We strongly recommend using the _native_npu backend to achieve better performance.
Environment Requirements¶
There are two installation methods:
- Using pip: first prepare the environment manually or via the CANN image, then install cache-dit using pip.
- Using docker: use the Ascend NPU community: vllm-ascend pre-built docker image as the base image for cache-dit directly. (Recommended; there is no need to install torch and torch_npu manually.)
Install NPU SDKs Manually¶
This section describes how to install the NPU environment manually.
Requirements¶
- OS: Linux
- Python: >= 3.10, < 3.12
- Hardware with an Ascend NPU, usually the Atlas 800 A2 series
- Software:
| Software | Supported version | Note |
|---|---|---|
| Ascend HDK | Refer to here | Required for CANN |
| CANN | == 8.3.RC2 | Required for cache-dit and torch-npu |
| torch-npu | == 2.8.0 | Required for cache-dit |
| torch | == 2.8.0 | Required for torch-npu and cache-dit |
| NNAL | == 8.3.RC2 | Required for libatb.so, enables advanced tensor operations |
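The Python constraint above (>= 3.10, < 3.12) can be checked before installing anything else. The following is a minimal sketch, not part of cache-dit: `version_in_range` and `python_minor_version` are hypothetical helper names, and the comparison assumes GNU `sort -V` is available, as it is on typical Linux distributions.

```shell
# Hypothetical helper: succeed if version $1 satisfies $2 <= $1 < $3.
# Assumes GNU coreutils `sort -V` (version-aware sort).
version_in_range() {
    local ver="$1"
    local min="$2"
    local max="$3"
    # ver >= min: the minimum must sort first (or be equal).
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ] || return 1
    # ver < max: ver must differ from max and sort before it.
    [ "$ver" != "$max" ] || return 1
    [ "$(printf '%s\n%s\n' "$ver" "$max" | sort -V | head -n1)" = "$ver" ]
}

# Print the interpreter version as "major.minor".
python_minor_version() {
    python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])'
}

# Usage:
#   if version_in_range "$(python_minor_version)" 3.10 3.12; then
#       echo "Python version is supported"
#   else
#       echo "Python version is outside >= 3.10, < 3.12" >&2
#   fi
```

A version-aware sort is used because a plain string comparison would order 3.9 after 3.10.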
Configure CANN environment.¶
Before installation, make sure the firmware/driver and CANN are installed correctly. To verify that the Ascend NPU firmware and driver were installed correctly, run npu-smi info. Please refer to the Ascend Environment Setup Guide for more details.
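As a rough pre-flight check, the sketch below lists the NPU device nodes and then calls `npu-smi info` if the tool is on the PATH. The helper names `list_npu_devices` and `check_npu_driver` are hypothetical, and the `/dev/davinciN` naming is assumed from the device paths used elsewhere in this guide.

```shell
# Hypothetical helper: print the davinci device nodes under a directory
# (default: /dev). The pattern matches davinci0, davinci1, ... but not
# davinci_manager, because it requires a digit after "davinci".
list_npu_devices() {
    local dev_dir="${1:-/dev}"
    local node
    for node in "$dev_dir"/davinci[0-9]*; do
        # If nothing matches, the pattern stays literal and fails this test.
        [ -e "$node" ] && echo "$node"
    done
    return 0
}

# Hypothetical helper: query the driver through npu-smi when it is installed.
check_npu_driver() {
    if ! command -v npu-smi >/dev/null 2>&1; then
        echo "npu-smi not found in PATH; is the Ascend driver installed?" >&2
        return 1
    fi
    npu-smi info
}

# Usage:
#   list_npu_devices
#   check_npu_driver
```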
Configure software environment.¶
The easiest way to prepare your software environment is to use the CANN image directly. We recommend using the Ascend NPU community: vllm-ascend pre-built docker image as the base image of Ascend NPU for cache-dit. The CANN image can be found on the official Ascend community website: here. The prebuilt CANN image includes NNAL (Ascend Neural Network Acceleration Library), which provides libatb.so for advanced tensor operations, so no additional installation is required when using it.
```shell
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the pre-built image. |cann_image_tag| is a placeholder:
# replace it with a real CANN image tag before running.
export IMAGE="quay.io/ascend/cann:|cann_image_tag|"
docker run --rm \
    --name cache-dit-ascend \
    --shm-size=1g \
    --device "$DEVICE" \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it "$IMAGE" bash
```
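The command above maps a single NPU into the container. For multi-NPU runs, each card presumably needs its own `--device` flag; the sketch below generates those flags from a list of indices. `npu_device_flags` is a hypothetical helper name, and the `/dev/davinciN` naming follows the comment in the command above.

```shell
# Hypothetical helper: turn NPU indices into `docker run --device` flags.
npu_device_flags() {
    local flags=""
    local idx
    for idx in "$@"; do
        # Append a space separator only when flags is already non-empty.
        flags="${flags}${flags:+ }--device /dev/davinci${idx}"
    done
    printf '%s\n' "$flags"
}

# Usage: expose NPUs 0-3 instead of the single $DEVICE
# (leave the substitution unquoted so it splits into separate arguments):
#   docker run --rm $(npu_device_flags 0 1 2 3) \
#       --device /dev/davinci_manager \
#       --device /dev/devmm_svm \
#       --device /dev/hisi_hdc \
#       -it "$IMAGE" bash
```

The driver volume mounts from the original command are omitted in the usage comment for brevity and would still be needed.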
Install PyTorch¶
Install torch with pip using the command that matches your CPU architecture. If the pip installation fails, you can download the torch-2.8.0+cpu wheel (.whl) file from Link and install it manually.
```shell
# torch: aarch64
pip3 install torch==2.8.0
# torch: x86
pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
```
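The choice between the two commands can be made automatically from the output of `uname -m`. This is a small sketch under that assumption; `torch_install_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: print the torch install command for an architecture.
# With no argument, the architecture is taken from `uname -m`.
torch_install_cmd() {
    local arch="${1:-$(uname -m)}"
    case "$arch" in
        aarch64)
            echo "pip3 install torch==2.8.0"
            ;;
        x86_64)
            echo "pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu"
            ;;
        *)
            echo "unsupported architecture: $arch" >&2
            return 1
            ;;
    esac
}

# Usage:
#   cmd=$(torch_install_cmd) && echo "+ $cmd" && $cmd
```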
Install torch_npu¶
We strongly recommend installing torch_npu by downloading the torch_npu-2.8.0*.whl file from Link and installing it manually. For more details about installing the Ascend PyTorch Adapter, please refer to https://gitcode.com/Ascend/pytorch
Install Extra Dependencies¶
Use prebuilt Docker Image¶
We recommend using the prebuilt image from the Ascend NPU community: vllm-ascend as the base image of Ascend NPU for cache-dit. You can just pull the prebuilt image from the image repository and run it with bash. For example:
```shell
# Download pre-built image for Ascend NPU
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
# Use the pre-built image for cache-dit
docker run \
    --name cache-dit-ascend \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    --net=host \
    --shm-size=80g \
    --privileged=true \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /data:/data \
    -itd quay.io/ascend/vllm-ascend:v0.13.0rc1 bash
```
Ascend Environment variables¶
```shell
# Make sure CANN_path is set to your CANN installation path
# e.g., export CANN_path=/usr/local/Ascend
source "$CANN_path/ascend-toolkit/set_env.sh"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Set NPU devices by ASCEND_RT_VISIBLE_DEVICES env
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
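To confirm that the variables above take effect, one option is to build the device list programmatically and then ask torch_npu how many devices it sees. The sketch below assumes torch and torch_npu are already installed; `npu_index_list` and `check_torch_npu` are hypothetical helper names, and `seq -s,` assumes GNU coreutils.

```shell
# Hypothetical helper: print "0,1,...,N-1" for N devices (GNU seq).
npu_index_list() {
    local count="$1"
    seq -s, 0 $((count - 1))
}

# Hypothetical helper: report what torch_npu can see.
# Importing torch_npu registers the torch.npu namespace.
check_torch_npu() {
    python3 -c 'import torch, torch_npu; print("available:", torch.npu.is_available(), "count:", torch.npu.device_count())'
}

# Usage:
#   export ASCEND_RT_VISIBLE_DEVICES="$(npu_index_list 4)"   # 0,1,2,3
#   check_torch_npu
```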
Once this is done, you can start setting up cache-dit.
Install Cache-DiT Library¶
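You can install the stable release of cache-dit from PyPI, for example with pip3 install -U cache-dit.
Or you can install the latest development version from GitHub with pip3 install git+https://github.com/vipshop/cache-dit.git. Please also install the latest main branch of diffusers for context parallelism with pip3 install git+https://github.com/huggingface/diffusers.git.
Examples and Benchmark¶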
After the environment configuration is complete, users can refer to the Quick Examples, Ascend NPU Benchmark and Ascend NPU Supported Matrix for more details.
```shell
pip3 install opencv-python-headless einops imageio-ffmpeg ftfy
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latest
```
Single NPU Inference¶
The easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single NPU inference. For example:
```shell
# Use the default model path, e.g., "black-forest-labs/FLUX.1-dev"
# Baseline runs without cache
python3 -m cache_dit.generate flux --attn _native_npu
python3 -m cache_dit.generate qwen_image --attn _native_npu
# Runs with hybrid cache enabled (--cache)
python3 -m cache_dit.generate flux --cache --attn _native_npu
python3 -m cache_dit.generate qwen_image --cache --attn _native_npu
```
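To compare a baseline run against a cached run for the same model, the commands above can be generated and timed in a loop. This is only a sketch: `generate_cmd` is a hypothetical helper that reproduces the command lines shown above, and wall-clock time from `time` includes model loading, so it is a coarse comparison at best.

```shell
# Hypothetical helper: build the generate command for a model,
# with (use_cache=1) or without (use_cache=0) the --cache flag.
generate_cmd() {
    local model="$1"
    local use_cache="${2:-0}"
    local cmd="python3 -m cache_dit.generate ${model}"
    if [ "$use_cache" = "1" ]; then
        cmd="${cmd} --cache"
    fi
    echo "${cmd} --attn _native_npu"
}

# Usage (requires the NPU environment configured as described above):
#   for use_cache in 0 1; do
#       cmd=$(generate_cmd flux "$use_cache")
#       echo "+ $cmd"
#       time $cmd
#   done
```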
Distributed Inference¶
cache-dit is designed to work with Context Parallelism and Tensor Parallelism. For example:
```shell
# Ulysses parallelism across 4 NPUs, without cache
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --attn _native_npu
# Ulysses parallelism across 4 NPUs, with hybrid cache enabled
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --cache --attn _native_npu
```
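The examples above hard-code four processes. One way to keep `--nproc_per_node` in step with the devices actually exposed is to derive it from `ASCEND_RT_VISIBLE_DEVICES`. The sketch below assumes that variable holds a comma-separated list, as in the environment setup earlier; `count_visible_devices` is a hypothetical helper name.

```shell
# Hypothetical helper: count the entries of a comma-separated device list.
# Falls back to $ASCEND_RT_VISIBLE_DEVICES when no argument is given,
# and prints 0 when the list is empty or unset.
count_visible_devices() {
    local list="${1:-${ASCEND_RT_VISIBLE_DEVICES:-}}"
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    # One entry per line, then count the non-empty lines.
    echo "$list" | tr ',' '\n' | grep -c .
}

# Usage:
#   export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
#   torchrun --nproc_per_node="$(count_visible_devices)" \
#       -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu
```

Whether a given model supports a particular process count is not checked here.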