Ascend NPU Support¶

We are excited to announce that Cache-DiT now provides native support for Ascend NPU. Theoretically, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT’s optimization technologies. Please refer to Ascend NPU Supported Matrix for more details.

Features Support¶

Device	Hybrid Cache	Context Parallel	Tensor Parallel	Text Encoder Parallel	Auto Encoder(VAE) Parallel
Atlas 800T A2	✅	✅	✅	✅	✅
Atlas 800I A2	✅	✅	✅	✅	✅

Attention backend¶

Cache-DiT supports multiple attention backends for better performance. The attention backends supported on Ascend NPU are as follows:

backend	details	parallelism	attn_mask
native	Native SDPA Attention in PyTorch	✅	✅
_native_npu	Optimized Ascend NPU Attention	✅	✅
_npu_fia	NPU Attention for Ring Parallelism	✅	✅

We strongly recommend using the _native_npu backend to achieve better performance.

Environment Requirements¶

There are two installation methods:

Using pip: first prepare the environment manually or via the CANN image, then install cache-dit with pip.
Using Docker: use the pre-built vllm-ascend Docker image from the Ascend NPU community directly as the base image for cache-dit. (Recommended; there is no need to install torch and torch_npu manually.)

Install NPU SDKs Manually¶

This section describes how to install the NPU environment manually.

Requirements¶

OS: Linux
Python: >= 3.10, < 3.12
Hardware: a machine with an Ascend NPU, usually from the Atlas 800 A2 series
Software:

Software	Supported version	Note
Ascend HDK	Refer to here	Required for CANN
CANN	== 8.3.RC2	Required for cache-dit and torch-npu
torch-npu	== 2.8.0	Required for cache-dit
torch	== 2.8.0	Required for torch-npu and cache-dit
NNAL	== 8.3.RC2	Required for libatb.so, enables advanced tensor operations
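As a rough illustration of the Python constraint listed above, the following sketch checks whether a version string falls within `>= 3.10, < 3.12`. The function name `python_version_supported` is hypothetical and not part of cache-dit; the check only looks at the major and minor components.

```shell
# Return 0 if a "major.minor[.patch]" version string satisfies >= 3.10 and < 3.12.
# NOTE: python_version_supported is an illustrative helper, not a cache-dit tool.
python_version_supported() {
    pv_major="${1%%.*}"      # text before the first dot
    pv_rest="${1#*.}"        # text after the first dot
    pv_minor="${pv_rest%%.*}" # text up to the next dot
    [ "$pv_major" -eq 3 ] 2>/dev/null \
        && [ "$pv_minor" -ge 10 ] 2>/dev/null \
        && [ "$pv_minor" -lt 12 ] 2>/dev/null
}

# Check the interpreter on PATH; fall back to "0.0" if python3 is missing.
current="$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || echo "0.0")"
if python_version_supported "$current"; then
    echo "Python $current is within the supported range"
else
    echo "Python $current is outside the supported range (>= 3.10, < 3.12)"
fi
```

Running this before installing torch and torch-npu may catch an unsupported interpreter early, before any wheels are downloaded.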

Configure the CANN environment¶

Before installation, make sure the firmware/driver and CANN are installed correctly; refer to the Ascend Environment Setup Guide for more details. To verify that the Ascend NPU firmware and driver are correctly installed, run:

npu-smi info
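In a setup script, it can help to guard the driver check so that a missing `npu-smi` produces a clear message instead of a "command not found" error. The sketch below is one way to do that; `require_cmd` is a hypothetical helper name introduced here for illustration.

```shell
# Return 0 if the given command is on PATH; otherwise print an error and return 1.
# NOTE: require_cmd is an illustrative helper, not part of the Ascend tooling.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH" >&2
    return 1
}

# Only query the NPU state when the tool is actually available.
if require_cmd npu-smi; then
    npu-smi info
else
    echo "skipping driver check: install the Ascend firmware/driver first" >&2
fi
```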


Configure the software environment¶

The easiest way to prepare your software environment is to use the CANN image directly. We recommend the pre-built vllm-ascend Docker image from the Ascend NPU community as the base image for cache-dit on Ascend NPU. The CANN image can be found on the official Ascend community website: here. The CANN prebuilt image includes NNAL (Ascend Neural Network Acceleration Library), which provides libatb.so for advanced tensor operations, so no additional installation is required when using the prebuilt image.

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Set the pre-built image; replace |cann_image_tag| with the CANN image tag you intend to use
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
    --name cache-dit-ascend \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
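The command above maps a single device into the container. For multi-NPU runs, `docker run` accepts one `--device` flag per device, and a small helper can generate them. This is a sketch only: `davinci_device_flags` is a hypothetical name, and it assumes the devices follow the `/dev/davinci[0-7]` naming mentioned above.

```shell
# Print one "--device /dev/davinciN" flag per index in the inclusive range [first, last].
# NOTE: davinci_device_flags is an illustrative helper, not part of docker or CANN.
davinci_device_flags() {
    dd_flags=""
    dd_i="$1"
    while [ "$dd_i" -le "$2" ]; do
        dd_flags="$dd_flags --device /dev/davinci$dd_i"
        dd_i=$((dd_i + 1))
    done
    # Strip the leading space before printing.
    echo "${dd_flags# }"
}

# Dry run: show the flags for devices 0 through 3 without invoking docker.
davinci_device_flags 0 3
```

To use it, `$(davinci_device_flags 0 3)` could be spliced, unquoted so that it splits into separate arguments, in place of `--device $DEVICE` in the `docker run` command.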

Install PyTorch¶

If installation with pip fails, you can download the torch-2.8.0+cpu wheel file from Link and install it manually.

# torch: aarch64
pip3 install torch==2.8.0
# torch: x86
pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
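After installation, it may be worth confirming that the installed torch matches the 2.8.0 pin. The x86 wheel reports a local suffix (`2.8.0+cpu`), so a plain string comparison would fail; the sketch below strips that suffix first. `version_matches` is a hypothetical helper name, and the check is skipped gracefully when torch is not importable.

```shell
# Return 0 if an installed version equals the expected pin, ignoring a
# local suffix such as "+cpu" (so "2.8.0+cpu" matches "2.8.0").
# NOTE: version_matches is an illustrative helper, not a pip or torch tool.
version_matches() {
    vm_installed="${1%%+*}"
    [ "$vm_installed" = "$2" ]
}

# Report the installed torch version, if torch is importable.
if torch_version="$(python3 -c 'import torch; print(torch.__version__)' 2>/dev/null)"; then
    if version_matches "$torch_version" "2.8.0"; then
        echo "torch $torch_version matches the 2.8.0 pin"
    else
        echo "torch $torch_version does not match the 2.8.0 pin"
    fi
else
    echo "torch is not importable in this environment"
fi
```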

Install torch_npu¶

We strongly recommend installing torch_npu by downloading the torch_npu-2.8.0*.whl file from Link and installing it manually. For more details on installing the Ascend PyTorch Adapter, please refer to https://gitcode.com/Ascend/pytorch

Install Extra Dependencies¶

pip install --no-deps torchvision==0.16.0 
pip install einops sentencepiece accelerate

Use prebuilt Docker Image¶

We recommend using the prebuilt vllm-ascend image from the Ascend NPU community as the base image for cache-dit on Ascend NPU. You can pull the prebuilt image from the image repository and run it with bash. For example:

# Download pre-built image for Ascend NPU
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

# Use the pre-built image for cache-dit
docker run \
    --name cache-dit-ascend \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    --net=host \
    --shm-size=80g \
    --privileged=true \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /data:/data \
    -itd quay.io/ascend/vllm-ascend:v0.13.0rc1 bash

Ascend Environment Variables¶

# Make sure CANN_path is set to your CANN installation path
# e.g., export CANN_path=/usr/local/Ascend
source $CANN_path/ascend-toolkit/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Set NPU devices by ASCEND_RT_VISIBLE_DEVICES env
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
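When only a subset of the NPUs should be visible, the device list can be generated rather than typed by hand. The following is a minimal sketch, assuming the devices to expose are the first N, numbered from 0; `first_n_devices` is a hypothetical helper name.

```shell
# Print "0,1,...,N-1" for the first N NPUs.
# NOTE: first_n_devices is an illustrative helper, not part of CANN.
first_n_devices() {
    fn_list=""
    fn_i=0
    while [ "$fn_i" -lt "$1" ]; do
        # Prepend a comma only when the list is already non-empty.
        fn_list="${fn_list:+$fn_list,}$fn_i"
        fn_i=$((fn_i + 1))
    done
    echo "$fn_list"
}

# Expose the first four NPUs to the current shell and its child processes.
export ASCEND_RT_VISIBLE_DEVICES="$(first_n_devices 4)"
echo "ASCEND_RT_VISIBLE_DEVICES=$ASCEND_RT_VISIBLE_DEVICES"
```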

Once it is done, you can start to set up cache-dit.

Install Cache-DiT Library¶

You can install the stable release of cache-dit from PyPI:

pip3 install -U cache-dit

Or you can install the latest development version from GitHub:

pip3 install git+https://github.com/vipshop/cache-dit.git

Please also install the latest main branch of diffusers for context parallelism:

pip3 install git+https://github.com/huggingface/diffusers.git # or >= 0.36.0

Examples and Benchmark¶

After the environment configuration is complete, users can refer to the Quick Examples, Ascend NPU Benchmark and Ascend NPU Supported Matrix for more details.

pip3 install opencv-python-headless einops imageio-ffmpeg ftfy 
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latest

Single NPU Inference¶

The easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single NPU inference. For example:

# use default model path, e.g., "black-forest-labs/FLUX.1-dev"
python3 -m cache_dit.generate flux --attn _native_npu
python3 -m cache_dit.generate qwen_image --attn _native_npu
python3 -m cache_dit.generate flux --cache --attn _native_npu
python3 -m cache_dit.generate qwen_image --cache --attn _native_npu

Distributed Inference¶

cache-dit is designed to work with Context Parallelism and Tensor Parallelism. For example:

torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --cache --attn _native_npu
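The commands above hard-code `--nproc_per_node=4`. One possible way to keep the process count in step with the visible devices is to derive it from a device list such as the value of `ASCEND_RT_VISIBLE_DEVICES`. The sketch below only prints the resulting command and does not run it; `count_devices` and `print_torchrun_cmd` are hypothetical helper names, and the flags are the ones used in the examples above.

```shell
# Count the comma-separated entries in a device list such as "0,1,2,3".
# NOTE: both helpers are illustrative and not part of cache-dit or torchrun.
count_devices() {
    cd_old_ifs="$IFS"
    IFS=,
    set -- $1          # split the list on commas into positional parameters
    IFS="$cd_old_ifs"
    echo "$#"
}

# Print (do not execute) a torchrun command sized to the given device list.
# Usage: print_torchrun_cmd <model> <device-list>
print_torchrun_cmd() {
    pt_n="$(count_devices "$2")"
    echo "torchrun --nproc_per_node=$pt_n -m cache_dit.generate $1 --parallel ulysses --cache --attn _native_npu"
}

# Dry run for four visible devices.
print_torchrun_cmd flux "0,1,2,3"
```

This does not check whether a given process count is valid for a particular model, so the printed command should be reviewed before it is run.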