Ascend NPU Support

We are excited to announce that Cache-DiT now provides native support for Ascend NPU. In principle, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT’s optimization techniques. Please refer to the Ascend NPU Supported Matrix for more details.

Features Support

Device | Hybrid Cache | Context Parallel | Tensor Parallel | Text Encoder Parallel | Auto Encoder (VAE) Parallel
Atlas 800T A2
Atlas 800I A2

Attention backend

Cache-DiT supports multiple attention backends for better performance. The attention backends supported on Ascend NPU are listed below:

backend | details | parallelism | attn_mask
native | Native SDPA Attention in PyTorch
_native_npu | Optimized Ascend NPU Attention
_npu_fia | NPU Attention for Ring Parallelism

We strongly recommend using the _native_npu backend to achieve better performance.

Environment Requirements

There are two installation methods:

  • Using pip: first prepare the environment manually or via the CANN image, then install cache-dit with pip.
  • Using Docker: use the pre-built vllm-ascend Docker image from the Ascend NPU community directly as the base image for cache-dit. (Recommended; there is no need to install torch and torch_npu manually.)

Install NPU SDKs Manually

This section describes how to install the NPU environment manually.

Requirements

  • OS: Linux
  • Python: >= 3.10, < 3.12
  • Hardware with an Ascend NPU, usually the Atlas 800 A2 series
  • Software:

Software | Supported version | Note
Ascend HDK | Refer to here | Required for CANN
CANN | == 8.3.RC2 | Required for cache-dit and torch-npu
torch-npu | == 2.8.0 | Required for cache-dit
torch | == 2.8.0 | Required for torch-npu and cache-dit
NNAL | == 8.3.RC2 | Required for libatb.so, enables advanced tensor operations
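
The Python constraint listed in the requirements (>= 3.10, < 3.12) can be checked before installing anything else. The following is a minimal sketch in POSIX shell; `python_version_supported` is a hypothetical helper name, not part of cache-dit or CANN.

```shell
# Succeeds (exit 0) when a version string such as "3.11.4"
# satisfies the documented range: >= 3.10 and < 3.12.
python_version_supported() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second version component
    # Reject anything whose minor component is not a plain number.
    case "$minor" in ''|*[!0-9]*) return 1 ;; esac
    [ "$major" = "3" ] && [ "$minor" -ge 10 ] && [ "$minor" -lt 12 ]
}

# Report on the interpreter found on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
    found=$(python3 -c 'import platform; print(platform.python_version())')
    if python_version_supported "$found"; then
        echo "Python $found is within the supported range"
    else
        echo "Python $found is outside the supported range (>= 3.10, < 3.12)"
    fi
fi
```

The check only looks at the major and minor components, so any patch release of 3.10 or 3.11 passes.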

Configure CANN environment.

Before installation, make sure the firmware/driver and CANN are installed correctly. To verify that the Ascend NPU firmware and driver are installed correctly, run:

npu-smi info

Please refer to Ascend Environment Setup Guide for more details.

Configure software environment.

The easiest way to prepare your software environment is to use the CANN image directly. We recommend the pre-built vllm-ascend Docker image from the Ascend NPU community as the base image for cache-dit. The CANN image can be found on the official Ascend community website: here. The prebuilt CANN image includes NNAL (Ascend Neural Network Acceleration Library), which provides libatb.so for advanced tensor operations, so no additional installation is required when using the prebuilt image.

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Set the pre-built image; replace the |cann_image_tag| placeholder with an actual CANN image tag
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
    --name cache-dit-ascend \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
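
The command above passes a single NPU into the container through `$DEVICE`. For multi-NPU runs, each `/dev/davinciN` node would need its own `--device` flag. The sketch below shows one way to generate those flags; `davinci_device_flags` is a hypothetical helper introduced here for illustration only.

```shell
# Print one "--device /dev/davinciN" flag per NPU index passed as an argument.
davinci_device_flags() {
    flags=""
    for idx in "$@"; do
        flags="$flags --device /dev/davinci$idx"
    done
    # Strip the leading space before printing.
    echo "${flags# }"
}

# Example: flags for NPUs 0 through 3.
davinci_device_flags 0 1 2 3
```

The output can be spliced into `docker run` in place of `--device $DEVICE` with an unquoted command substitution, for example `docker run $(davinci_device_flags 0 1 2 3) ...`, so that the shell splits it into separate arguments.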

Install PyTorch

If installation with pip fails, you can download the torch-2.8.0+cpu wheel file from Link and install it manually.

# torch: aarch64
pip3 install torch==2.8.0
# torch: x86
pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
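
Since the install command differs by architecture, a script can pick it from `uname -m`. This is a sketch under the assumption that "x86" above corresponds to the `x86_64` value reported by `uname -m`; `torch_install_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
# Echo the torch install command from this guide for a given machine architecture.
torch_install_cmd() {
    case "$1" in
        aarch64)
            echo "pip3 install torch==2.8.0" ;;
        x86_64)
            echo "pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu" ;;
        *)
            echo "unsupported architecture: $1" >&2
            return 1 ;;
    esac
}

# Show the command for the current host; unknown architectures only print a warning.
torch_install_cmd "$(uname -m)" || true
```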

Install torch_npu

We strongly recommend installing torch_npu by downloading the torch_npu-2.8.0*.whl file from Link and installing it manually. For more details about installing the Ascend PyTorch Adapter, please refer to https://gitcode.com/Ascend/pytorch

Install Extra Dependencies

pip install --no-deps torchvision==0.16.0 
pip install einops sentencepiece accelerate
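
After the manual steps, it may be worth confirming that the installed packages match the pins in the requirements table (torch 2.8.0, torch-npu 2.8.0). The sketch below is one way to do that; `base_version` and `version_matches` are hypothetical helpers, and the comparison only strips a local suffix such as `+cpu`, so other suffixes (for example a post-release tag) would be reported as a mismatch.

```shell
# Drop a local version suffix so that "2.8.0+cpu" compares equal to "2.8.0".
base_version() {
    echo "${1%%+*}"
}

# Succeed when the installed version ($1) matches the expected pin ($2).
version_matches() {
    [ "$(base_version "$1")" = "$2" ]
}

# Compare what is installed (if anything) against the documented pins.
for spec in torch=2.8.0 torch-npu=2.8.0; do
    pkg=${spec%%=*}
    want=${spec#*=}
    # importlib.metadata is in the Python standard library; an empty result means "not found".
    have=$(python3 -c "from importlib.metadata import version; print(version('$pkg'))" 2>/dev/null) || have=""
    if [ -z "$have" ]; then
        echo "$pkg: not installed"
    elif version_matches "$have" "$want"; then
        echo "$pkg: $have (ok)"
    else
        echo "$pkg: $have (expected $want)"
    fi
done
```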

Use prebuilt Docker Image

We recommend using the prebuilt vllm-ascend image from the Ascend NPU community as the base image for cache-dit. You can pull the prebuilt image from the image repository and run it with bash. For example:

# Download pre-built image for Ascend NPU
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

# Use the pre-built image for cache-dit
docker run \
    --name cache-dit-ascend \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    --net=host \
    --shm-size=80g \
    --privileged=true \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /data:/data \
    -itd quay.io/ascend/vllm-ascend:v0.13.0rc1 bash

Ascend Environment variables

# Make sure CANN_path is set to your CANN installation path
# e.g., export CANN_path=/usr/local/Ascend
source $CANN_path/ascend-toolkit/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Set NPU devices by ASCEND_RT_VISIBLE_DEVICES env
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Once it is done, you can start to set up cache-dit.
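
To expose only a subset of NPUs, `ASCEND_RT_VISIBLE_DEVICES` can be set to a shorter list than the full `0,1,2,3,4,5,6,7`. The sketch below builds such a list for the first N devices; `first_n_devices` is a hypothetical helper, and it assumes the devices you want are numbered contiguously from 0.

```shell
# Print a comma-separated list of the first N device indices, e.g. 4 -> 0,1,2,3.
first_n_devices() {
    n=$1
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        # Append a comma only when the list already has an entry.
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    echo "$list"
}

# Expose only the first four NPUs to processes started from this shell.
export ASCEND_RT_VISIBLE_DEVICES="$(first_n_devices 4)"
echo "$ASCEND_RT_VISIBLE_DEVICES"
```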

Install Cache-DiT Library

You can install the stable release of cache-dit from PyPI:

pip3 install -U cache-dit

Or you can install the latest development version from GitHub:

pip3 install git+https://github.com/vipshop/cache-dit.git

Please also install the latest main branch of diffusers for context parallelism:

pip3 install git+https://github.com/huggingface/diffusers.git # or >= 0.36.0

Examples and Benchmark

After the environment configuration is complete, users can refer to the Quick Examples, Ascend NPU Benchmark and Ascend NPU Supported Matrix for more details.

pip3 install opencv-python-headless einops imageio-ffmpeg ftfy 
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latest

Single NPU Inference

The easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single NPU inference. For example:

# use default model path, e.g., "black-forest-labs/FLUX.1-dev"
python3 -m cache_dit.generate flux --attn _native_npu
python3 -m cache_dit.generate qwen_image --attn _native_npu
python3 -m cache_dit.generate flux --cache --attn _native_npu
python3 -m cache_dit.generate qwen_image --cache --attn _native_npu
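
To compare a baseline run against a `--cache` run, the two commands can be wrapped in a small timing helper. This is a sketch only: `run_timed` is a hypothetical shell function, not a cache-dit feature, and it measures whole-process wall-clock time, which includes model loading and is therefore only a rough indicator.

```shell
# Run a command, then report its wall-clock time in whole seconds on stderr.
# The command's own exit status is passed through unchanged.
run_timed() {
    start=$(date +%s)
    "$@" && status=0 || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s ($*)" >&2
    return "$status"
}

# Example usage on a machine with an NPU (left commented out here):
# run_timed python3 -m cache_dit.generate flux --attn _native_npu
# run_timed python3 -m cache_dit.generate flux --cache --attn _native_npu
```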

Distributed Inference

cache-dit is designed to work with Context Parallelism and Tensor Parallelism. For example:

torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --cache --attn _native_npu
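
The examples above hard-code `--nproc_per_node=4`. One option is to derive the process count from `ASCEND_RT_VISIBLE_DEVICES` so the two settings stay consistent. The sketch below assumes that variable holds a comma-separated index list as shown earlier; `count_devices` is a hypothetical helper, and the final command is only printed, not executed.

```shell
# Count the entries in a comma-separated device list, e.g. "0,1,2,3" -> 4.
count_devices() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    old_ifs=$IFS
    IFS=,
    set -- $1          # split the list on commas into positional parameters
    IFS=$old_ifs
    echo "$#"
}

# Fall back to four devices when the variable is not set.
devices="${ASCEND_RT_VISIBLE_DEVICES:-0,1,2,3}"
nproc=$(count_devices "$devices")
echo "torchrun --nproc_per_node=$nproc -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu"
```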