This is the official implementation of WISA, designed to enhance Text-to-Video models by improving their ability to simulate the real world.
WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
Jing Wang*, Ao Ma*, Ke Cao*, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng‡, Yuhui Yin, Xiaodan Liang‡ (*Equal Contribution, ‡Corresponding Authors)
- [2025.05.15] 🔥 We are excited to announce the official release of WISA's codebase and model weights on GitHub! This implementation is built upon the powerful finetrainers framework.
- [2025.03.28] We have uploaded the WISA-80K dataset to Hugging Face, including processed video clips and annotations.
- [2025.03.12] We have released our paper WISA and created a dedicated project homepage.
| Wan2.1-14B | WISA | Prompt |
|---|---|---|
| wan_1.mp4 | wisa_wan_1.mp4 | A dry clump of soil rests on a flat surface, with fine details of its texture and cracks visible. ... |
| wan_2.mp4 | wisa_wan_2.mp4 | The camera focuses on a toothpaste tube on the bathroom countertop. As a finger gently applies... |
| wan_3.mp4 | wisa_wan_3.mp4 | A bowl of clear water sits in the center of a freezer. As the temperature gradually drops... |
Clone this repository and install the required packages.
```bash
git clone https://github.com/360CVGroup/WISA.git
cd WISA
conda create -n wisa python=3.10
conda activate wisa
pip install -r requirements.txt
```
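To confirm the environment is usable, a quick optional check (not part of the original setup steps) that PyTorch imports cleanly and sees a CUDA device:

```bash
# Optional sanity check: PyTorch should import and detect a CUDA GPU.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```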
Please download the CogVideoX and Wan2.1 checkpoints from ModelScope and put them in `./pretrain_models/`.
```bash
mkdir ./pretrain_models
cd ./pretrain_models
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B-Diffusers --local_dir ./Wan2.1-T2V-14B-Diffusers
modelscope download ZhipuAI/CogVideoX-5b --local_dir ./CogVideoX-5b-Diffusers
```
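Both checkpoint directories from the commands above should now be present; since the shell is still inside `./pretrain_models/`, a simple listing verifies this:

```bash
# Expect Wan2.1-T2V-14B-Diffusers and CogVideoX-5b-Diffusers in the listing.
ls
```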
Please download the WISA weights from Hugging Face and put them in `./pretrain_models/WISA/`.
```bash
git lfs install
git clone https://huggingface.co/qihoo360/WISA
cd ..
```
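Because the clone runs inside `./pretrain_models/`, the weights land at `./pretrain_models/WISA/` as expected; you can verify from the repository root:

```bash
# The WISA weights should now sit alongside the base checkpoints.
ls ./pretrain_models/WISA
```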
You can revise `MODEL_TYPE`, `GEN_TYPE`, `PROMPT_PATH`, `OUTPUT_FILE`, and `LORA_PATH` in `inference.sh` for different inference settings. Then run:

```bash
sh inference.sh
```
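For orientation, here is a minimal sketch of those variables; the values below are illustrative assumptions, not the defaults shipped in `inference.sh`:

```bash
# Hypothetical example values -- adjust to your own paths and model choice.
MODEL_TYPE="wan"                          # assumed name for the base-model selector
GEN_TYPE="t2v"                            # assumed name for the generation mode
PROMPT_PATH="./prompts/test_prompts.txt"  # assumed: text file of prompts
OUTPUT_FILE="./outputs"                   # assumed: where generated videos are written
LORA_PATH="./pretrain_models/WISA"        # WISA weights downloaded above
```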
Download the WISA-80K dataset from Hugging Face.
This project supports precomputing and saving video latents and text embeddings, so the VAE and text encoder do not need to be loaded onto the GPU during training, reducing GPU memory usage. This step is essential when training Wan2.1-14B; without it, training will run out of memory (OOM).
Step 1: Add the following parameters to the `dataset_cmd` in your training script (e.g., `examples/training/sft/wan/crush_smol_lora/train_wisa.sh`), and ensure you have sufficient storage space available.
```bash
dataset_cmd=(
  --dataset_config $TRAINING_DATASET_CONFIG
  --dataset_shuffle_buffer_size 10
  --precomputation_items 2000        # Number of samples to precompute
  --enable_precomputation            # Flag to activate precomputation
  --precomputation_once
  --precomputation_dir ./cache/path  # Directory for cached outputs
  --hash_save                        # Enable hash-based filename storage
  --first_samples
)
```
Step 2: Configure the dataset paths in `examples/training/sft/wan/crush_smol_lora/training_wisa.json`, then execute:

```bash
sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh
```
"Note: Process data in batches to prevent CPU cache overload (recommended maximum: 12,000 samples per batch)."
Step 3: Disable the `--enable_precomputation` flag, then rerun the training script to train from the cached latents:

```bash
dataset_cmd=(
  --dataset_config $TRAINING_DATASET_CONFIG
  --dataset_shuffle_buffer_size 10
  --precomputation_items 2000        # Number of samples to precompute
  # --enable_precomputation          # Disabled: reuse the cache written in Step 2
  --precomputation_once
  --precomputation_dir ./cache/path  # Directory for cached outputs
  --hash_save                        # Enable hash-based filename storage
  --first_samples
)
```

```bash
sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh
```
We have disabled validation during training: a bug in the validation phase introduces video-generation artifacts, so validation outputs deviate significantly from test-phase results.
This work stands on the shoulders of groundbreaking research and open-source contributions. We extend our deepest gratitude to the authors and contributors of the following projects:
- CogVideoX - For their pioneering work in video generation
- Wan2.1 - For their foundational contributions to large-scale video models
Special thanks to the finetrainers framework for enabling efficient model training - your excellent work has been invaluable to this project.
```bibtex
@misc{wang2025wisa,
      title={WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation},
      author={Jing Wang and Ao Ma and Ke Cao and Jun Zheng and Zhanjie Zhang and Jiasong Feng and Shanyuan Liu and Yuhang Ma and Bo Cheng and Dawei Leng and Yuhui Yin and Xiaodan Liang},
      year={2025},
      eprint={2503.08153},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.08153},
}
```