WISA

This is the official reproduction of WISA, designed to enhance Text-to-Video models by improving their ability to simulate the real world.

WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
Jing Wang*, Ao Ma*, Ke Cao*, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng‡, Yuhui Yin, Xiaodan Liang‡(*Equal Contribution, ‡Corresponding Authors)
arXiv Project Page

📰 News

  • [2025.05.15] 🔥 We are excited to announce the official release of WISA's codebase and model weights on GitHub! This implementation is built upon the powerful finetrainers framework.
  • [2025.03.28] We have uploaded the WISA-80K dataset to Hugging Face, including processed video clips and annotations.
  • [2025.03.12] We have released our paper WISA and created a dedicated project homepage.
Demo comparisons (Wan2.1-14B baseline vs. WISA):

  • wan_1.mp4 vs. wisa_wan_1.mp4: A dry clump of soil rests on a flat surface, with fine details of its texture and cracks visible. ...
  • wan_2.mp4 vs. wisa_wan_2.mp4: The camera focuses on a toothpaste tube on the bathroom countertop. As a finger gently applies...
  • wan_3.mp4 vs. wisa_wan_3.mp4: A bowl of clear water sits in the center of a freezer. As the temperature gradually drops...

🚀 Quick Start

1. Environment Setup

Clone this repository and install the required packages.

git clone https://github.com/360CVGroup/WISA.git
cd WISA
conda create -n wisa python=3.10
conda activate wisa
pip install -r requirements.txt

2. Download Pretrained Weights

1. Download Text-to-Video Pretrained Models

Please download the CogVideoX and Wan2.1 checkpoints from ModelScope and put them in ./pretrain_models/.

mkdir ./pretrain_models
cd ./pretrain_models
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B-Diffusers --local_dir ./Wan2.1-T2V-14B-Diffusers
modelscope download ZhipuAI/CogVideoX-5b --local_dir ./CogVideoX-5b-Diffusers

2. Download the WISA Pretrained LoRA and Physical-Block Weights

Please download the weights from Hugging Face and put them in ./pretrain_models/WISA/.

git lfs install
git clone https://huggingface.co/qihoo360/WISA
cd ..

3. Generate Video

Revise MODEL_TYPE, GEN_TYPE, PROMPT_PATH, OUTPUT_FILE, and LORA_PATH in inference.sh to match your desired inference settings, then run:

sh inference.sh
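A minimal sketch of the kind of settings inference.sh expects. The variable names come from the instructions above; the values below are only illustrative assumptions, so check the script itself for the options it actually supports:

```shell
# Illustrative values only -- see inference.sh for the supported options.
MODEL_TYPE="wan"                     # hypothetical: which base model to use
GEN_TYPE="wisa"                      # hypothetical: baseline vs. WISA generation
PROMPT_PATH="./prompts/physics.txt"  # hypothetical path to a prompt list
OUTPUT_FILE="./outputs/"             # hypothetical output location
LORA_PATH="./pretrain_models/WISA"   # where the WISA weights were downloaded
```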

✨ Training

1. Download WISA-80K

Download the WISA-80K dataset from Hugging Face.

2. Precomputing Latents and Text Embeddings (Optional)

This project supports precomputing and saving video latents and text embeddings, so the VAE and text encoder do not need to be loaded onto the GPU during training, which reduces GPU memory usage. This step is essential when training Wan2.1-14B; skipping it will cause an out-of-memory (OOM) error.

Step 1: Add the following parameters to dataset_cmd in your training script (e.g. examples/training/sft/wan/crush_smol_lora/train_wisa.sh), and make sure you have sufficient storage space available.

dataset_cmd=(
  --dataset_config $TRAINING_DATASET_CONFIG
  --dataset_shuffle_buffer_size 10
  --precomputation_items 2000        # Number of samples to precompute
  --enable_precomputation            # Flag to activate precomputation
  --precomputation_once
  --precomputation_dir ./cache/path  # Directory for cached outputs
  --hash_save                        # Enable hash-based filename storage
  --first_samples
)
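The caching idea behind --precomputation_dir and --hash_save can be sketched as follows. This is a hypothetical illustration, not the project's actual code: each cached file is named by a hash of its sample, so a rerun can skip samples that were already precomputed.

```shell
# Hypothetical sketch of hash-based cache naming (not the real implementation).
cache_dir=./cache/path
mkdir -p "$cache_dir"

prompt="A bowl of clear water sits in the center of a freezer."
# Derive a stable filename from the prompt's SHA-256 hash.
key=$(printf '%s' "$prompt" | sha256sum | cut -d ' ' -f 1)
cache_file="$cache_dir/$key.pt"

if [ -f "$cache_file" ]; then
  echo "cache hit, skipping: $cache_file"
else
  echo "precomputing latents and embeddings for: $prompt"
  : > "$cache_file"   # stand-in for saving the real tensors
fi
```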

Step 2: Configure the dataset paths in examples/training/sft/wan/crush_smol_lora/training_wisa.json and run:

sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh

Note: process the data in batches to avoid overloading the CPU cache (recommended maximum: 12,000 samples per batch).
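One way to respect the 12,000-sample limit is to split the full sample list into chunks and precompute one chunk at a time. The sketch below uses a generated stand-in list (all_samples.txt and the 30,000-sample count are assumptions for illustration):

```shell
# Hypothetical sketch: batch the dataset into chunks of at most 12,000 samples.
seq 1 30000 | sed 's/^/sample_/' > all_samples.txt   # stand-in for the real sample list
split -l 12000 all_samples.txt batch_                # -> batch_aa, batch_ab, batch_ac

for batch in batch_*; do
  echo "$batch: $(wc -l < "$batch" | tr -d ' ') samples"
  # Point the dataset config at this batch, then run the precomputation step:
  # sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh
done
```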

Step 3: Disable the --enable_precomputation flag:

dataset_cmd=(
  --dataset_config $TRAINING_DATASET_CONFIG
  --dataset_shuffle_buffer_size 10
  --precomputation_items 2000        # Number of samples to precompute
  # --enable_precomputation            # Flag to activate precomputation
  --precomputation_once
  --precomputation_dir ./cache/path  # Directory for cached outputs
  --hash_save                        # Enable hash-based filename storage
  --first_samples
)

3. Start Training

sh examples/training/sft/wan/crush_smol_lora/train_wisa.sh

We have disabled validation during training: a bug causes artifacts in videos generated at validation time, so validation results deviate significantly from test-phase results.

👍 Acknowledgement

This work stands on the shoulders of groundbreaking research and open-source contributions. We extend our deepest gratitude to the authors and contributors of the following projects:

  • CogVideoX - For their pioneering work in video generation
  • Wan2.1 - For their foundational contributions to large-scale video models

Special thanks to the finetrainers framework for enabling efficient model training - your excellent work has been invaluable to this project.

BibTeX

@misc{wang2025wisa,
  title={WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation},
  author={Jing Wang and Ao Ma and Ke Cao and Jun Zheng and Zhanjie Zhang and Jiasong Feng and Shanyuan Liu and Yuhang Ma and Bo Cheng and Dawei Leng and Yuhui Yin and Xiaodan Liang},
  year={2025},
  eprint={2502.08153},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.08153},
}
