A curated collection of papers, models, and resources for the field of Video Generation.
Note
This repository is proudly maintained by the frontline research mentors at QuenithAI (应达学术). It aims to provide the most comprehensive and cutting-edge map of papers and technologies in the field of video generation.
Your contributions are also vital: feel free to open an issue or submit a pull request to become a collaborator on this repository. We look forward to your participation!
If you require expert 1-on-1 guidance on your submissions to top-tier conferences and journals, we invite you to contact us via WeChat or E-mail.
This repository is built and continuously maintained by the team of frontline research mentors at QuenithAI (应达学术), with the goal of presenting the most comprehensive and up-to-date collection of papers in the field of video generation.
Your contributions matter greatly to us and to the community. We warmly invite you to open an issue or submit a pull request to become a collaborator on this project. We look forward to having you on board!
⚡ Latest Updates
- (Sep 13th, 2025): Added a new direction: 🎯 Reinforcement Learning for Video Generation.
- (Aug 21st, 2025): Added a new direction: 🗣️ Audio-Driven Video Generation.
- (Aug 20th, 2025): Initial commit and repository structure established.
- Controllable Video Generation: A Survey
- Diffusion Model-Based Video Editing: A Survey
- From Sora What We Can See: A Survey of Text-to-Video Generation
- A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
- A Survey on Video Diffusion Models
- Video Diffusion Models: A Survey
- Survey of Video Diffusion Models: Foundations, Implementations, and Applications
- Video Diffusion Generation: Comprehensive Review and Open Problems
- [CVPR 2025] AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
- [CVPR 2025] Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
- [CVPR 2025] Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
- [CVPR 2025] Identity-Preserving Text-to-Video Generation by Frequency Decomposition
- [CVPR 2025] Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
- [CVPR 2025] TransPixeler: Advancing Text-to-Video Generation with Transparency
- [CVPR 2025] LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
- [CVPR 2025] Improving Text-to-Video Generation via Instance-aware Structured Caption
- [CVPR 2025] Compositional Text-to-Video Generation with Blob Video Representations
- [CVPR 2025] Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- [ICCV 2025] T2Bs: Text‑to‑Character Blendshapes via Video Generation
- [ICCV 2025] Animate Your Word: Bringing Text to Life via Video Diffusion Prior
- [NeurIPS 2025] Safe‑Sora: Safe Text‑to‑Video Generation via Graphical Watermarking
- [ICCV 2025] Prompt‑A‑Video: Prompt Your Video Diffusion Model via Preference‑Aligned LLM
- [ICCV 2025] MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text‑to‑Video Generation
- [ICCV 2025] TITAN‑Guide: Taming Inference‑Time Alignment for Guided Text‑to‑Video Diffusion Models
- [ICCV 2025] Video‑T1: Test‑Time Scaling for Video Generation
- [ICCV 2025] AnimateYourMesh: Feed‑Forward 4D Foundation Model for Text‑Driven Mesh Animation
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation
- [ICLR 2025] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- [ICLR 2025] Pyramidal Flow Matching for Efficient Video Generative Modeling
- LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
- S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
- LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
- Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
- QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots
- PoseGuard: Pose-Guided Generation with Safety Guardrails
- GV-VAD: Exploring Video Generation for Weakly-Supervised Video Anomaly Detection
- GVD: Guiding Video Diffusion Model for Scalable Video Distillation
- Compositional Video Synthesis by Temporal Object-Centric Learning
- Enhancing Scene Transition Awareness in Video Generation via Post-Training
- Yume: An Interactive World Generation Model
- EndoGen: Conditional Autoregressive Endoscopic Video Generation
- MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
- PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
- TokensGen: Harnessing Condensed Tokens for Long Video Generation
- Conditional Video Generation for High-Efficiency Video Compression
- Taming Diffusion Transformer for Real-Time Mobile Video Generation
- LoViC: Efficient Long Video Generation with Context Compression
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving
- NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective
- Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions
- Scaling RL to Long Videos
- PromptTea: Let Prompts Tell TeaCache the Optimal Threshold
- Bridging Sequential Deep Operator Network and Video Diffusion: Residual Refinement of Spatio-Temporal PDE Solutions
- Omni-Video: Democratizing Unified Video Understanding and Generation
- Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
- MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
- PresentAgent: Multimodal Agent for Presentation Video Generation
- StreamDiT: Real-Time Streaming Text-to-Video Generation
- RefTok: Reference-Based Tokenization for Video Generation
- Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
- Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation
- LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
- LLM-based Realistic Safety-Critical Driving Video Generation
- Geometry-aware 4D Video Generation for Robot Manipulation
- Populate-A-Scene: Affordance-Aware Human Video Generation
- FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
- Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
- GenHSI: Controllable Generation of Human-Scene Interaction Videos
- SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution
- Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
- RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
- Emergent Temporal Correspondences from Video Diffusion Transformers
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition
- FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
- PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
- Causally Steered Diffusion for Automated Video Counterfactual Generation
- VideoMAR: Autoregressive Video Generation with Continuous Tokens
- M4V: Multi-Modal Mamba for Text-to-Video Generation
- GigaVideo‑1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
- DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
- Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- MagCache: Fast Video Generation with Magnitude-Aware Cache
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models
- Self Forcing: Bridging the Train‑Test Gap in Autoregressive Video Diffusion
- From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models
- Frame Guidance: Training‑Free Guidance for Frame‑Level Control in Video Diffusion Models
- Hi‑VAE: Efficient Video Autoencoding with Global and Detailed Motion
- ContentV: Efficient Training of Video Generation Models with Limited Compute
- Astraea: A GPU‑Oriented Token‑wise Acceleration Framework for Video Diffusion Transformers
- FPSAttention: Training-Aware FP8 and Sparsity Co‑Design for Fast Video Diffusion
- LayerFlow: A Unified Model for Layer‑Aware Video Generation
- FullDiT2: Efficient In‑Context Conditioning for Video Diffusion Transformers
- DenseDPO: Fine‑Grained Temporal Preference Optimization for Video Diffusion Models
- Chipmunk: Training‑Free Acceleration of Diffusion Transformers with Dynamic Column‑Sparse Deltas
- Context as Memory: Scene‑Consistent Interactive Long Video Generation with Memory Retrieval
- CamCloneMaster: Enabling Reference‑based Camera Control for Video Generation
- Dual‑Expert Consistency Model for Efficient and High‑Quality Video Generation
- Sparse‑vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
- LumosFlow: Motion‑Guided Long Video Generation
- Motion aware video generative model
- Many‑for‑Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
- OpenS2V‑Nexus: A Detailed Benchmark and Million‑Scale Dataset for Subject‑to‑Video Generation
- Wan: Open and Advanced Large‑Scale Video Generative Models
- [CVPR 2024] Make Pixels Dance: High-Dynamic Video Generation
- [CVPR 2024] VGen: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- [CVPR 2024] GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
- [CVPR 2024] SimDA: Simple Diffusion Adapter for Efficient Video Generation
- [CVPR 2024] MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- [CVPR 2024] Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
- [CVPR 2024] PEEKABOO: Interactive Video Generation via Masked-Diffusion
- [CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
- [CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- [CVPR 2024] BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
- [CVPR 2024] Mind the Time: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
- [CVPR 2024] MotionDirector: Motion Customization of Text-to-Video Diffusion Models
- [CVPR 2024] Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation
- [CVPR 2024] DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
- [CVPR 2024] Grid Diffusion Models for Text-to-Video Generation
- [ECCV 2024] Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
- [ECCV 2024] W.A.L.T.: Photorealistic Video Generation with Diffusion Models
- [ECCV 2024] MoVideo: Motion-Aware Video Generation with Diffusion Models
- [ECCV 2024] DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
- [ECCV 2024] MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
- [ECCV 2024] HARIVO: Harnessing Text-to-Image Models for Video Generation
- [ECCV 2024] MEVG: Multi-event Video Generation with Text-to-Video Models
- [NeurIPS 2024] DEMO: Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
- [ICML 2024] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- [ICLR 2024] VDT: General-purpose Video Diffusion Transformers via Mask Modeling
- [ICLR 2024] VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation
- [AAAI 2024] Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
- [AAAI 2024] E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning
- [AAAI 2024] ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation
- [AAAI 2024] F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis
- Gender Bias in Text-to-Video Generation Models: A case study of Sora
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
- Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training
- DirectorLLM for Human-Centric Video Generation
- Can Video Generation Replace Cinematographers? Research on the Cinematic Language of Generated Video
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- T-SVG: Text-Driven Stereoscopic Video Generation
- Mojito: Motion Trajectory and Intensity Control for Video Generation
- SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
- Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation
- STIV: Scalable Text and Image Conditioned Video Generation
- GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
- CPA: Camera-pose-awareness Diffusion Transformer for Video Generation
- MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation
- Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop
- Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
- DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
- InTraGen: Trajectory-controlled Video Generation for Object Interactions
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
- Motion Control for Enhanced Complex Action Video Generation
- GameGen-X: Interactive Open-world Game Video Generation
- Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation
- Animating the Past: Reconstruct Trilobite via Video Generation
- ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
- The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
- Compositional 3D-aware Video Generation with LLM Director
- Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
- FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
- Still-Moving: Customized Video Generation without Customized Video Data
- VEnhancer: Generative Space-Time Enhancement for Video Generation
- Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task
- VIMI: Grounding Video Generation through Multi-modal Instruction
- GVDIFF: Grounded Text-to-Video Generation with Diffusion Models
- Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
- Text-Animator: Controllable Visual Text Video Generation
- MotionBooth: Motion-Aware Customized Text-to-Video Generation
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation
- Compositional Video Generation as Flow Equalization
- MotionClone: Training-Free Motion Cloning for Controllable Video Generation
- VideoTetris: Towards Compositional Text-To-Video Generation
- VideoPhy: Evaluating Physical Commonsense for Video Generation
- I4VGen: Image as Free Stepping Stone for Text-to-Video Generation
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control
- The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective
- TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
- MotionMaster: Training-free Camera Motion Transfer For Video Generation
- ConCLVD: Controllable Chinese Landscape Video Generation via Diffusion Model
- MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
- Grid Diffusion Models for Text-to-Video Generation
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
- S2DM: Sector-Shaped Diffusion Models for Video Generation
- Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
- [CVPR 2023] Align your Latents: High-resolution Video Synthesis with Latent Diffusion Models
- [CVPR 2023] Text2Video-Zero: Text-to-image Diffusion Models are Zero-shot Video Generators
- [CVPR 2023] Video Probabilistic Diffusion Models in Projected Latent Space
- [ICCV 2023] PYOCO: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
- [ICCV 2023] Gen-1: Structure and Content-guided Video Synthesis with Diffusion Models
- [NeurIPS 2023] UniPi: Learning Universal Policies via Text-Guided Video Generation
- [NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
- [ICLR 2023] CogVideo: Large-scale Pretraining for Text-to-video Generation via Transformers
- [ICLR 2023] Make-A-Video: Text-to-video Generation without Text-video Data
- [ICLR 2023] Phenaki: Variable Length Video Generation From Open Domain Textual Description
- FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- Photorealistic Video Generation with Diffusion Models
- GenTron: Diffusion Transformers for Image and Video Generation
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
- ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models
- MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
- GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
- Make Pixels Dance: High-Dynamic Video Generation
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models
- POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
- Dual-Stream Diffusion Net for Text-to-Video Generation
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
- Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
- ControlVideo: Training-free Controllable Text-to-Video Generation
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
- [CVPR 2025] MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
- [CVPR 2025] MotionPro: A Precise Motion Controller for Image-to-Video Generation
- [CVPR 2025] Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
- [CVPR 2025] Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
- [CVPR 2025] I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models
- [CVPR 2025] LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
- [ICCV 2025] AnyI2V: Animating Any Conditional Image with Motion Control
- [ICCV 2025] Versatile Transition Generation with Image-to-Video Diffusion
- [ICCV 2025] TIP‑I2V: A Million‑Scale Real Text and Image Prompt Dataset for Image‑to‑Video Generation
- [ICCV 2025] Unified Video Generation via Next‑Set Prediction in Continuous Domain
- [NeurIPS 2025] GenRec: Unifying Video Generation and Recognition with Diffusion Models
- [ICCV 2025] Precise Action‑to‑Video Generation Through Visual Action Prompts
- [ICCV 2025] STIV: Scalable Text and Image Conditioned Video Generation
- [ICLR 2025] FrameBridge: Improving Image‑to‑Video Generation with Bridge Models
- [ICLR 2025] SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
- [ICLR 2025] Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
- [ICLR 2025] Pyramidal Flow Matching for Efficient Video Generative Modeling
- Physics‑Grounded Motion Forecasting via Equation Discovery for Trajectory‑Guided Image‑to‑Video Generation
- Enhancing Motion Dynamics of Image‑to‑Video Models via Adaptive Low‑Pass Guidance
- Frame In‑N‑Out: Unbounded Controllable Image‑to‑Video Generation
- Dynamic‑I2V: Exploring Image‑to‑Video Generation Models via Multimodal LLM
- Order Matters: On Parameter‑Efficient Image‑to‑Video Probing for Recognizing Nearly Symmetric Actions
- EvAnimate: Event‑Conditioned Image‑to‑Video Generation for Human Animation
- Step‑Video‑TI2V Technical Report: A State‑of‑the‑Art Text‑Driven Image‑to‑Video Generation Model
- DreamInsert: Zero‑Shot Image‑to‑Video Object Insertion from A Single Image
- I2V3D: Controllable image‑to‑video generation with 3D guidance
- Extrapolating and Decoupling Image‑to‑Video Generation Models: Motion Modeling Is Easier Than You Think
- Object‑Centric Image‑to‑Video Generation with Language Guidance
- VidCRAFT3: Camera, Object, and Lighting Control for Image‑to‑Video Generation
- MotionCanvas: Cinematic Shot Design with Controllable Image‑to‑Video Generation
- Through‑The‑Mask: Mask‑based Motion Trajectories for Image‑to‑Video Generation
- [CVPR 2024] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- [CVPR 2024] Your Image Is My Video: Reshaping the Receptive Field via Image-to-Video Differentiable AutoAugmentation and Fusion
- [CVPR 2024] TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
- [CVPR 2024] Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning
- [ECCV 2024] MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
- [ECCV 2024] $R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
- [ECCV 2024] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
- [ECCV 2024] Rethinking Image-to-Video Adaptation: An Object-Centric Perspective
- [NeurIPS 2024] TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation
- [NeurIPS 2024] Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
- [ICML 2024] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- [SIGGRAPH 2024] I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
- [SIGGRAPH 2024] Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
- [AAAI 2024] Continuous Piecewise-Affine Based Motion Model for Image Animation
- OmniDrag: Enabling Motion Control for Omnidirectional Image‑to‑Video Generation
- CamI2V: Camera‑Controlled Image‑to‑Video Diffusion Model
- Identifying and Solving Conditional Image Leakage in Image‑to‑Video Diffusion Model
- CamCo: Camera‑Controllable 3D‑Consistent Image‑to‑Video Generation
- CamViG: Camera Aware Image‑to‑Video Generation with Multimodal Transformers
- $R^2$‑Tuning: Efficient Image‑to‑Video Transfer Learning for Video Temporal Grounding
- TRIP: Temporal Residual Learning with Image Noise Prior for Image‑to‑Video Diffusion Models
- Your Image is My Video: Reshaping the Receptive Field via Image‑To‑Video Differentiable AutoAugmentation and Fusion
- Tuning‑Free Noise Rectification for High Fidelity Image‑to‑Video Generation
- AtomoVideo: High Fidelity Image‑to‑Video Generation
- ConsistI2V: Enhancing Visual Consistency for Image‑to‑Video Generation
- AIGCBench: Comprehensive Evaluation of Image‑to‑Video Content Generated by AI
- [CVPR 2025] VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
- [CVPR 2025] VideoDirector: Precise Video Editing via Text-to-Video Models
- [CVPR 2025] VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
- [CVPR 2025] Align-A-Video: Deterministic Reward Tuning of Image Diffusion Models for Consistent Video Editing
- [CVPR 2025] Unity in Diversity: Video Editing via Gradient-Latent Purification
- [CVPR 2025] VEU-Bench: Towards Comprehensive Understanding of Video Editing
- [CVPR 2025] SketchVideo: Sketch-based Video Generation and Editing
- [CVPR 2025] FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
- [CVPR 2025] Visual Prompting for One-shot Controllable Video Editing without Inversion
- [CVPR 2025] FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
- [ICCV 2025] Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
- [ICCV 2025] DIVE: Taming DINO for Subject-Driven Video Editing
- [ICCV 2025] DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors
- [ICCV 2025] QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing
- [ICCV 2025] Teleportraits: Training-Free People Insertion into Any Scene
- [ICLR 2025] VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing
- [AAAI 2025] FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
- [AAAI 2025] EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
- [AAAI 2025] VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment
- [AAAI 2025] Re-Attentional Controllable Video Diffusion Editing
- [WACV 2025] IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion
- [WACV 2025] SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
- [WACV 2025] MagicStick: Controllable Video Editing via Control Handle Transformations
- [WACV 2025] Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior
- [WACV 2025] FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
- Consistent Video Editing as Flow‑Driven Image‑to‑Video Generation
- UNIC: Unified In‑Context Video Editing
- DreamVE: Unified Instruction‑based Image and Video Editing
- Controllable Pedestrian Video Editing for Multi‑View Driving Scenarios via Motion Sequence
- Low‑Cost Test‑Time Adaptation for Robust Video Editing
- From Long Videos to Engaging Clips: A Human‑Inspired Video Editing Framework with Multimodal Narrative Understanding
- STR‑Match: Matching SpatioTemporal Relevance Score for Training‑Free Video Editing
- Shape‑for‑Motion: Precise and Consistent Video Editing with 3D Proxy
- DFVEdit: Conditional Delta Flow Vector for Zero‑shot Video Editing
- Good Noise Makes Good Edits: A Training‑Free Diffusion‑Based Video Editing with Image and Text Prompts
- LoRA‑Edit: Controllable First‑Frame‑Guided Video Editing via Mask‑Aware LoRA Fine‑Tuning
- TV‑LiVE: Training‑Free, Text‑Guided Video Editing via Layer Informed Vitality Exploitation
- FADE: Frequency‑Aware Diffusion Model Factorization for Video Editing
- FlowDirector: Training‑Free Flow Steering for Precise Text‑to‑Video Editing
- FullDiT2: Efficient In‑Context Conditioning for Video Diffusion Transformers
- OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation
- Motion‑Aware Concept Alignment for Consistent Video Editing
- Zero‑to‑Hero: Zero‑Shot Initialization Empowering Reference‑Based Video Appearance Editing
- REGen: Multimodal Retrieval‑Embedded Generation for Long‑to‑Short Video Editing
- From Shots to Stories: LLM‑Assisted Video Editing with Unified Language Representations
- DAPE: Dual‑Stage Parameter‑Efficient Fine‑Tuning for Consistent Video Editing with Diffusion Models
- Photoshop Batch Rendering Using Actions for Stylistic Video Editing
- Efficient Temporal Consistency in Diffusion‑Based Video Editing with Adaptor Modules: A Theoretical Framework
- Vidi: Large Multimodal Models for Video Understanding and Editing
- Visual Prompting for One‑Shot Controllable Video Editing without Inversion
- CamMimic: Zero‑Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models
- VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
- Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology‑Inspired Computing Methods
- InstructVEdit: A Holistic Approach for Instructional Video Editing
- HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks
- VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation
- GIFT: Generated Indoor video frames for Texture‑less point tracking
- RASA: Replace Anyone, Say Anything — A Training‑Free Framework for Audio‑Driven and Universal Portrait Video Editing
- V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes
- Alias‑Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
- VACE: All‑in‑One Video Creation and Editing
- Get In Video: Add Anything You Want to the Video
- VideoPainter: Any‑length Video Inpainting and Editing with Plug‑and‑Play Context Control
- VideoGrain: Modulating Space‑Time Attention for Multi‑grained Video Editing
- VideoDiff: Human‑AI Video Co‑Creation with Alternatives
- SportsBuddy: Designing and Evaluating an AI‑Powered Sports Video Storytelling Tool Through Real‑World Deployment
- AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming and Keyframe Selection
- MotionCanvas: Cinematic Shot Design with Controllable Image‑to‑Video Generation
- SST‑EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
- IP‑FaceDiff: Identity‑Preserving Facial Video Editing with Diffusion
- Qffusion: Controllable Portrait Video Editing via Quadrant‑Grid Attention Learning
- Text‑to‑Edit: Controllable End‑to‑End Video Ad Creation via Multimodal LLMs
- Enhancing Low‑Cost Video Editing with Lightweight Adaptors and Temporal‑Aware Inversion
- Edit as You See: Image‑Guided Video Editing via Masked Motion Modeling
- [CVPR 2024] A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- [CVPR 2024] VidToMe: Video Token Merging for Zero-Shot Video Editing
- [CVPR 2024] Video-P2P: Video Editing with Cross-Attention Control
- [CVPR 2024] CCEdit: Creative and Controllable Video Editing via Diffusion Models
- [CVPR 2024] RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- [CVPR 2024] DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
- [CVPR 2024] MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- [CVPR 2024] MotionEditor: Editing Video Motion via Content-Aware Diffusion
- [CVPR 2024] CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-Driven Video Editing
- [ICLR 2024] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
- [ICLR 2024] Video Decomposition Prior: Editing Videos Layer by Layer
- [ICLR 2024] FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
- [ICLR 2024] TokenFlow: Consistent Diffusion Features for Consistent Video Editing
- [ECCV 2024] VIDEOSHOP: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
- [ECCV 2024] WAVE: Warping DDIM Inversion Features for Zero-Shot Text-to-Video Editing
- [ECCV 2024] DreamMotion: Space-Time Self-similar Score Distillation for Zero-Shot Video Editing
- [ECCV 2024] Object-Centric Diffusion for Efficient Video Editing
- [ECCV 2024] Video Editing via Factorized Diffusion Distillation
- [ECCV 2024] SAVE: Protagonist Diversification with Structure Agnostic Video Editing
- [ECCV 2024] DNI: Dilutional Noise Initialization for Diffusion Video Editing
- [ECCV 2024] MagDiff: Multi-alignment Diffusion for High-Fidelity Video Generation and Editing
- [ECCV 2024] DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
- MAKIMA: Tuning‑free Multi‑Attribute Open‑domain Video Editing via Mask‑Guided Attention Modulation
- DriveEditor: A Unified 3D Information‑Guided Framework for Controllable Object Editing in Driving Scenes
- Re‑Attentional Controllable Video Diffusion Editing
- MoViE: Mobile Diffusion for Video Editing
- DIVE: Taming DINO for Subject‑Driven Video Editing
- Trajectory Attention for Fine‑grained Video Motion Control
- VideoDirector: Precise Video Editing via Text‑to‑Video Models
- StableV2V: Stablizing Shape Consistency in Video‑to‑Video Editing
- OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models
- A Reinforcement Learning‑Based Automatic Video Editing Method Using Pre‑trained Vision‑Language Model
- Taming Rectified Flow for Inversion and Editing
- AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
- Shaping a Stabilized Video by Mitigating Unintended Changes for Concept‑Augmented Video Editing
- RNA: Video Editing with ROI‑based Neural Atlas
- FreeMask: Rethinking the Importance of Attention Masks for Zero‑Shot Video Editing
- DNI: Dilutional Noise Initialization for Diffusion Video Editing
- Blended Latent Diffusion under Attention Control for Real‑World Video Editing
- DeCo: Decoupled Human‑Centered Diffusion Video Editing with Motion Consistency
- InVi: Object Insertion In Videos Using Off‑the‑Shelf Diffusion Models
- MVOC: A Training‑Free Multiple Video Object Composition Method with Diffusion Models
- VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
- FRAG: Frequency Adapting Group for Diffusion Video Editing
- Zero‑Shot Video Editing through Adaptive Sliding Score Distillation
- Ada‑VE: Training‑Free Consistent Video Editing Using Adaptive Motion Prior
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
- Temporally Consistent Object Editing in Videos using Extended Attention
- MotionFollower: Editing Video Motion via Lightweight Score‑Guided Diffusion
- Streaming Video Diffusion: Online Video Editing with Diffusion Models
- I2VEdit: First‑Frame‑Guided Video Editing via Image‑to‑Video Diffusion Models
- ReVideo: Remake a Video with Motion and Content Control
- Slicedit: Zero‑Shot Video Editing With Text‑to‑Image Diffusion Models Using Spatio‑Temporal Slices
- GenVideo: One‑shot Target‑image and Shape Aware Video Editing using T2I Diffusion Models
- Ctrl‑Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
- S3Editor: A Sparse Semantic‑Disentangled Self‑Training Framework for Face Video Editing
- ExpressEdit: Video Editing with Natural Language and Sketching
- EVA: Zero‑shot Accurate Attributes and Multi‑Object Video Editing
- Edit3K: Universal Representation Learning for Video Editing Components
- Videoshop: Localized Semantic Video Editing with Noise‑Extrapolated Diffusion Inversion
- AnyV2V: A Tuning‑Free Framework For Any Video‑to‑Video Editing Tasks
- DreamMotion: Space‑Time Self‑Similar Score Distillation for Zero‑Shot Video Editing
- EffiVED: Efficient Video Editing via Text‑instruction Diffusion Models
- AICL: Action In‑Context Learning for Video Diffusion Model
- Video Editing via Factorized Diffusion Distillation
- VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
- FastVideoEdit: Leveraging Consistency Models for Efficient Text‑to‑Video Editing
- Place Anything into Any Video
- UniEdit: A Unified Tuning‑Free Framework for Video Motion and Appearance Editing
- Anything in Any Scene: Photorealistic Video Object Insertion
- Object‑Centric Diffusion for Efficient Video Editing
- VASE: Object‑Centric Appearance and Shape Manipulation of Real Videos
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
- [CVPR 2025] IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner
- [CVPR 2025] AnimateAnything: Consistent and Controllable Animation for Video Generation
- [CVPR 2025] Customized Condition Controllable Generation for Video Soundtrack
- [CVPR 2025] StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
- [ICCV 2025] Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
- [ICCV 2025] MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
- [ICCV 2025] MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
- [ICCV 2025] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
- [ICCV 2025] Free-Form Motion Control (SynFMC): Controlling the 6D Poses of Camera and Objects in Video Generation
- [ICCV 2025] RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
- [ICCV 2025] MagicMotion: Video Generation with a Smart Director
- [ICCV 2025] UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
- [ICLR 2025] MotionClone: Training-Free Motion Cloning for Controllable Video Generation
- [AAAI 2025] CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation
- [AAAI 2025] TrackGo: A Flexible and Efficient Method for Controllable Video Generation
- [WACV 2025] Fine-grained Controllable Video Generation via Object Appearance and Context
- IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation
- ATI: Any Trajectory Instruction for Controllable Video Generation
- CamContextI2V: Context‑aware Controllable Video Generation
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
- MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
- Controllable Video Generation with Provable Disentanglement
- [CVPR 2025] KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
- [CVPR 2025] AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
- [CVPR 2025] MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
- [CVPR 2025] Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
- [CVPR 2025] INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
- [ICCV 2025] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
- [ICCV 2025] GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars
- [ICCV 2025] ACTalker: Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
- [ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
- [ICLR 2025] Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
- [ICLR 2025] CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation
- [AAAI 2025] EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
- [AAAI 2025] PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis
- Scaling Up Audio‑Synchronized Visual Animation: An Efficient Training Paradigm
- SpA2V: Harnessing Spatial Auditory Cues for Audio‑driven Spatially‑aware Video Generation
- OmniAvatar: Efficient Audio‑Driven Avatar Video Generation with Adaptive Body Animation
- InterActHuman: Multi‑Concept Human Animation with Layout‑Aligned Audio Conditions
- AlignHuman: Improving Motion and Fidelity via Timestep‑Segment Preference Optimization for Audio‑Driven Human Animation
- Audio‑Sync Video Generation with Multi‑Stream Temporal Control
- LLIA — Enabling Low‑Latency Interactive Avatars: Real‑Time Audio‑Driven Portrait Video Generation with Diffusion Models
- TalkingMachines: Real‑Time Audio‑Driven FaceTime‑Style Video via Autoregressive Diffusion Models
- [CVPR 2024] FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
- [ECCV 2024] UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model
- [ECCV 2024] Audio-Driven Talking Face Generation with Stabilized Synchronization Loss
- [NeurIPS 2024] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
- AV‑Link: Temporally‑Aligned Diffusion Features for Cross‑Modal Audio‑Video Generation
- SAVGBench: Benchmarking Spatially Aligned Audio‑Video Generation
- SINGER: Vivid Audio‑driven Singing Video Generation with Multi‑scale Spectral Diffusion Model
- SyncFlow: Toward Temporally Aligned Joint Audio‑Video Generation from Text
- FLOAT: Generative Motion Latent Flow Matching for Audio‑driven Talking Portrait
- Stereo‑Talker: Audio‑driven 3D Human Synthesis with Prior‑Guided Mixture‑of‑Experts
- A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
- DiffTED: One‑shot Audio‑driven TED Talk Video Generation with Diffusion‑based Co‑speech Gestures
- [CVPR 2025] X-Dyna: Expressive Dynamic Human Image Animation
- [CVPR 2025] StableAnimator: High-Quality Identity-Preserving Human Image Animation
- [CVPR 2025] Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
- [ICCV 2025] DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
- [ICCV 2025] Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
- [ICCV 2025] Multi-identity Human Image Animation with Structural Video Diffusion
- [ICCV 2025] OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
- [ICCV 2025] AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
- [ICCV 2025] Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
- [ICLR 2025] Animate-X: Universal Character Image Animation with Enhanced Motion Representation
- StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation
- HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions
- MTVCrafter: 4D Motion Tokenization for Open‑World Human Image Animation
- TT‑DF: A Large‑Scale Diffusion‑Based Dataset and Benchmark for Human Body Forgery Detection
- AnimateAnywhere: Rouse the Background in Human Image Animation
- UniAnimate‑DiT: Human Image Animation with Large‑Scale Video Diffusion Transformer
- Taming Consistency Distillation for Accelerated Human Image Animation
- Multi‑identity Human Image Animation with Structural Video Diffusion
- DreamActor‑M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High‑Quality Human Image Animation
- EvAnimate: Event‑conditioned Image‑to‑Video Generation for Human Animation
- [CVPR 2024] MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion
- [CVPR 2024] MotionEditor: Editing Video Motion via Content-Aware Diffusion
- [CVPR 2024] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- [ECCV 2024] Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
- [NeurIPS 2024] HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
- [NeurIPS 2024] TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation
- [ICLR 2024] DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
- DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
- High Quality Human Image Animation using Regional Supervision and Motion Blur Condition
- Dormant: Defending against Pose-driven Human Image Animation
- TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation
- [CVPR 2025] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- [CVPR 2025] CausVid: From Slow Bidirectional to Fast Autoregressive VDMs
- [CVPR 2025] BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
- [ICCV 2025] AdaCache: Adaptive Caching for Faster Video Generation with Diffusion Transformers
- [ICCV 2025] TaylorSeer: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
- [ICCV 2025] Accelerating Diffusion Transformer via Gradient-Optimized Cache
- [ICCV 2025] V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
- [ICCV 2025] DMDX: Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
- [ICCV 2025] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for DiT
- [ICLR 2025] FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
- [ICML 2025] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- [ICML 2025] Fast Video Generation with Sliding Tile Attention
- [ICML 2025] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
- [ICML 2025] AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- Less is Enough: Training‑Free Video Diffusion Acceleration via Runtime‑Adaptive Caching
- Compact Attention: Exploiting Structured Spatio‑Temporal Sparsity for Fast Video Generation
- MagCache: Fast Video Generation with Magnitude‑Aware Cache
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- SuperGen: An Efficient Ultra‑high‑resolution Video Generation System with Sketching and Tiling
- MixCache: Mixture‑of‑Cache for Video Diffusion Transformer Acceleration
- SwiftVideo: A Unified Framework for Few‑Step Video Generation through Trajectory‑Distribution Alignment
- Taming Diffusion Transformer for Real‑Time Mobile Video Generation
- Sparse‑vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
- SRDiffusion: Accelerate Video Diffusion Inference via Sketching‑Rendering Cooperation
- Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic‑Aware Permutation
- DVD‑Quant: Data‑free Video Diffusion Transformers Quantization
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
- Region Masking to Accelerate Video Processing on Neuromorphic Hardware
- DSV: Exploiting Dynamic Sparsity to Accelerate Large‑Scale Video DiT Training
- [CVPR 2024] Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- [NeurIPS 2024] Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
- [NeurIPS 2024] Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
- [NeurIPS 2024] Fast and Memory-Efficient Video Diffusion Using Streamlined Inference
- [IJCAI 2024] FasterVD: On Acceleration of Video Diffusion Models
- Accelerating Video Diffusion Models via Distribution Matching
- Adaptive Caching for Faster Video Generation with Diffusion Transformers
- OSV: One Step is Enough for High-Quality Image to Video Generation
- HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions
- AnimateDiff-Lightning: Cross-Model Diffusion Distillation
- [ICCV 2025] LongAnimation: Long Animation Generation with Dynamic Global‑Local Memory
- [ICLR 2025] DartControl: A Diffusion‑Based Autoregressive Motion Model for Real‑Time Text‑Driven Motion Control
- [ICLR 2025] FLIP: Flow‑Centric Generative Planning as General‑Purpose Manipulation World Model
- [CVPR 2025] VideoDPO: Omni‑Preference Alignment for Video Diffusion Generation
- LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
- VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
- EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi‑Modal and Multi‑Task Human Animation
- Video Perception Models for 3D Scene Synthesis
- RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
- VQ‑Insight: Teaching VLMs for AI‑Generated Video Quality Understanding via Progressive Visual Reinforcement Learning
- Toward Rich Video Human‑Motion2D Generation
- AlignHuman: Improving Motion and Fidelity via Timestep‑Segment Preference Optimization for Audio‑Driven Human Animation
- Multimodal Large Language Models: A Survey
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- ContentV: Efficient Training of Video Generation Models with Limited Compute
- Photography Perspective Composition: Towards Aesthetic Perspective Recommendation
- Scaling Image and Video Generation via Test‑Time Evolutionary Search
- InfLVG: Reinforce Inference‑Time Consistent Long Video Generation with GRPO
- AvatarShield: Visual Reinforcement Learning for Human‑Centric Video Forgery Detection
- RLVR‑World: Training World Models with Reinforcement Learning
- Diffusion‑NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models
- DanceGRPO: Unleashing GRPO on Visual Generation
- VideoHallu: Evaluating and Mitigating Multi‑modal Hallucinations on Synthetic Video Understanding
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
- SkyReels‑V2: Infinite‑length Film Generative Model
- FingER: Content Aware Fine‑grained Evaluation with Reasoning for AI‑Generated Videos
- Aligning Anime Video Generation with Human Feedback
- Discriminator‑Free Direct Preference Optimization for Video Diffusion
- Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments
- OmniCam: Unified Multimodal Video Generation via Camera Control
- VPO: Aligning Text‑to‑Video Generation Models with Prompt Optimization
- Zero‑Shot Human‑Object Interaction Synthesis with Multimodal Priors
- Judge Anything: MLLM as a Judge Across Any Modality
- MagicID: Hybrid Preference Optimization for ID‑Consistent and Dynamic‑Preserved Video Customization
- Unified Reward Model for Multimodal Understanding and Generation
- Pre‑Trained Video Generative Models as World Simulators
- Harness Local Rewards for Global Benefits: Effective Text‑to‑Video Generation Alignment with Patch‑level Reward Models
- IPO: Iterative Preference Optimization for Text‑to‑Video Generation
- MJ‑VIDEO: Fine‑Grained Benchmarking and Rewarding Video Preferences in Video Generation
- HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human‑Centric Alignment
- Zeroth‑order Informed Fine‑Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
- Improving Video Generation with Human Feedback
- VisionReward: Fine‑Grained Multi‑Dimensional Human Preference Learning for Image and Video Generation
- OnlineVPO: Align Video Diffusion Model with Online Video‑Centric Preference Optimization
- The Matrix: Infinite‑Horizon World Generation with Real‑Time Moving Control
- Improving Dynamic Object Interactions in Text‑to‑Video Generation with AI Feedback
- Free$^2$Guide: Gradient‑Free Path Integral Control for Enhancing Text‑to‑Video Generation with Large Vision‑Language Models
- A Reinforcement Learning‑Based Automatic Video Editing Method Using Pre‑trained Vision‑Language Model
- Video to Video Generative Adversarial Network for Few‑shot Learning Based on Policy Gradient
- WorldSimBench: Towards Video Generation Models as World Simulators
- Animating the Past: Reconstruct Trilobite via Video Generation
- VideoAgent: Self‑Improving Video Generation
- E‑Motion: Future Motion Simulation via Event Sequence Diffusion
- SePPO: Semi‑Policy Preference Optimization for Diffusion Alignment
- Video Diffusion Alignment via Reward Gradients
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
- AdaDiff: Adaptive Step Selection for Fast Diffusion Models
QuenithAI is a professional organization composed of top researchers, dedicated to providing high-quality 1-on-1 research mentoring for university students worldwide. Our mission is to help students bridge the gap from theoretical knowledge to cutting-edge research and publish their work in top-tier conferences and journals.
Maintaining this Awesome Video Generation list requires significant effort, just as completing a high-quality paper requires focused dedication and expert guidance. If you are looking for one-on-one support from top scholars on your own research project, to quickly identify innovative ideas and turn them into publications, we invite you to contact us.
➡️ Contact us via WeChat or E-mail to start your research journey.
QuenithAI (应达学术) is a professional organization made up of top researchers, dedicated to providing high-quality 1-on-1 research mentoring for university students worldwide. Our mission is to help students develop outstanding research skills and publish their work in top-tier conferences and journals.
Maintaining a GitHub survey repository takes enormous effort, just as completing a high-quality paper depends on focused dedication and professional guidance. If you would like one-on-one support from top scholars on your own research project, we sincerely invite you to get in touch.
➡️ Contact us via WeChat or E-mail to start your research journey.
Contributions are welcome! Please see our Contribution Guidelines for details on how to add new papers, correct information, or improve the repository.
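For quick reference, a new paper entry can simply follow the bullet style already used throughout this list; the snippet below is only an illustrative template, and the venue tag and title are placeholders rather than a real paper:

```markdown
<!-- Illustrative template only: replace the venue tag and title with the actual paper,
     and place the entry in the section and year block where it belongs -->
- [CVPR 2025] Your Paper Title Here
```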
Join our community to stay up-to-date with the latest advancements, share your work, and collaborate with other researchers and developers in the field of video generation.