Stable Video 4D 2.0: New Upgrades for High-Fidelity Novel Views and 4D Generation from a Single Video
Key Takeaways:
We’ve upgraded Stable Video Diffusion 4D (SV4D) to Stable Video 4D 2.0 (SV4D 2.0), delivering higher-quality outputs on real-world video.
Our analysis shows that SV4D 2.0 achieves state-of-the-art results in both 4D generation and novel-view synthesis.
Stable Video 4D 2.0 is now available for both commercial and non-commercial use under the permissive Stability AI Community License.
You can download the multi-view generation models on Hugging Face, find the code on GitHub, and read about the 4D asset reconstruction process on arXiv.
Stable Video 4D 2.0
We’ve upgraded Stable Video Diffusion 4D (SV4D) to Stable Video 4D 2.0 (SV4D 2.0), delivering higher-quality outputs on real-world video. This multi-view video diffusion model is ideal for dynamic 4D asset generation from a single object-centric video. These upgrades make it easier to create dynamic 4D assets for professional production workflows, from generating sprite sheets for in-game characters to supporting assets for film and virtual worlds.
Multi-view generation remains complex due to the inherent ambiguity of visualizing 3D objects from unseen views. This is especially difficult when subjects are in motion. SV4D 2.0 makes incremental progress toward addressing this challenge by producing consistent, multi-angle outputs without relying on large datasets, multi-camera setups, or preprocessing. While this represents a step forward, occasional artifacts may still appear with dynamic motion.
What’s new
We’ve made multiple upgrades to SV4D 2.0, including:
Sharper and More Coherent 4D Outputs: The model was trained in phases, starting with static 3D assets and then adding motion, yielding clearer and more consistent 4D results.
No Reference Views Required: Works directly from a single video, eliminating the need for multi-view reference images.
Redesigned Network Architecture: Uses 3D attention, a mechanism that fuses 3D spatial and temporal features, improving spatio-temporal consistency without relying on reference views (see the sketch after this list).
Improved Real-World Generalization: Although trained on synthetic data, the model retains world knowledge from the pre-trained video models it builds on, so it performs more reliably on real-world videos.
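To make the 3D attention idea concrete, here is a minimal, illustrative PyTorch sketch (not the actual SV4D 2.0 implementation; all names and shapes are hypothetical) that attends jointly across views, frames, and spatial tokens by flattening them into a single sequence:

```python
import torch
import torch.nn as nn

class Toy3DAttention(nn.Module):
    """Toy self-attention over the joint view x frame x spatial-token axis."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, frames, tokens, dim)
        b, v, f, t, d = x.shape
        # Flatten views, frames, and spatial tokens into one sequence so every
        # token can attend across viewpoints and time simultaneously.
        seq = x.reshape(b, v * f * t, d)
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        # Residual connection, then restore the original 5D layout.
        return (seq + out).reshape(b, v, f, t, d)

x = torch.randn(1, 4, 5, 16, 64)  # 4 views, 5 frames, 16 tokens, dim 64
print(Toy3DAttention(64)(x).shape)  # torch.Size([1, 4, 5, 16, 64])
```

The point of fusing the axes is that a single attention pass lets viewpoint and time constrain each other directly, rather than alternating separate spatial and temporal attention layers.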
Research and benchmarking
Our analysis shows that SV4D 2.0 achieves state-of-the-art results in 4D generation. It ranks first across all major benchmarks: LPIPS (Image fidelity), FVD-V (Multi-view consistency), FVD-F (Temporal coherence), and FV4D (4D consistency). Compared to DreamGaussian4D, L4GM, and SV4D, this version generates sharper and more consistent 4D outputs.
Our analysis also shows that SV4D 2.0 outperforms Diffusion^2, SV3D, and SV4D on novel-view synthesis. The model significantly improves multi-view consistency (FVD-V) and temporal coherence (FVD-F), maintaining high-quality outputs across both changing viewpoints and time. You can read more about the technical advancements of the model in the research paper.
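For readers who want to reproduce the image-fidelity measurement, LPIPS is available as the open-source `lpips` Python package (pip install lpips). The snippet below is a generic illustration of the metric, not the exact evaluation pipeline used in the paper:

```python
import torch
import lpips

# AlexNet backbone is the package's common default for perceptual distance.
loss_fn = lpips.LPIPS(net="alex")

# Images must be float tensors in [-1, 1] with shape (N, 3, H, W).
generated = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for a rendered view
reference = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for ground truth

score = loss_fn(generated, reference)
print(score.item())  # lower LPIPS means the output is perceptually closer
```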
Getting started
Stable Video 4D 2.0 is now available for both commercial and non-commercial use under the permissive Stability AI Community License.
You can download the multi-view generation models on Hugging Face, find the code on GitHub, and read about the 4D asset reconstruction process on arXiv.
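For example, the weights can be fetched programmatically with the huggingface_hub library. The repository id and filename below are placeholders, so check the model page on Hugging Face for the actual paths:

```python
from huggingface_hub import hf_hub_download

# Both values are hypothetical; substitute the real repo id and filename
# listed on the Stability AI model page.
weights_path = hf_hub_download(
    repo_id="stabilityai/sv4d2.0",
    filename="sv4d2.safetensors",
)
print(weights_path)  # local cache path of the downloaded checkpoint
```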
To stay updated on our progress, follow us on X, LinkedIn, Instagram, and join our Discord Community.