Stability AI has announced the first major update to its Stable Video Diffusion (SVD) model, bringing it to version 1.1.
This latest iteration aims to deliver AI-generated videos with improved motion and greater consistency. Continuing its commitment to accessibility, the updated model remains publicly available and can be downloaded from Hugging Face; commercial use, however, requires a Stability AI membership.
In December 2023, the company introduced a subscription-based service for commercial applications of its models, while maintaining open-source access for non-commercial purposes.
The updated model card for SVD 1.1 details its enhancements over the earlier SVD-XT version and notes that the model generates four-second videos of 25 frames at a resolution of 1024 x 576 pixels.
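For reference, a minimal sketch of running the model through Hugging Face's diffusers library is shown below. The model id, input file name, and inference settings are assumptions based on the public model repository rather than an official snippet from Stability AI, and the repository is gated, so the license must be accepted on the model page before the weights can be downloaded.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Download the SVD 1.1 weights from Hugging Face (assumed model id).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# SVD is an image-to-video model: generation is conditioned on a single
# frame, resized here to the 1024 x 576 resolution the model targets.
image = load_image("conditioning_frame.png").resize((1024, 576))

# Generate the 25 frames; decode_chunk_size trades VRAM for speed.
frames = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]

# Write out the roughly four-second clip.
export_to_video(frames, "generated.mp4", fps=6)
```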
The Video Diffusion model, developed by Stability AI, extends the Stable Diffusion image model and was trained on a carefully curated dataset of high-quality video content.
The training process comprised three stages: initial text-to-image pre-training, followed by video pre-training on a large collection of low-resolution videos, and finally video fine-tuning on a smaller dataset of high-resolution videos.
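To make that sequencing concrete, the staged curriculum can be sketched as an ordered list of training phases. Everything below is purely illustrative: the stage names, dataset descriptions, and the run_stage stub are hypothetical placeholders, not Stability AI's actual training code.

```python
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    data: str  # description of the dataset used in this stage

# The three stages run in order; each stage starts from the weights
# produced by the previous one.
CURRICULUM = [
    TrainingStage("text-to-image pre-training", "large image-text corpus"),
    TrainingStage("video pre-training", "large low-resolution video set"),
    TrainingStage("video fine-tuning", "small high-resolution video set"),
]

def run_stage(stage: TrainingStage) -> None:
    # Hypothetical stand-in for the actual training loop.
    print(f"Training stage '{stage.name}' on {stage.data}")

for stage in CURRICULUM:
    run_stage(stage)
```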
Stability AI reports that, at release, Stable Video Diffusion surpassed leading commercial models from RunwayML and Pika Labs in user preference. The company presented generated videos to human evaluators through a web interface, and the evaluators rated the videos on visual quality and adherence to the given prompts.
Meta's more recent video model, Emu Video, has meanwhile surpassed the offerings of both RunwayML and Pika Labs by a significant margin, making it arguably the strongest video model at present. Access to Emu Video, however, is limited to a research paper and a static web demonstration.
In their paper, the Stability AI researchers introduce a method for collecting large quantities of video data and turning extensive, unstructured video libraries into datasets suitable for training generative video models. The method is intended to streamline the development of a strong base model for video generation.
Stable Video Diffusion is designed to be easily adapted to a range of downstream tasks, such as multi-view synthesis from a single image via fine-tuning on multi-view datasets. Stability AI plans to build a suite of models on this foundation, mirroring the approach it took with Stable Diffusion.