Soon after launching its most powerful photorealistic image generation model, Google has unveiled VideoPoet, its latest text-to-video AI model. Like most other video AI models, it can not only create videos from text prompts but also edit them seamlessly.

According to Google, VideoPoet supports several generation tasks, including text-to-video, image-to-video, and even video-to-video. It is also capable of video stylization, video inpainting and outpainting, and video-to-audio generation, all of which are available in rival AI models as well. However, unlike competing systems, which combine several smaller models, VideoPoet is powered by a single large language model (LLM).

Behind the scenes, VideoPoet trains an autoregressive language model across video, image, audio, and text modalities using multiple tokenizers, including MAGVIT V2 for video and images and SoundStream for audio. Once the model has produced tokens conditioned on a given context, the tokenizer decoders convert those tokens back into viewable video and playable audio.
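The tokenize, generate, decode loop described above can be illustrated with a toy sketch. It is not Google's implementation: the vocabulary ranges, the `encode` and `decode` helpers, and the `toy_next_token` function standing in for the language model are all hypothetical, and real tokenizers such as MAGVIT V2 and SoundStream are learned neural networks rather than simple offsets.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: each modality owns a contiguous range of ids in one
# shared vocabulary. The sizes are illustrative, not taken from VideoPoet.
VOCAB_RANGES: Dict[str, Tuple[int, int]] = {
    "text": (0, 100),
    "video": (100, 200),
    "audio": (200, 300),
}


def encode(modality: str, codes: List[int]) -> List[int]:
    """Toy tokenizer: shift modality-local codes into the shared vocabulary."""
    start, end = VOCAB_RANGES[modality]
    return [start + (code % (end - start)) for code in codes]


def decode(tokens: List[int]) -> Dict[str, List[int]]:
    """Toy decoder: route each shared token id back to its modality-local code."""
    out: Dict[str, List[int]] = {name: [] for name in VOCAB_RANGES}
    for tok in tokens:
        for name, (start, end) in VOCAB_RANGES.items():
            if start <= tok < end:
                out[name].append(tok - start)
                break
    return out


def toy_next_token(context: List[int]) -> int:
    """Stand-in for the language model: picks a video token from the context sum."""
    start, end = VOCAB_RANGES["video"]
    return start + (sum(context) % (end - start))


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    num_new_tokens: int,
) -> List[int]:
    """Autoregressive loop: each new token is predicted from all earlier tokens."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


if __name__ == "__main__":
    prompt = encode("text", [7, 42, 3])          # a pretend text prompt
    new_tokens = generate(prompt, toy_next_token, 4)
    print(new_tokens)                            # shared-vocabulary token ids
    print(decode(new_tokens)["video"])           # codes a video decoder would consume
```

The point of the sketch is the shape of the pipeline: every modality is mapped into one token sequence, a single model extends that sequence one token at a time, and modality-specific decoders turn the result back into media.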

VideoPoet can generate videos of varying lengths, with a wide range of motions and styles determined by the text prompt. It can take a still image and animate it based on a user’s prompt. It can also predict optical flow and depth information for video stylization and create accompanying audio. Designed with modern content consumption trends in mind, VideoPoet defaults to portrait orientation to suit short-form video content.

Moreover, the AI video generator lets users control camera movements through prompts, and it can create video with audio as well. One example shows a cat playing a piano.

Google reports that VideoPoet underwent extensive evaluation: it was tested on various benchmarks, and its outputs were compared with those from other models. In these comparisons, users preferred VideoPoet’s creations in 24 to 35% of cases, citing a closer match to the given prompts than rival models such as Phenaki, VideoCrafter, and Show-1.

The tech giant envisions that VideoPoet could eventually facilitate “any-to-any” generation capabilities. This would potentially include extensions to text-to-audio, audio-to-video, and video captioning applications, among several other functionalities.