Meta AI Text-to-Video Generator – Everything You Must Know

Introduction: Beyond "Make-A-Video"
The global landscape of generative artificial intelligence has undergone a seismic shift, transitioning from the static generation of text and imagery to the dynamic, temporally complex domain of video synthesis. While 2022 and 2023 were defined by the rapid proliferation of Large Language Models (LLMs) and latent diffusion models for image creation, the current frontier is undeniably video. Within this high-stakes technological arms race, Meta Platforms Inc. has reasserted its technical and strategic dominance with the introduction of "Movie Gen," a suite of media foundation models that fundamentally redefines the capabilities of synthetic media.
To fully appreciate the magnitude of the Movie Gen announcement, it is essential to contextualize it within the broader trajectory of Meta’s artificial intelligence research. In late 2022, Meta unveiled "Make-A-Video," a pioneering research project that demonstrated the feasibility of text-to-video generation. Make-A-Video was a significant milestone, proving that the semantic understanding of CLIP-like models could be extended to temporal dimensions. However, it was emblematic of "first-wave" AI video: characterized by short durations, low frame rates, a lack of synchronized audio, and the visual artifacts—such as shimmering textures and morphing objects—inherent to early diffusion architectures. It was, in essence, a proof of concept; a research artifact signaling intent rather than a product capable of serving the rigorous demands of a social ecosystem inhabited by billions of daily users.
October 2024 marked the arrival of Meta’s "third wave" of generative AI research, culminating in the release of the Movie Gen research paper and technical showcase. Movie Gen is not merely an iterative update to Make-A-Video; it represents a holistic "cast" of foundation models designed to address the entire pipeline of video production. This suite integrates high-fidelity text-to-video generation with three critical capabilities that were previously disparate or non-existent in competitor offerings: personalized video generation that preserves human identity, precise instruction-based video editing, and synchronized high-fidelity audio generation.
The significance of Movie Gen extends far beyond its feature set. It signals a fundamental architectural departure from the standard diffusion models that powered the generative AI boom of the early 2020s. By adopting a "Flow Matching" training objective over traditional diffusion, Meta has embraced a more efficient, temporally consistent, and scientifically robust method of simulating reality. Furthermore, Movie Gen is designed not as a standalone tool for a niche group of filmmakers, but as an engine for "Social Cinema"—a democratized creative capability intended to be integrated directly into the platforms where users already reside: Instagram, Facebook, and Messenger. This report provides an exhaustive analysis of Meta Movie Gen, dissecting its technical architecture, evaluating its market position against competitors like OpenAI’s Sora and Runway Gen-3, and exploring the profound implications for content creators, developers, and the digital ecosystem at large.
The Evolution from Make-A-Video to Movie Gen
The technological leap from Make-A-Video to Movie Gen illustrates the rapid maturation of generative AI research. The 2022 Make-A-Video model relied on a U-Net architecture and utilized a method of learning motion dynamics from video data without necessarily requiring paired text for all training examples. This allowed for the generation of motion but often resulted in a disconnect between the user's specific prompt and the resulting visual dynamics. It was a system that could "hallucinate" movement, but often struggled with the physics of that movement.
In contrast, Movie Gen is built upon a massive 30-billion parameter transformer architecture that has been jointly optimized for both text-to-image and text-to-video generation. This "joint training" strategy is a critical innovation. By treating static images and dynamic videos as part of a continuous spectrum of visual data, the model learns to reason about the visual world with significantly greater fidelity. It understands that a "cat" in a static image and a "cat jumping" in a video are the same semantic entity subject to different temporal states. This shared representation allows the model to leverage the vast abundance of high-quality image data to improve the visual fidelity of its video frames, while using video data to learn the temporal dynamics.
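A minimal way to picture this "continuous spectrum" idea is to treat a still image as a one-frame video so both modalities flow through a single training pipeline. The helper below is a schematic sketch under that assumption; the real model operates on tokenized spatio-temporal latents, not raw pixel arrays, and the function name is illustrative.

```python
import numpy as np

def as_video(example: np.ndarray) -> np.ndarray:
    """Treat a still image as a one-frame video.

    Schematic sketch of joint image/video training: images and videos
    share one representation, so a (H, W, C) image is just a
    (1, H, W, C) clip. The actual model works on tokenized latents.
    """
    if example.ndim == 3:           # (H, W, C) still image
        return example[None, ...]   # -> (1, H, W, C)
    return example                  # already (T, H, W, C) video

image = np.zeros((64, 64, 3))
clip = np.zeros((16, 64, 64, 3))
```

With this normalization, a single batch can mix abundant image data (for visual fidelity) with video data (for temporal dynamics).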
The scale of compute and data involved in Movie Gen dwarfs that of its predecessor. The models were pre-trained on a massive corpus of licensed and publicly available datasets, allowing them to ingest a vast vocabulary of visual and auditory concepts. This scaling addresses the "silent movie" problem that plagued Make-A-Video. While the 2022 model was strictly visual, Movie Gen introduces a dedicated 13-billion parameter audio model capable of generating synchronized Foley (sound effects), ambient environments, and musical scores, creating a cohesive multimedia experience that Make-A-Video could not approach.
Strategically, the evolution reflects a shift from research to product readiness. Make-A-Video was a demonstration of possibility. Movie Gen is engineered with specific product-market fit in mind, particularly regarding the "selfie" mode and instruction-based editing. These features map directly to the behaviors of Instagram and Facebook users who modify and share personal content, rather than just generating generic stock footage. It positions Movie Gen as a "Social Cinema" engine, designed to lower the barrier to entry for high-end video production and integrate it seamlessly into the daily digital lives of consumers.
What is Meta Movie Gen? The 4 Core Capabilities
Meta Movie Gen is distinguished by its integration. Rather than offering a single model for video generation and relying on third-party tools for sound or editing, Meta has developed a suite of four distinct yet interoperable capabilities. These capabilities address the full lifecycle of video production: creation, customization, modification, and sonification. This integrated approach is a key differentiator in a market where competitors often specialize in only one modality.
Text-to-Video Generation
At the core of the Movie Gen ecosystem is the text-to-video generation model. This is a 30-billion parameter transformer trained with a maximum context length of 73,000 video tokens. To understand the scale, 30 billion parameters places this model among the largest vision models publicly disclosed, allowing it to store and retrieve an immense amount of "world knowledge"—understanding not just the appearance of objects, but their physical properties, interactions, and movement dynamics.
Technical Specifications and Capabilities: The model is capable of generating high-definition (1080p) videos of up to 16 seconds in duration at a frame rate of 16 frames per second (fps). While 16 fps is lower than the standard 24 fps used in cinema or 30/60 fps used in broadcast and gaming, this is a strategic optimization. It balances the immense computational load of generating high-resolution frames with the need for visual fluidity. In production pipelines, frame interpolation techniques (often AI-driven) are commonly used to smooth 16 fps output to higher frame rates for final delivery.
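As a rough illustration of such smoothing, the snippet below performs naive linear blending between adjacent frames to raise a clip's effective frame rate. Production interpolators are optical-flow-based or learned models, so this is only a toy sketch of the concept.

```python
import numpy as np

def interpolate_frames(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Naively upsample a (T, H, W, C) frame stack by linear blending.

    Real pipelines use optical-flow or learned interpolators; this toy
    version only illustrates synthesizing in-between frames to raise
    the effective frame rate of 16 fps output.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        for k in range(1, factor):
            s = k / factor
            out.append((1 - s) * a + s * b)  # blend between neighbours
    out.append(frames[-1])
    return np.stack(out)

clip = np.random.rand(16, 8, 8, 3)           # 1 second at 16 fps (tiny frames)
smooth = interpolate_frames(clip, factor=2)  # 31 frames over the same second
```

Doubling 16 original frames this way yields 31 frames over the same second, approximating 30 fps delivery.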
Crucially, the model supports multiple aspect ratios natively. Unlike early diffusion models that were often constrained to square (1:1) outputs, Movie Gen can generate in 9:16 (vertical, optimized for Reels and TikTok) and 16:9 (cinematic landscape). This is not a post-processing crop; the model generates the composition natively, ensuring that subjects are framed correctly regardless of the chosen format.
Reasoning and Physics: A defining characteristic of the 30B parameter model is its ability to reason about object motion and subject-object interactions. It goes beyond simple aesthetic coherence to simulate complex physical phenomena. The research highlights capabilities such as simulating the splashing of water, the swaying of fabric in the wind, or the interactions between animals and their environments. This "physics awareness"—while not a true physics engine in the simulation sense—suggests that the model has learned statistical correlations that closely approximate real-world physics, reducing the "dream-like" logic that often breaks immersion in AI video.
Personalized Video (The "Selfie" Feature)
If the text-to-video model is the engine, the personalized video capability is the steering wheel designed for the social media user. In the context of platforms like Instagram and Facebook, users are primarily interested in content that features themselves and their social circles. Generic stock footage, no matter how high-quality, has limited utility for a personal Story or a birthday greeting.
Identity Preservation: Movie Gen introduces a robust personalization engine that allows users to upload a single photo of a person’s face and animate them into any scene described by a text prompt. This differs fundamentally from a simple "face swap" or deepfake overlay. The model conditions the generation on both the text prompt and the reference image, preserving the individual's identity, facial structure, and key features while generating entirely new lighting, expressions, and motion compatible with the new environment.
Complex Modifications: The capabilities extend far beyond simple animation. The model can handle substantial deviations from the source image. A user can upload a static photo of themselves sitting in a generic office and generate a video of themselves exploring a jungle, wearing a spacesuit, or painting a canvas. The model maintains the "essence" or identity of the person while fully synthesizing the new context, pose, and action. This requires the model to have a deep disentanglement of "identity" features from "pose" and "lighting" features.
Differentiation Strategy: This feature is a critical competitive differentiator. OpenAI’s Sora, as of late 2024 and early 2025, has focused primarily on creating new characters or generic subjects, likely due to the immense safety challenges associated with generating real people. Meta’s decision to prioritize this feature suggests they have developed specific safeguards (discussed in the Safety section) to allow for this high-value use case, positioning Movie Gen as a tool for "self-expression" rather than just "content generation".
Precise Video Editing
Generative video has historically suffered from a lack of control, often referred to as the "slot machine" problem. If a user generated a video that was 90% perfect but had one flaw—such as a glitchy hand or the wrong shirt color—they would typically have to regenerate the entire clip with a new seed, likely losing the elements they liked. Movie Gen addresses this with a dedicated, instruction-based video editing module.
Instruction-Based Editing: Users can provide an existing video (either real footage captured on a phone or AI-generated content) and a text command to modify specific elements. The research paper and demos highlight examples such as "put a pom-pom in the man's hand," "change the background to a desert," or "dress the penguins in Victorian outfits". This natural language interface democratizes video editing, removing the need for complex masking or rotoscoping skills.
Pixel-Level Precision: The technical achievement here is the model's ability to perform "localized editing." The model utilizes attention mechanisms to identify exactly which pixels correspond to the subject of the edit (e.g., the man's hand) and which pixels must remain frozen (e.g., the man's face and the background). This ensures that the identity of the subject or the integrity of the scene is preserved while the edit is applied. This capability moves AI video from a novelty to a production-ready tool, allowing for iterative refinement—a workflow essential for professional creators.
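Conceptually, localized editing reduces to a masked composite: edited pixels come from the new generation, and everything else is copied from the source frames. The sketch below illustrates that composite with an explicit, hand-made mask; in Movie Gen the localization is inferred internally via attention, and all shapes and names here are illustrative.

```python
import numpy as np

def apply_localized_edit(video: np.ndarray, edited: np.ndarray,
                         mask: np.ndarray) -> np.ndarray:
    """Composite an edit into a video with a per-pixel mask.

    `mask` is 1.0 where the instruction applies (e.g. the hand holding
    the pom-pom) and 0.0 where original pixels must be preserved.
    Movie Gen derives this localization internally; here the mask is
    supplied explicitly for illustration.
    """
    mask = mask[..., None]  # broadcast (T, H, W) over the channel axis
    return mask * edited + (1.0 - mask) * video

T, H, W, C = 4, 8, 8, 3
video = np.zeros((T, H, W, C))   # stand-in for the source clip
edited = np.ones((T, H, W, C))   # stand-in for freshly generated pixels
mask = np.zeros((T, H, W))
mask[:, 2:5, 2:5] = 1.0          # only a small region may change

result = apply_localized_edit(video, edited, mask)
```

Everything outside the masked region is bit-identical to the source, which is exactly the "frozen pixels" guarantee that makes iterative refinement possible.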
Stylization and Global Edits: Beyond object manipulation, the editing model can perform global style transfers. It can transform a realistic video into a cartoon, an oil painting, or a line drawing, maintaining temporal consistency across frames. This builds upon Meta's earlier "Emu" research but applies it with the temporal stability of Flow Matching, preventing the flickering that often accompanies frame-by-frame style transfer.
Audio Generation
Movie Gen's audio component effectively ends the "silent era" of AI video. Visuals alone, no matter how realistic, often fail to pass the "uncanny valley" test without corresponding auditory cues; the disconnect between an on-screen crash and silence breaks the illusion of reality.
13B Parameter Audio Model: Meta has trained a substantial 13-billion parameter model dedicated solely to audio generation. To put this in perspective, this audio model alone is larger than many entire language models, indicating the complexity involved in generating high-fidelity, high-sample-rate (48kHz) sound.
Synchronization and Foley: The model takes the video as input and generates synchronized sound effects, known in the film industry as Foley. If a video shows a horse galloping, the audio model generates the rhythmic thud of hooves hitting the ground, synchronized to the visual impact. If a person is splashing water, the sound matches the intensity and timing of the splash. This synchronization is achieved through the model's ability to attend to visual motion cues and map them to audio events.
Ambient and Musical Score: In addition to diegetic sounds (sounds occurring within the story world), the model can generate non-diegetic background music that matches the mood of the prompt. An "epic" video prompt triggers an orchestral swell; a "melancholic" prompt generates a slow piano track. The model can professionally blend these elements—sound effects, ambient noise, and music—into a single audio track.
Audio Extension: A key technical innovation is the "audio extension technique," which allows the model to generate coherent audio for videos of arbitrary lengths. This solves the problem of audio looping or disjointed transitions, ensuring that a longer video has a continuous, evolving soundscape rather than a repetitive loop.
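Meta has not published the exact mechanism of audio extension, which conditions each new segment on the previous one inside the model. The hypothetical sketch below shows only the simpler downstream idea: overlapping generated chunks and cross-fading them so a long soundtrack has no hard seams.

```python
import numpy as np

def extend_audio(chunks: list, overlap: int) -> np.ndarray:
    """Stitch generated audio chunks with a linear cross-fade.

    Illustrative only: the real technique conditions generation on
    prior audio inside the model. Here, each chunk's head is blended
    with the previous chunk's tail over `overlap` samples.
    """
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = chunks[0]
    for nxt in chunks[1:]:
        head, tail = out[:-overlap], out[-overlap:]
        blended = tail * (1.0 - fade_in) + nxt[:overlap] * fade_in
        out = np.concatenate([head, blended, nxt[overlap:]])
    return out

sr = 100  # toy "sample rate"
track = extend_audio([np.ones(sr), np.full(sr, 0.5)], overlap=20)
```

The result is one continuous waveform whose length is the sum of the chunks minus the overlaps, with smooth transitions at every seam.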
How Movie Gen Compares to Competitors
The AI video landscape is crowded and fiercely competitive, with major players like OpenAI (Sora), Runway (Gen-3 Alpha), Luma AI (Dream Machine), and Kuaishou (Kling) vying for market dominance. Meta’s Movie Gen enters this arena not just as another tool, but as a comprehensive ecosystem play.
Meta Movie Gen vs. OpenAI Sora
OpenAI’s Sora, announced in February 2024, set the initial high-water mark for high-fidelity AI video. However, a comparison with Movie Gen reveals distinct philosophical and technical divergences.
Table 1: Meta Movie Gen vs. OpenAI Sora
Feature | Meta Movie Gen | OpenAI Sora |
Primary Focus | Personalization & Social Integration | World Simulation & Cinematic Realism |
Maximum Duration | 16 seconds (video), 45 seconds (audio) | Up to 60 seconds |
Audio | Native, synchronized generation (13B model) | Not natively demonstrated in initial launch |
Personalization | High: "Selfie" mode preserves identity | Low: Focus is on new character generation |
Editing | Precise: Instruction-based editing of existing clips | Regenerative: Changes usually require new prompts |
Distribution | Integrated into Instagram/Facebook/WhatsApp | Standalone product / API (Likely) |
Insight: Sora excels at "World Simulation"—creating long, coherent, cinematic shots that look like they belong in a movie theater. Its 60-second duration is a significant technical achievement in temporal consistency. In contrast, Movie Gen excels at "Social Expression." Its 16-second limit is a strategic choice, covering the vast majority of social media use cases (Stories, Reels) while reducing the computational cost per generation. The inclusion of a native audio model gives Meta a "completeness" advantage; a 16-second clip with sound is often more immediately usable than a 60-second silent clip.
Meta vs. Runway Gen-3 & Kling
Runway and Kling (by Kuaishou) are the current market leaders in available tools. Runway Gen-3 Alpha is praised for its realism and "Motion Brush" control tools, while Kling is known for its high frame rates and longer durations (up to 2 minutes in some modes).
The "Editor" Advantage: Runway requires complex prompting and specific tools like "Motion Brush" to achieve granular control. Meta promises a natural language interface for editing ("change the shirt to red"). If this works as advertised in the wild, it drastically lowers the barrier to entry for non-technical users, making video editing accessible to the general public rather than just prosumers.
Benchmark Dominance: In its research paper, Meta presents A/B human evaluation results claiming substantial wins over these competitors.
Overall Video Quality: Movie Gen achieved a net win rate of 35.02% versus Runway Gen-3 and 8.23% versus Sora.
Motion Naturalness: Evaluators consistently rated Movie Gen higher for natural movement, reducing the "floaty" or "morphing" artifacts often seen in diffusion-based video.
Personalization: Against tools specialized in personalization like ID-Animator, Movie Gen showed a net win rate of 64.74% for preserving character identity.
It is important to interpret these data points with the caveat that they are Meta-conducted evaluations. However, the use of "blind" human raters is the industry standard for assessing generative quality, as automated metrics (like FVD) often fail to capture human perceptual preference. The relatively close margin against Sora (8.23%) suggests that the visual quality gap is narrow, but the gap against Runway Gen-3 is portrayed as significant.
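For concreteness, the "net win rate" metric can be reproduced in a few lines. The vote tallies below are hypothetical; the paper reports only the resulting percentages.

```python
def net_win_rate(wins: int, losses: int, ties: int) -> float:
    """Net win rate as used in A/B human evaluations: the share of
    comparisons won minus the share lost (ties count in the total)."""
    total = wins + losses + ties
    return 100.0 * (wins - losses) / total

# Hypothetical tallies chosen to land on the reported 35.02%-style figure.
rate = net_win_rate(wins=540, losses=190, ties=270)  # 35.0
```

A net win rate of 0% would mean raters preferred each model equally often, so even the +8.23% margin over Sora represents a statistically meaningful preference if the sample is large.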
Under the Hood: The Technology Behind the Magic
To achieve these results, Meta did not simply scale up existing diffusion architectures. They implemented fundamental changes to how the model learns and generates data, moving towards a new paradigm in generative modeling.
Transformers and Flow Matching
The most significant technical shift in Movie Gen is the adoption of Flow Matching over standard Diffusion.
The Limitation of Diffusion: Traditional diffusion models (like Stable Diffusion or the original Make-A-Video) work by adding noise to an image until it is pure static, and then learning to reverse that process to generate an image from noise. While effective, the path from "noise" to "image" is mathematically complex and often inefficient, requiring hundreds of sampling steps. This stochastic process can lead to slow generation times and, in video, temporal inconsistencies (flickering), as the model struggles to decide which path to take for each frame.
The Flow Matching Solution: Flow Matching is a technique that learns a "velocity field" to transform a simple probability distribution (noise) into a complex one (data) along a straight, optimal path. Conceptually, if Diffusion is like wandering through a foggy forest to find a destination via a random walk, Flow Matching is drawing a straight line on a map and following it.
Temporal Consistency: For video, this "straight line" efficiency means the model can maintain temporal coherence much better. It doesn't "forget" what the object looked like in frame 1 by the time it gets to frame 16. This results in smoother motion and fewer hallucinations where objects randomly morph or disappear.
Efficiency: Flow matching reduces the number of steps needed to train the model effectively and accelerates the generation speed, lowering computational overhead—a critical factor for deploying models to billions of users.
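The "straight path" objective can be made concrete with a toy example. The sketch below is a hypothetical 1-D illustration, not Meta's implementation: a tiny linear "velocity field" v(x, t) is trained to regress the constant velocity x1 − x0 of the straight interpolation between noise and data, and samples are then generated by integrating the learned ODE with Euler steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D flow matching (illustrative only): learn a velocity field
# v(x, t) that transports noise x0 ~ N(0, 1) to "data" x1 ~ N(4, 0.25)
# along the straight path x_t = (1 - t) * x0 + t * x1. The regression
# target is that path's constant velocity, x1 - x0.
w = np.zeros(3)  # tiny linear "network": v(x, t) = w . [x, t, 1]

lr, steps = 0.1, 4000
for _ in range(steps):
    x0 = rng.standard_normal(1024)              # noise samples
    x1 = 4.0 + 0.5 * rng.standard_normal(1024)  # "data" samples
    t = rng.random(1024)
    xt = (1 - t) * x0 + t * x1                  # point on the straight path
    feats = np.stack([xt, t, np.ones_like(t)], axis=1)
    err = feats @ w - (x1 - x0)                 # prediction minus velocity
    w -= lr * feats.T @ err / len(t)            # MSE gradient step

# Sampling = integrating dx/dt = v(x, t) from t=0 to t=1 (Euler method).
x = rng.standard_normal(512)
n = 50
for k in range(n):
    t = np.full_like(x, k / n)
    x += (1 / n) * (np.stack([x, t, np.ones_like(t)], axis=1) @ w)
```

After integration, the noise samples cluster around the data mean of 4; the same regress-the-velocity recipe scales up to video latents, where the straighter transport path is what buys the temporal stability described above.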
30B Parameter Transformer: Movie Gen uses a transformer backbone (similar to LLMs like Llama 3) rather than the U-Net backbone used in earlier image generation models. Transformers are adept at handling long-range dependencies, which is crucial for video where the model needs to "remember" the beginning of the clip to ensure the end makes sense. The 30B parameter size places it among the largest vision models, allowing for a dense understanding of world physics and semantic concepts.
Training Data & Copyright
The fuel for these models is data, and Meta possesses one of the largest proprietary datasets in the world: the combined repositories of Instagram and Facebook.
The Data Advantage: While OpenAI and Runway scrape the public web—a practice that has led to legal challenges and "grey area" copyright status—Meta has explicitly stated it trains on a combination of licensed data and publicly available datasets. Crucially, Mark Zuckerberg and Meta executives have hinted and confirmed that public Instagram and Facebook posts are used for training.
The "Social" Dataset: This gives Meta a unique edge. Their data isn't just "video"; it's video with context. Instagram videos have captions, hashtags, comments, and location tags. This rich metadata allows the model to learn associations (e.g., what a "birthday party" looks like in different cultures, or how "joy" is expressed in movement) with a nuance that web-scraped data might lack.
GDPR and Opt-Outs: This data usage has sparked significant controversy, particularly in Europe. Under GDPR, Meta was forced to offer an "opt-out" mechanism for EU users to prevent their data from being used for AI training. Reports indicate that millions of users may have attempted to opt out, though the process has been criticized by privacy groups like NOYB as convoluted (described as "malicious consent trickery"). Meta’s reliance on this data underscores the tension between privacy and AI capability; the very "personal" feel of Movie Gen is derived from the personal data of its users.
Availability & How to Use Meta's AI Video Tools
As of late 2024, and looking ahead into 2026, the rollout of Movie Gen is a phased operation designed to integrate the technology deeply into Meta's apps rather than to release it as a standalone product.
Current Integration in Instagram and Messenger
Meta has not released Movie Gen as a standalone website (like Runway's runwayml.com) for the general public. Instead, the technology lives where the users already are.
"Imagine" Features: The precursor to full video generation is the "Imagine" feature currently live in Meta AI chats. Users can generate static images and simple animations (Flash-like motion) inside WhatsApp and Messenger. This serves as the entry point, habituating users to the concept of typing prompts to get visual media.
Emu Video & "Restyle": Inside Instagram, the technology powers features like "Backdrop" (changing the background of a Story) and "Restyle" (applying a visual filter to a photo). These are the "light" versions of Movie Gen’s editing capabilities, likely running on optimized, smaller variants of the models to ensure low latency on mobile devices.
Reels Integration: The ultimate destination for Movie Gen is the Reels creation interface. The vision is a "Create with AI" button that lets a user type a prompt or upload a selfie and generate a 16-second clip to post immediately. This frictionless in-app experience is Meta's moat; users do not need to leave the app, subscribe to a third-party service, or export and import files.
The "Blumhouse" Partnership & Creator Rollout
To build trust and demonstrate capability to the professional creative class, Meta partnered with Hollywood production company Blumhouse (famous for Get Out, The Purge) and selected filmmakers like Aneesh Chaganty, the Spurlock Sisters, and Casey Affleck.
"i h8 ai": Aneesh Chaganty’s short film, titled i h8 ai, serves as a fascinating piece of corporate-sponsored critique. In the film, Chaganty uses Movie Gen to upgrade his old childhood home movies, expressing ambivalence about the technology ("I hate AI") while simultaneously demonstrating its power ("But with a tool like this? Maybe I'd have dreamed a little bigger").
Strategic Signal: This partnership signals that Meta wants Movie Gen to be seen as a tool for augmentation of human creativity, not replacement. By putting it in the hands of horror and indie filmmakers, they are marketing it as an artistic instrument rather than just a meme generator. It creates a narrative that professional filmmakers can use these tools to "plus" their work, potentially softening the blow of AI's disruption in the entertainment industry.
Safety, Watermarking, and Ethics
With the power to generate photorealistic videos of people comes the immense risk of deepfakes, misinformation, and non-consensual sexual imagery (NCII). Meta has deployed a defense-in-depth strategy, heavily relying on invisible watermarking.
Invisible Watermarking: VideoSeal
Meta has introduced "VideoSeal," a state-of-the-art invisible watermarking technique designed specifically for the video domain.
Mechanism: Unlike a visible logo or a metadata tag, VideoSeal embeds a signal directly into the pixels of the video (and the frequency spectrum of the audio). This signal is imperceptible to the human eye but can be detected by Meta’s algorithms.
Robustness and "Tamper-Invariance": Crucially, VideoSeal is designed to survive "attacks." If a user takes an AI-generated video, crops it, rotates it, compresses it for WhatsApp, and adds a filter, the watermark remains detectable. This is achieved through a "tamper-invariant" design trained to resist common editing operations. The system uses a localized watermark approach, meaning that even if only a fragment of the video (e.g., a few seconds or a cropped region) is shared, the watermark can still be identified.
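To make the idea of pixel-level embedding concrete, the toy below hides a payload in the least-significant bits of a frame. This is emphatically not VideoSeal: LSB marks are invisible (the change is at most 1/255 per pixel) but fragile under compression, whereas VideoSeal spreads a redundant, learned signal designed to survive cropping and re-encoding. All names here are illustrative.

```python
import numpy as np

def embed_watermark(frame: np.ndarray, payload: np.ndarray) -> np.ndarray:
    """Hide a bit payload in the least-significant bits of a uint8 frame.

    Toy LSB scheme, for illustration only: clears each target pixel's
    lowest bit and writes one payload bit into it.
    """
    flat = frame.flatten()  # flatten() returns a copy, source untouched
    flat[: payload.size] = (flat[: payload.size] & np.uint8(0xFE)) | payload
    return flat.reshape(frame.shape)

def extract_watermark(frame: np.ndarray, n_bits: int) -> np.ndarray:
    """Read the payload back out of the lowest bit plane."""
    return frame.reshape(-1)[:n_bits] & np.uint8(1)

frame = np.full((8, 8), 128, dtype=np.uint8)
payload = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=np.uint8)
marked = embed_watermark(frame, payload)
recovered = extract_watermark(marked, payload.size)
```

A robust scheme like VideoSeal replaces this brittle bit-plane trick with an embedder/detector pair trained end-to-end against simulated crops, filters, and codecs.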
Comparison to Google SynthID: Both VideoSeal and Google’s SynthID serve similar purposes, but their implementations differ. SynthID (for images and some video) often relies on modifying the sampling probability during generation. VideoSeal, however, emphasizes localization—the ability to detect which part of a video is AI (e.g., if just a face was swapped). Meta has open-sourced parts of the VideoSeal research (under the "Meta Seal" framework) to encourage industry standardization, recognizing that a watermarking standard is only effective if it is widely adopted.
Engineering Challenges: Scaling this to billions of videos presented massive engineering hurdles. Meta initially faced bottlenecks with GPU-based watermarking due to the inefficiency of video transcoding on GPUs. They successfully transitioned to a CPU-based solution that optimized threading and sampling parameters, achieving end-to-end latency within 5% of GPU performance while significantly improving operational efficiency.
Deepfakes and Public Figure Protection
The "Personalized Video" feature is the highest-risk capability in the Movie Gen suite.
Public Figure Guardrails: Meta’s policy explicitly prohibits generating videos of public figures (politicians, celebrities) for political or commercial endorsements without consent. The model is likely fine-tuned with a "refusal" list—if a prompt asks for "Joe Biden doing a backflip," the model is trained to refuse the request. This is a critical safeguard against election interference.
The "Selfie" Restriction: For the personalization feature, technical constraints and policy likely require the user to upload their own face (verified via account data) rather than just any photo found online. While the exact mechanism of this verification in the wild remains a challenge, it is a necessary step to prevent the tool from being used to generate non-consensual content of private individuals.
The Future: Will Movie Gen Be Open Source?
The question on every developer’s mind is whether Movie Gen will follow the path of Llama (Open Weights) or ChatGPT (Closed API).
The Llama Precedent vs. Safety Realities
Meta’s Llama series of language models redefined the AI industry by providing powerful open-weights models that rivaled closed competitors like GPT-4. This "scorched earth" strategy commoditized the text market, hurting competitors like OpenAI and Google.
The Case for Open Source: Releasing Movie Gen weights would instantly make Meta the standard-bearer for video research. It would allow the open-source community to optimize, fine-tune, and build upon the architecture, potentially accelerating innovation faster than a closed team could.
The Safety Brake: However, video is different. The potential for harm—specifically non-consensual deepfake pornography and political disinformation—is viscerally higher with photorealistic video than with text. Expert consensus suggests that Meta will likely not release the full 30B Movie Gen weights initially. Instead, they will keep it behind the "walled garden" of Instagram/Facebook to control the safety filters and watermarking.
Yann LeCun’s "World Models": Meta’s Chief AI Scientist, Yann LeCun, views these video models as stepping stones to "World Models"—systems that understand physics and cause-and-effect. For LeCun, the goal isn't just pretty video; it's AGI. A model that can predict the next 16 seconds of video is a model that understands how the world works. This research value might drive a limited research release to universities, even if the commercial model remains closed.
Conclusion: The "Social Cinema" Era
Meta Movie Gen represents more than a technological upgrade; it signals the dawn of "Social Cinema." By integrating Hollywood-grade video generation, editing, and audio synthesis directly into the world’s largest social platforms, Meta is lowering the barrier to creation to near zero.
The shift from "Make-A-Video" to "Movie Gen" is a shift from novelty to utility. With Flow Matching ensuring fluid motion, Personalization ensuring relevance, and Audio ensuring immersion, Movie Gen is poised to be the most widely used AI video tool simply by virtue of its distribution. For the content creator, the message is clear: the toolkit is expanding. The ability to "reshoot" a video with a text prompt or animate a selfie into a travel vlog is no longer science fiction. For the competitor, the challenge is daunting: beating Meta’s quality is hard; beating Meta’s ecosystem is nearly impossible.
While ethical concerns regarding training data and deepfakes remain unresolved, the technology itself has arrived. The "silent era" of AI video is over. The era of the personalized, generative blockbuster has begun.
Detailed Technical Appendix
Table 2: Benchmark Performance (Human Evaluation)
Data derived from Meta's "Movie Gen" Research Paper. Represents "Net Win Rate" (Meta Wins minus Competitor Wins).
Comparison | Metric | Net Win Rate |
Movie Gen vs. Runway Gen-3 | Overall Video Quality | +35.02% |
Movie Gen vs. OpenAI Sora | Overall Video Quality | +8.23% |
Movie Gen vs. Kling 1.5 | Overall Video Quality | +3.87% |
Movie Gen vs. ID-Animator | Identity Preservation | +64.74% |
(Note: A positive score indicates a preference for Movie Gen. The margin against Sora and Kling is tighter than against Runway, indicating a highly competitive high-end market.)
Table 3: The "Cast" of Models
Model Name | Parameters | Function | Key Technical Innovation |
Movie Gen Video | 30 Billion | Text-to-Video, Image-to-Video | Flow Matching, Joint Image/Video Training |
Movie Gen Audio | 13 Billion | Video-to-Audio, Text-to-Audio | Audio Extension, Flow Matching for Audio |
Movie Gen Edit | (derived from 30B) | Instruction-based Video Editing | Latent Masking, Pixel Preservation |
Personalization Adapter | (derived from 30B) | Identity-Preserving Generation | Face encoding injection, ID-loss optimization |


