At today's Google I/O keynote, DeepMind CEO Demis Hassabis walked onstage and made the kind of claim that usually gets you laughed off the stage: a model that creates any output from any input. The model is called Gemini Omni, and it represents Google's first serious attempt to unify video, image, audio, and text generation under a single architecture.
Hassabis framed Omni as a step toward artificial general intelligence, a theme he returns to frequently. But the on-stage demos suggested something more immediately practical. Users can feed the system any combination of inputs and receive coherent video output. The company showed the model generating clips from text prompts, reference images, existing footage, and audio files. According to Google, Omni can handle complex, multi-turn editing directly in chat.
Physics as a Feature
The most striking technical claim is that Omni understands physics. According to Google, the model can simulate realistic interactions between materials and characters, and gravity and motion behave plausibly in generated scenes. This builds on DeepMind's long-running world model research, which Hassabis has described as essential infrastructure for AGI. The ability to generate physically coherent environments is what separates world models from pure image synthesis.
Early leaks had already teased some of Omni's capabilities. A sample video showing a professor deriving trigonometric equations on a chalkboard went viral last week. The equations rendered correctly, and the handwriting, timing, and spatial logic remained coherent throughout. For anyone who has watched AI video generators fail at basic text rendering, this is a notable achievement.
What This Means for the AI Video Race
Current top-tier video models such as Veo 3.1, Seedance 2.0, and Kling 3.0 are specialized tools. You give them a prompt, they give you a clip, and iteration means starting over. Omni takes a different approach: conversational creation. You describe what you want, get a draft, and then talk the model through revisions. Object replacement, scene rewriting, and style changes happen through follow-up instructions. If the workflow holds up in production, it could displace timeline-based editing for short-form content.
Leaked technical data suggests generated clips currently cap at around 10 seconds at 1280x720 resolution. That's mid-range by current standards. The model's internal ID, discovered in the Gemini app earlier this month, suggests it is architecturally distinct from Veo rather than a rebrand. Google appears to be running two parallel video model tracks: Veo for API and enterprise customers, Omni for consumer-facing Gemini experiences.
The Bigger Picture
Google has been telegraphing this direction for over a year. Hassabis told Reid Hoffman's podcast last April that Google intended to merge Gemini and Veo to improve the model's understanding of the physical world. Omni appears to be that merger realized. The company's strategy involves using video generation not primarily as a consumer product but as training infrastructure for agents that must act in reality.
For Google, the timing is strategic. OpenAI's Sora App shut down in late April, leaving a vacuum in the consumer AI video space. Omni fills it with a product that's more ambitious in scope. Whether that ambition translates to better outputs than what Runway, Kling, and others already offer remains to be seen. The early demos are impressive. The leaked samples show high prompt adherence and clean text rendering. But demos shown at developer conferences are not production software.
Pricing has not been disclosed. Based on existing Veo pricing, analysts expect per-second API costs between $0.10 and $0.30 depending on quality tier. Gemini Advanced subscribers will likely get limited free access.
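To put those figures in per-clip terms, here is a rough back-of-the-envelope sketch. It assumes the analyst range of $0.10 to $0.30 per second and the roughly 10-second clip cap from the leaked data. Neither number is confirmed by Google, and the function and constant names are purely illustrative.

```python
# Illustrative estimate only: these rates are analyst expectations,
# not pricing disclosed by Google.
LOW_RATE_USD_PER_SEC = 0.10
HIGH_RATE_USD_PER_SEC = 0.30
CLIP_CAP_SECONDS = 10  # approximate cap reported in leaked data


def clip_cost_range(seconds, low=LOW_RATE_USD_PER_SEC, high=HIGH_RATE_USD_PER_SEC):
    """Return the (low, high) estimated cost in USD for one clip of the given length."""
    return (round(seconds * low, 2), round(seconds * high, 2))


low_cost, high_cost = clip_cost_range(CLIP_CAP_SECONDS)
print(f"A {CLIP_CAP_SECONDS}-second clip: ${low_cost:.2f} to ${high_cost:.2f}")
```

Under those assumptions, a single maximum-length clip would land somewhere between one and three dollars, and a conversational session with several revisions would multiply that by the number of regenerated drafts.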
The model is rolling out now to some users in the Gemini app and Google Flow. Broader availability is expected over the coming weeks.


