Breaking · Google Flow

Google's "Gemini Omni" Generates Video from Any Input

June 1, 2026 at 16:39 EDT

Google announced its new multimodal AI "Gemini Omni," introducing an "any input to video" capability that can generate and edit high-quality video by freely combining text, images, audio, and video. Unveiled at Google I/O around May 19, 2026, the first implementation, "Gemini Omni Flash," is immediately available in the Gemini app (paid plans), YouTube Shorts, and Google Flow. 1

What Happened

Google disclosed details in its official blog post "Introducing Gemini Omni" and on a DeepMind page. Gemini Omni is a model that integrates Gemini's world understanding (physics, history, science, culture) with video generation capabilities, described as "create anything from any input – starting with video." 2 3

The main features are as follows.

Conversational, step-by-step video editing via natural language (making sequential edits while maintaining consistency)
Consistent output integrating arbitrary references such as images, text, video, and audio
Physics-compliant generation grounded in real-world knowledge, plus style conversions to looks such as claymation, hologram, and voxel art
Creation of a "digital avatar" that reproduces your own appearance and voice

Gemini Omni Flash currently generates clips of up to 10 seconds. This is reportedly not a model limitation but a design choice that reflects early availability and user demand, and longer durations are planned. An API is expected to become available within a few weeks. All outputs carry a SynthID watermark. The higher-performance "Gemini Omni Pro" is slated for release once clear improvements have been confirmed. 1

Background and Significance

Gemini has aimed to be natively multimodal from the start, and Omni is positioned as the evolution of that approach. By combining the dedicated video model Veo with Gemini's reasoning capabilities, the goal is to achieve not mere synthesis but "meaningful storytelling." Omni is effectively the video counterpart to the image-editing model "Nano Banana," and its differentiator is conversational editing that requires no specialized tools. SynthID is built in as standard, which also underscores the emphasis on countermeasures against deepfakes. 1

According to TechCrunch, CEO Pichai described this as "a step toward world models." He emphasized ease of use for consumers and text-rendering accuracy for creators. Similar moves include Luma AI's integrated intelligence model, but Google puts ease of use and consistency front and center. 1

Reactions

The distributor, Google Flow (@FlowbyGoogle), published a thread showcasing use cases that are spreading rapidly in the community. Reactions from users who actually tried it were generally favorable, with expressions like "insane" and "magic tool" standing out.

Specifically, users report that natural-language editing instructions such as "add a mechanical pack to her back at 7:03" are reflected in fine detail while consistency is maintained. 6 Users also praise the lowered barrier to UGC video production with one's own avatar, and the high consistency of scene editing, object replacement, and style conversion ("you can fix it with a prompt without regenerating"). Few negative reactions turned up within the scope of this survey, and conversational video editing is drawing attention in non-English-speaking regions as well.

Google's official site and official YouTube video also introduce it as the "Omni-verse," emphasizing creative-studio uses through integration with Flow. 4
