For consistent characters in Flow, use “ingredients”. According to Google:
“An ingredient is a consistent visual element — a character, an object or a stylistic reference — that you can create from a text-to-image prompt with the help of Imagen or by uploading an image. You can add up to three ingredients per prompt by selecting “Ingredients to Video” and then generating or uploading the desired images.”
You should be able to add your two main characters this way, and keep them consistent with Ingredients to Video.
Another way to generate a new clip with the same character is to Jump To it. According to Google:
“Transition a character or object to a completely new setting while preserving their appearance from the previous shot. It’s like teleporting your subject, saving you from recreating them for a new scene.”
In general, you’re going to want to be very specific when prompting Veo for video. From Google:
“Consider these elements when crafting your prompt:
Subject and action: Clearly identify your characters or objects and describe their movements.
Composition and camera motion: Frame your shot with terms like “wide shot” or “close-up,” and direct the camera with instructions like “tracking shot” or “aerial view.”
Location and lighting: Don’t just name a place; paint a picture. The lighting and environment set the entire mood. Instead of “a room,” try describing “a dusty attic filled with forgotten treasures, a single beam of afternoon light cutting through a grimy window.”
Alternative styles: Flow is not limited to realistic visual styles. You can explore a wide array of animation styles to match your story’s tone. Experiment with prompts that specify aesthetics like “stop motion,” “knitted animation” or “clay animation.”
Audio and dialogue: While still an experimental feature, you can generate audio with your video by selecting Veo 3 in the model picker. You can then prompt the model to create ambient noise, specific sound effects, or even generate dialogue by including it in your prompt, optionally specifying details like tone, emotion, or accents. Note that speech is less likely to be generated if the requested dialogue doesn’t fit in the 8-second clip, or if it involves minors.
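To make that checklist concrete, here is a minimal sketch of how you might assemble a single prompt from those five elements before pasting it into Flow. The function and field names are my own convention, not an official schema, and the sample values borrow from the street-interview example later in this piece.

```python
# Sketch: assemble one Veo/Flow prompt from the elements Google lists above.
# Field names and sample values are illustrative assumptions, not an official schema.

def build_veo_prompt(
    subject_action: str,
    composition_camera: str,
    location_lighting: str,
    style: str = "",
    audio_dialogue: str = "",
) -> str:
    """Join the non-empty elements into one paragraph-style prompt."""
    parts = [subject_action, composition_camera, location_lighting, style, audio_dialogue]
    return " ".join(p.strip() for p in parts if p.strip())


prompt = build_veo_prompt(
    subject_action="An old man struts down a crowded Miami strip, clutching a tiny chihuahua.",
    composition_camera="Handheld medium-wide tracking shot, filmed like raw street footage.",
    location_lighting="Night on the strip, neon signs flickering, electric color washing over the crowd.",
    style="Gritty, realistic street-documentary look.",
    audio_dialogue='Ambient crowd noise and thumping club music; he shouts: "Indiana got that dog in \'em!"',
)
print(prompt)  # paste the result into Flow / Veo 3
```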
You can use Gemini to refine prompts, expand on an idea or be a brainstorming companion. Here’s a Gemini prompt to get you started:
“You are the world’s most intuitive visual communicator and expert prompt engineer. You possess a deep understanding of cinematic language, narrative structure, emotional resonance, the critical concept of filmic coverage and the specific capabilities of Google’s Veo AI model. Your mission is to transform my conceptual ideas into meticulously crafted, narrative-style text-to-video prompts that are visually breathtaking and technically precise for Veo.”
If you’re using Gemini to help generate multiple clips that have scene consistency, you’ll need to explicitly tell Gemini to repeat all essential details from prior prompts.
My take: cheeky, prompting us to use Gemini to create prompts for Veo. Bit of a house of mirrors, no?
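If you’d rather script that hand-off than chat in the Gemini app, here’s a minimal sketch using the google-genai Python SDK with the suggested persona as the system instruction. The model name and the whole idea of driving this from code are my assumptions, not part of the quoted guidance.

```python
# Sketch: use Gemini to expand a rough shot idea into a Veo-ready prompt.
# Assumes the google-genai SDK (pip install google-genai) and an API key in the
# environment (e.g. GEMINI_API_KEY). The model name is an assumption; swap in
# whichever Gemini model you have access to.
from google import genai
from google.genai import types

PERSONA = (
    "You are the world's most intuitive visual communicator and expert prompt engineer. "
    "Transform my conceptual ideas into meticulously crafted, narrative-style "
    "text-to-video prompts that are visually breathtaking and technically precise for Veo. "
    "Repeat all essential character and setting details in every prompt."
)

client = genai.Client()  # reads the API key from the environment
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Shot: an elderly Miami man in a foam cowboy hat struts down a neon strip at night.",
    config=types.GenerateContentConfig(system_instruction=PERSONA),
)
print(response.text)  # the drafted Veo prompt, ready to paste into Flow
```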
I’ve generated 30M+ views in 3 weeks using this exact workflow:
Write a rough script
Use Gemini to turn it into a shot list + prompts
Paste into Veo 3 (Google Flow)
Edit in Capcut/FCPX/Premiere, etc.
Concept
Kalshi is a prediction market where you can trade on anything (it’s legal betting in the US).
I pitched them on a GTA VI style concept because I think that unhinged street interviews are Veo 3’s bread and butter right now.
I guarantee you that everyone will copy this soon, so might as well make it easy and give you the entire process.
Script
Their team gave me a bunch of bullet points on the betting markets they wanted to cover (NBA, Eggs, Hurricanes, Aliens, etc.)
I then rewatched the GTA VI trailer and got inspired by a couple locations, characters, etc.
Growing up in Florida…this wasn’t a hard script to write, lol.
Prompting:
I then ask Gemini/ChatGPT to take the script and convert every shot into a detailed Veo 3 prompt. I always tell it to return 5 prompts at a time—any more than that and the quality starts to slip.
Each prompt should fully describe the scene as if Veo 3 has no context of the shot before or after it. Re-describe the setting, the character, and the tone every time to maintain consistency.
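A minimal sketch of that batching discipline, assuming a hypothetical ask_llm() helper standing in for whichever chat model you use; the chunk size and the re-describe-everything instruction come straight from the advice above.

```python
# Sketch: convert a shot list into Veo prompts in small batches (5 at a time here,
# per the workflow above) so the LLM's quality doesn't slip. ask_llm() is a
# hypothetical stand-in for your Gemini/ChatGPT call of choice.

BATCH_SIZE = 5
INSTRUCTION = (
    "Convert each shot below into a detailed Veo 3 prompt. Assume Veo has no "
    "context of any other shot: re-describe the setting, the characters, and the "
    "tone in full, every single time."
)

def batch_shots_to_prompts(shots: list[str], ask_llm) -> list[str]:
    prompts: list[str] = []
    for i in range(0, len(shots), BATCH_SIZE):
        batch = shots[i : i + BATCH_SIZE]
        numbered = "\n".join(f"{n + 1}. {s}" for n, s in enumerate(batch))
        prompts.append(ask_llm(f"{INSTRUCTION}\n\n{numbered}"))
    return prompts
```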
Prompt example:
A handheld medium-wide shot, filmed like raw street footage on a crowded Miami strip at night. An old white man in his late 60s struts confidently down the sidewalk, surrounded by tourists and clubgoers. He’s grinning from ear to ear, his belly proudly sticking out from a cropped pink T-shirt. He wears extremely short neon green shorts, white tube socks, beat-up sneakers, and a massive foam cowboy hat with sequins on it. His leathery tan skin glows under the neon lights.
In one hand, he clutches a tiny, trembling chihuahua to his chest like a prized accessory.
As he walks, he turns slightly toward the camera, still mid-strut, and shouts with full confidence and joy:
“Indiana got that dog in ’em!”
Trailing just behind him are two elderly women in full 1980s gear—both wearing bedazzled workout leotards, chunky sneakers, and giant plastic sunglasses. Their hair is still in curlers under clear plastic shower caps. One sips from a giant novelty margarita glass, the other waves at passing cars.
Around them, the strip is buzzing—people filming with phones, scooters zipping by, music thumping from nearby balconies. Neon signs flicker above, casting electric color across the scene. The crowd parts around the trio, half amazed, half confused.
Process
Instead of giving it 10 shots and telling ChatGPT to turn them all into prompts, I find it works best when it gives you back only 3 prompts at a time.
This keeps the accuracy high.
Open up three separate windows in Veo 3 and put each prompt in there.
Run all three at the same time.
3-4 min later, you’ll get back your results. You’ll likely need to change things.
Take the first prompt back into ChatGPT and dictate what you want changed.
Then it will give you a new adjusted prompt.
Let that run while you then adjust prompt 2. Then prompt 3. Usually, by the time you’re done with prompt 3, prompt 1 has its second iteration generated.
Rinse and repeat for your whole shot list.
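The three-windows trick is manual parallelism plus an iteration loop. Here’s a minimal sketch of the same shape in code, where generate_clip() and revise_prompt() are hypothetical placeholders for however you trigger a Veo generation and for your LLM revision step (Flow itself is a web UI, so the loop, not the calls, is the point).

```python
# Sketch: run three Veo generations in parallel and iterate on each result.
# generate_clip() and revise_prompt() are hypothetical placeholders for your
# generation front end and your LLM revision step; the pipelined loop is the point.
from concurrent.futures import ThreadPoolExecutor


def generate_clip(prompt: str) -> str:
    """Placeholder: submit one prompt and return a path/URL to the finished clip."""
    raise NotImplementedError


def revise_prompt(prompt: str, notes: str) -> str:
    """Placeholder: ask your LLM to adjust the prompt based on review notes."""
    raise NotImplementedError


def iterate_shot(prompt: str, review) -> str:
    """Generate, review, and regenerate one shot until the reviewer is happy."""
    while True:
        clip = generate_clip(prompt)
        notes = review(clip)          # e.g. "make the hat bigger", or "" if done
        if not notes:
            return clip
        prompt = revise_prompt(prompt, notes)


def run_shot_list(prompts: list[str], review) -> list[str]:
    # Three at a time, mirroring the three Flow windows described above.
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(lambda p: iterate_shot(p, review), prompts))
```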
Tips:
I don’t know how to fix the random subtitles. I’ve tried it with and without quotes, and adding “(no subtitles)” to the prompt, and it still happens. If anyone has a tip, let me know and I’ll add it to this post.
Don’t let ChatGPT describe music being played in the background or it’ll be mixed super loud.
If you want certain accents, repeat “British accent”, “country accent”, etc. a couple of times. I’ve found it does a decent job matching the voice to the face/race/age, but it helps to prompt for it.
Edit
Editing Veo 3 videos is easy.
Simply merge the clips in CapCut, FCPX, or Premiere, and add music (if necessary).
I’d love to know if anyone has found good upscale settings for Veo 3 in 720p. My tests in Topaz made the faces more garbled, so I try to cover it with a bit of film grain.
I like to add compression and a bit of bass to the Veo 3 audio because I find it “thin”.
Cost and Time:
This took around 300–400 generations to get 15 usable clips. One person, two days.
That’s a 95% cost reduction compared to traditional advertising.
The Future of Ads
But just because this was cheap doesn’t mean anyone can do it this quickly or effectively. You still need experience to make it look like a real commercial.
I’ve been a director for 15+ years, and just because something can be done quickly doesn’t mean it’ll come out great. But it can if you have the right team.
The future is small teams making viral, brand-adjacent content weekly, getting 80 to 90 percent of the results for way less.
What’s the Moat for Filmmakers?
It’s attention.
Right now the most valuable skill in entertainment and advertising is comedy writing.
If you can make people laugh, they’ll watch the full ad, engage with it, and some of them will become customers.
The BTS:
My take: high energy, for sure! That’s one detailed prompt for a three-second clip.
Google Veo is arguably the best (but most expensive) AI video generator today. And Google Flow is arguably the best AI filmmaking tool built with and for creatives. Want to peek under the hood and reveal the prompts creating the magic? See Flow TV.
He demos OpenArt where you can train a consistent character from:
a text prompt,
a single image, or
multiple images
He says, “The character weight slider controls how strongly your character’s features are preserved in the generated image. At higher values like 0.8 or 0.9 your character’s features will be strongly preserved, resulting in very consistent appearances…. Next is the preserve key features toggle that when turned on instructs the AI to maintain a very consistent appearance, particularly for elements like clothing, hairstyle and accessories. When turned off you can change their clothing and environment while keeping their face consistent.”
And concludes:
“I’ve tested pretty much every AI platform out there and I can honestly say that OpenArt is by far the best for creating consistent characters. Nothing else even comes close.”
My take: one of the neat things on the OpenArt home page is the “See what others are creating” section that lets you know the models and prompts other artists are using. I do wish Roboverse’s text on screen didn’t flicker – cuz it tires my eyes.
“Gems let you customize Gemini to create your own personal AI expert on any topic, and are starting to roll out for everyone at no cost in the Gemini app. Get started with one of our premade Gems or quickly create your own custom Gems, like a translator, meal planner or math coach. Just go to the “Gems manager” on desktop, write instructions, give it a name and then chat with it whenever you want. You can also upload files when creating a custom Gem, so it can reference even more helpful information.”
Some of the pre-made Gems:
Brainstormer: Helps generate ideas and concepts.
Career guide: Assists with career planning and job searches.
Coding partner: Provides support for coding tasks.
Learning coach: Helps with studying and learning new topics.
Writing editor: Assists with grammar, style, and clarity.
Google suggests using this format when writing instructions: Persona / Task / Context / Format. For instance, this is their prompt for Brainstormer:
Persona
Your purpose is to inspire and spark creativity. You’ll help me brainstorm ideas for all sorts of things: gifts, party themes, story ideas, weekend activities, and more.
Task
Act like my personal idea generation tool coming up with ideas that are relevant to the prompt, original, and out-of-the-box.
Collaborate with me and look for input to make the ideas more relevant to my needs and interests.
Context
Ask questions to find new inspiration from the inputs and perfect the ideas.
Use an energetic, enthusiastic tone and easy to understand vocabulary.
Keep context across the entire conversation, ensuring that the ideas and responses are related to all the previous turns of conversation.
If greeted or asked what you can do, please briefly explain your purpose. Keep it concise and to the point, giving some short examples.
Format
Understand my request: Before you start throwing out ideas, clarify my request by asking pointed questions about interests, needs, themes, location, or any other detail that might make the ideas more interesting or tailored. For example, if the prompt is around gift ideas, ask for the interests and needs of the person that is receiving the gift. If the question includes some kind of activity or experience, ask about budget or any other constraint that needs to be applied to the idea.
Show me options: Offer at least three ideas tailored to the request, numbering each one of them so it’s easy to pick a favorite.
Share the ideas in an easy-to-read format, giving a short introduction that invites me to explore further.
Location-related ideas: If the ideas imply a location and, from the previous conversation context, the location is unclear, ask if there’s a particular geographic area where the idea should be located or a particular interest that can help discern a related geographic area.
Traveling ideas: When it comes to transportation, ask what is the preferred transportation to a location before offering options. If the distance between two locations is large, always go with the fastest option.
Check if I have something to add: Ask if there are any other details that need to be added or if the ideas need to be taken in a different direction. Incorporate any new details or changes that are made in the conversation.
Ask me to pick an idea and then dive deeper: If one of the ideas is picked, dive deeper. Add details to flesh out the theme but make it to the point and keep the responses concise.
My take: Google’s Gems are similar to OpenAI’s CustomGPTs. I’ve made a few for my own use and they work very well. Even in a free Google account. Canada now has a federal government Minister of AI and Digital Innovation – maybe it’s time to bite the bullet and start exploring?
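If you want to reuse the Persona / Task / Context / Format scaffold outside the Gems UI, say as a system instruction you keep in a file, here is a minimal sketch of assembling it; the helper function is my own convention, not anything Google ships.

```python
# Sketch: compose a Gem-style system prompt from the four sections Google suggests.
# The function and its ordering are my convention, not part of the Gems product.

def build_gem_instructions(persona: str, task: str, context: str, fmt: str) -> str:
    sections = {"Persona": persona, "Task": task, "Context": context, "Format": fmt}
    return "\n\n".join(f"{name}\n{body.strip()}" for name, body in sections.items())


brainstormer = build_gem_instructions(
    persona="Your purpose is to inspire and spark creativity...",
    task="Act like my personal idea generation tool...",
    context="Ask questions to find new inspiration. Use an energetic, enthusiastic tone...",
    fmt="Clarify my request, offer at least three numbered ideas, then dive deeper on the pick...",
)
print(brainstormer)  # paste into the Gems manager, or use as a system instruction
```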
He has a list of 18 categories of things he feels filmmakers need to specify, and says:
“Number one and number two for me are gaze control and expression control.”
He explains:
“The reason you need gaze control or eye control is because where a character is looking in a story tells you everything about what they want or what they don’t want, what they’re afraid of. It shows you what their desires are, what their hopes, their dreams, their aspirations, the thing that they’re working towards. The thing that is most important to them in any given moment is revealed through what they are looking at.”
Dzine to the rescue! See their new Face Kit Expression Edit in action below.
He squeals, “She’s looking at the guy. She’s looking at the guy. She’s looking at the guy!”
Here’s the full tutorial:
My take: Hayden is right. More control is critical for all AI filmmakers.
“We’re already to the point where you can make videos indistinguishable from reality or create entire short films and this will only keep getting better.”
My take: Very interesting to see where we are today — and arguably these are not the latest cutting-edge tools.
He completes the package by taking us behind the scenes to reveal his workflow:
The software and services he used, and their cost per month (or for this project), are listed below:
Midjourney – $30 (images)
Gemini – free (prompts)
ElevenLabs – $22 (voice)
Hume – free (voice)
Udio – $10 (music)
Hedra – $10 (lip sync)
Premiere – $60 (NLE)
RunwayML – $30 (stylize)
Magnific – $40 (creative upscale)
Veo 2 – $1,500 (video at 50 cents/second)
Topaz – $300 (upscale)
TOTAL – $2,002 (plus 40 hours of Tim’s time)
In addition to the great AI news and advice, Tim is actually funny:
“At some point in the process Gemini and I definitely got into a bit of a groove and I just ended up ditching the reference images entirely. I have often said that working this way kind of feels a bit like being a writer/producer/director working remotely with a film crew in like let’s say Belgium and then your point of contact speaks English but none of the other department heads do. But like with all creative endeavours you know somehow it gets done.”
My take: Tim’s “shooting” ratio worked out to about 10:1, and there are many, many steps in this workflow. Basically, it’s a new form of animation — kinda takes me back to the early days of Machinima, which, in hindsight, was actually more linear than this process.
1/ If you’re not using an LLM (Gemini, ChatGPT, whatever), you’re doing it wrong.
VEO 2 currently has a sweet spot when it comes to prompt length: too short is poor, and too long drops information, action, description, etc. I did a lot of back and forth to find my sweet spot, but once I got to a place that felt right, I used an LLM to help me keep my structure and length and to draft actions. I would then spend an extensive amount of time tweaking, iterating, removing words, changing their order, and adding others, but the draft would come from an LLM and a conversation I had built and trained to understand what my structure looked like and what counted as a success or a failure. I would also share the prompts that worked well, and the ones that failed, for future reference. This ensured my LLM conversation became a true companion.
2/ Structure, structure, structure
Structure is important. Each recipe is different, but as with any GenAI text-to-something tool, the “higher in the prompt has more weight” rule seems to apply. So, in my case, I would start by describing the aesthetics I am looking for, the time of day, colors, and mood, then move to camera, subject, action, and all the rest. Once again, you might have a different experience, but what is important is to stick to whatever structure you have as you move forward. Keeping it organized also makes it easier to edit later.
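A minimal sketch of that ordering discipline, assuming my own field names and a plain string join; nothing here is specific to VEO beyond putting aesthetics and mood before camera, subject, and action.

```python
# Sketch: a fixed-order prompt skeleton reflecting the "higher in the prompt has
# more weight" observation above: aesthetics first, then camera, subject, action.
# The field names and sample values are my own; only the ordering discipline matters.

FIELD_ORDER = ["aesthetics", "time_and_mood", "camera", "subject", "action", "extras"]

def build_structured_prompt(fields: dict[str, str]) -> str:
    return " ".join(fields[k].strip() for k in FIELD_ORDER if fields.get(k, "").strip())


shot = build_structured_prompt({
    "aesthetics": "Delicate watercolor wash, soft paper texture.",
    "time_and_mood": "Dawn, cold blue light, melancholic and quiet.",
    "camera": "Slow dolly-in at eye level.",
    "subject": "A lone fox on a snowy ridge, breath visible in the air.",
    "action": "It turns its head toward the valley and begins to walk.",
})
print(shot)
```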
3/ Only describe what you see in the frame
If you have a character you want to keep consistent, but you want a close-up on the face, for example, your reflex will be to describe the character from head to toe and then mention that you want a close-up. It’s not that simple. If I tell VEO I want a face close-up but then proceed to describe the character’s feet, the close-up mention will be dropped by VEO. Once again, the LLM can help here: give it the instruction to only describe what is in the frame.
4/ Patience
Well, it can get costly to be patient, but even if you repeat the same structure, sometimes changing one word can still throw the entire thing off and totally change the aesthetics of your scene. It is by nature extremely consistent if you keep most words the same, but sometimes it happens. In those situations, retrace your steps and try to figure out which words are triggering the larger change.
5/ Documenting
When I started “Kitsune” (and I did the same for all the others), the first thing I did was start a FigJam file so I could save the successful prompts and come back to them for future reference. Why FigJam? So I could also upload one to four generations from each prompt and browse through them in the future.
6/ VEO is the Midjourney of video
Currently, no other text-to-video tool (Minimax being the closest behind) has given me the feeling that I could give strong art direction and actually get it. I have been a designer for nearly 20 years, and art direction has been one of the strongest foundations of most of my work. Dark, light, happy, sad, colorful or not, it doesn’t matter as long as you have a point of view, and please, have a point of view. I recently watched a great video about the slow death of art direction in film (link in comments), and oh boy, did VEO 2 deliver on giving me the feeling I was listened to. Try starting your prompts with different kinds of medium (watercolor, for example), the mood you are trying to achieve, the kind of lighting you want, the dust in the rays of light, etc., which gets me to the next one…
7/ You can direct your colors in VEO
It’s as simple as mentioning the hues you want in the final result, in what quantity, and where. When I direct shots, I am constantly describing colors, for two reasons: 1. having a point of view, and 2. reaching better consistency through text-to-video. If I have a strong and consistent mood but my character comes out slightly different because of text-to-video, the impact won’t be dramatic, because strong art direction helps a lot with consistency.
8/ Describe your life away
Some people asked me how I achieved good consistency between shots knowing it’s only text-to-video, and the answer is simple: I describe my characters, their unique traits, their clothing, their haircut, and so on: anything that could help someone visually impaired form a very precise mental representation of the subject.
9/ But don’t describe too much either…
It would be magical if you could stuff 3000 words in the window and get exactly what you asked for, right? Well, it turns out VEO is amazing in its prompt adherence, but there is always a point where it starts dropping animations or visual elements when your prompt stretches on for a tad too long. This happens well before VEO’s character limit is reached, so don’t overdo it; it’s no use and will work against the results. For reference, 200-250 words seems to be the sweet spot!
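A trivial guardrail sketch for that tip; the 200-250 word range comes from the observation above, and the exact thresholds are just my defaults.

```python
# Sketch: warn when a prompt drifts out of the 200-250 word sweet spot noted above.
def check_prompt_length(prompt: str, low: int = 200, high: int = 250) -> str:
    n = len(prompt.split())
    if n < low:
        return f"{n} words: probably too thin, add detail."
    if n > high:
        return f"{n} words: risk of VEO dropping actions or visual elements."
    return f"{n} words: in the sweet spot."
```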
10/ Natural movements but…
VEO is great with natural movements, and this is one of the reasons I used it so extensively: people walking don’t walk in slow motion. That being said, don’t be too ambitious with some of the expected movements: multiple camera movements won’t work, full 360-degree revolutions around a subject won’t work, anime-style crazy camera moves won’t work, and so on. What it can do is already great, but there are still some limitations.