AI Video Prompt Battle!

Kevin Hutson of Futurepedia.io has just Tested The Most Complex AI Video Prompts to See What’s Possible.

The four AI video generators he compared are:

He concludes:

“We’re already to the point where you can make videos indistinguishable from reality or create entire short films and this will only keep getting better.”

My take: Very interesting to see where we are today — and arguably these are not the latest cutting-edge tools.

Tim’s AI Workflow for “The Bridge”

Tim Simmons of Theoretically Media made a CGAI (Computer Generated Artificial Intelligence) short film using Google’s new Veo 2 model:

He completes the package by taking us behind the scenes to reveal his workflow:

Here are the software and services he used, with their cost per month (or for this project):

  1. Midjourney – $30 (images)
  2. Gemini – free (prompts)
  3. ElevenLabs – $22 (voice)
  4. Hume – free (voice)
  5. Udio – $10 (music)
  6. Hedra – $10 (lip sync)
  7. Premiere – $60 (NLE)
  8. RunwayML – $30 (stylize)
  9. Magnific – $40 (creative upscale)
  10. Veo 2 – $1,500 (video at 50 cents/second)
  11. Topaz – $300 (upscale)
    TOTAL – $2,002 (plus 40 hours of Tim’s time)

In addition to the great AI news and advice, Tim is actually funny:

“At some point in the process Gemini and I definitely got into a bit of a groove and I just ended up ditching the reference images entirely. I have often said that working this way kind of feels a bit like being a writer/producer/director working remotely with a film crew in like let’s say Belgium and then your point of contact speaks English but none of the other department heads do. But like with all creative endeavours you know somehow it gets done.”

My take: Tim’s “shooting” ratio worked out to about 10:1, and there are many, many steps in this workflow. Basically, it’s a new form of animation. It kinda takes me back to the early days of Machinima, which, in hindsight, was actually more linear than this process.

BONUS

Here is the Veo 2 Cheat Sheet by Henry Daubrez that Tim mentions.

1/ If you’re not using an LLM (Gemini, ChatGPT, whatever), you’re doing it wrong.

VEO 2 currently has a sweet spot when it comes to prompt length: too short is poor, too long drops information, action, description, etc. I did a lot of back and forth to find my sweet spot, but once I got to a place that felt right, I used an LLM to help me keep my structure and length, and to help me draft actions. I would then spend an extensive amount of time tweaking, iterating, removing words, changing their order, and adding others, but the draft would come from an LLM and a conversation I had built and trained to understand what my structure looked like and what counted as a success or a failure. I would also share the prompts that worked well, and the failures too, for further reference. This ensured my LLM conversation became a true companion.
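As a purely illustrative sketch (not from Henry’s post), here is how that kind of LLM companion could be scripted with Google’s @google/generative-ai client; the model name, instruction text, and structure fields are my own assumptions:

```typescript
// Sketch only: drafting Veo 2 prompts with Gemini. The model name and the
// instruction wording are illustrative assumptions, not Henry's exact setup.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

async function draftVeoPrompt(sceneIdea: string): Promise<string> {
  const instruction = [
    "Draft a Veo 2 prompt of roughly 200-250 words.",
    "Follow this order: aesthetics and mood, time of day, colors,",
    "camera, subject, action. Only describe what is visible in the frame.",
    `Scene idea: ${sceneIdea}`,
  ].join("\n");

  const result = await model.generateContent(instruction);
  return result.response.text();
}

// Example: get a first draft, then tweak it by hand, as Henry describes.
draftVeoPrompt("a fox spirit crossing a snowy shrine courtyard at dawn")
  .then(console.log);
```

In practice, Henry works interactively in a chat, feeding successes and failures back into the same conversation rather than scripting it.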

2/ Structure, structure, structure

Structure is important. Each recipe is different, but as with any GenAI text-to-something tool, the “higher in the prompt carries more weight” rule seems to apply. So, in my case, I start by describing the aesthetics I am looking for, time of day, colors, and mood, then move to camera, subject, action, and all the rest. Once again, you might have a different experience, but what matters is sticking to whatever structure you have as you move forward. Keeping it organized also makes it easier to edit later.
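To make the idea concrete, here is a small, hypothetical helper (my own, not Henry’s) that locks every shot description into one fixed field order:

```typescript
// Illustrative only: one way to keep a consistent prompt structure so every
// shot lists the same fields in the same order. Field names are my own.
type ShotSpec = {
  aesthetics: string; // medium, mood, time of day, colors
  camera: string;     // framing and movement
  subject: string;    // only what is visible in the frame
  action: string;     // what happens during the shot
};

function buildVeoPrompt(shot: ShotSpec): string {
  // Higher in the prompt carries more weight, so aesthetics lead.
  return [shot.aesthetics, shot.camera, shot.subject, shot.action].join(". ") + ".";
}

console.log(buildVeoPrompt({
  aesthetics: "Watercolor look, soft dawn light, muted teal and amber palette",
  camera: "Slow dolly-in at eye level",
  subject: "A lone fisherman in a yellow raincoat on a wooden pier",
  action: "He coils a rope, pauses, and looks out over the fog",
}));
```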

3/ Only describe what you see in the frame

If you have a character you want to keep consistent, but you want a close-up on the face, for example, your reflex will be to describe the character from head to toe and then mention you want a close-up… It’s not that simple. If I tell VEO I want a face close-up but then proceed to describe the character’s feet, the close-up mention will be dropped by VEO… Once again, an LLM can help you here if you instruct it to describe only what is in the frame.

4/ Patience

Well, it can get costly to be patient, but even if you repeat the same structure, sometimes changing one word can still throw the entire thing off and totally change the aesthetics of your scene. VEO is by nature extremely consistent if you keep most words the same, but sometimes it happens. In those situations, retrace your steps and try to figure out which words are triggering the larger change.

5/ Documenting

When I started “Kitsune” (and I did the same for all the others), the first thing I did was start a FigJam file so I could save the successful prompts and come back to them for future reference. Why FigJam? So I could also upload one to four generations from each prompt and browse through them in the future.

6/ VEO is the Midjourney of video

Currently, no other text-to-video tool (Minimax being the closest behind) has given me the feeling that I could provide strong art direction and actually get it. I have been a designer for nearly 20 years, and art direction, to me, has been one of the strongest foundations of most of my work. Dark, light, happy, sad, colorful or not, it doesn’t matter as long as you have a point of view, and please… have a point of view. I recently watched a great video about the slow death of art direction in film (link in comments) and oh boy, did VEO 2 deliver on giving me the feeling I was listened to. Try starting your prompts with different kinds of medium (watercolor, for example), the mood you are trying to achieve, the kind of lighting you want, the dust in the rays of light, etc., which brings me to the next one…

7/ You can direct your colors in VEO

It’s as simple as mentioning the hues you want to have in the final result, in which quantity, and where. When I direct shots, I am constantly describing colors for two reasons: 1. Well, having a point of view and 2. reaching better consistency through text-to-video. If I have a strong and consistent mood but my character is slightly different because of text-to-video, the impact won’t be dramatic because a strong art direction helps a lot with consistency.

8/ Describe your life away

Some people asked me how I achieved good consistency between shots given that it’s only text-to-video, and the answer is simple: I describe my characters, their unique traits, their clothing, their haircut, and so on: anything that could help someone visually impaired form a very precise mental representation of the subject.

9/ But don’t describe too much either…

It would be magical if you could stuff 3,000 words into the window and get exactly what you asked for, right? Well, it turns out VEO is amazing with its prompt adherence, but there is always a point where it starts dropping animations or visual elements when your prompt stretches a tad too long. This actually happens well before VEO’s character limit is reached, so don’t overdo it; it’s no use and will work against your results. For reference, 200-250 words seems like the sweet spot!

10/ Natural movements but…

VEO is great with natural movements and this is also one of the reasons why I used it so extensively: people walking don’t walk in slow-motion. That being said, don’t try to be too ambitious on some of the expected movements: multiple camera movements won’t work, full 360 revolutions around a subject won’t work, anime-style crazy camera movements won’t work, etc… what it can do is already great, but there are still some limitations…

Cinema-grade add-on lenses for iPhones?

Jourdan Aldredge on No Film School invites us to Meet the World’s First Cinema-Grade Mobile Lenses for iPhone.

There are at least half a dozen brands of add-on lenses for iPhone cinematography, but these, from ShiftCam working with TUSK, promise to be the first that are genuinely cinema-grade.

Beyond the optical quality and build, consider their best use cases:

  1. Discreet Filming in Crowds
  2. Fast-Paced B-Roll Capture
  3. Overhead & Tight Space Shots
  4. Quick Transitions Between Shots
  5. Budget-Friendly Aerial & Water Shots
  6. Scouting Locations
  7. Creative Time-Lapse & Motion Effects
  8. Multi-Cam Filming with Multiple Phones
  9. Professional-Quality Live Streaming
  10. Filming in Extreme Weather Conditions

Here’s the link to the Kickstarter campaign. Not cheap.

My take: I would love to see real-world test footage and charts from these lenses.

Workflow to create aerial clips

Rory Flynn has shared a workflow that uses a combination of AI tools to create aerial clips.

The tools are: Claude 3.7, Magnific and Runway.

The workflow is:

  1. Build a 3D Render in Claude 3.7
  2. Program in camera movements
  3. Screen record the render
  4. Upload this video to Runway Gen-3
  5. Extract the first frame
  6. Apply a Magnific Structure Reference to the first frame
  7. Upload this new first frame to Runway
  8. Apply the new first frame to the initially rendered video using Runway Restyle.

The Claude prompt he used in Step 1 is: “can you code a 3d version of [subject + env] in three.js?” E.g. “can you code a 3d version of an epic castle atop a mountain plateau in a valley in three.js?”
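For context, a minimal sketch (my own, not Rory’s actual output) of the kind of three.js scene and programmed camera move that Steps 1-3 produce might look like this:

```typescript
// Sketch only: stand-in geometry for "an epic castle atop a mountain plateau"
// plus a slow aerial orbit you would screen-record for Runway.
import * as THREE from "three";

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(
  60, window.innerWidth / window.innerHeight, 0.1, 1000
);
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// Mountain plateau (a cone) and a simple keep (a box) on top of it.
const mountain = new THREE.Mesh(
  new THREE.ConeGeometry(20, 30, 6),
  new THREE.MeshStandardMaterial({ color: 0x556b2f })
);
scene.add(mountain);

const keep = new THREE.Mesh(
  new THREE.BoxGeometry(6, 8, 6),
  new THREE.MeshStandardMaterial({ color: 0xdddddd })
);
keep.position.y = 19; // sit the keep on the plateau
scene.add(keep);

scene.add(new THREE.AmbientLight(0xffffff, 0.6));
const sun = new THREE.DirectionalLight(0xffffff, 1);
sun.position.set(50, 80, 20);
scene.add(sun);

// Step 2: the programmed camera movement, here a slow orbit.
let angle = 0;
function animate() {
  requestAnimationFrame(animate);
  angle += 0.003;
  camera.position.set(Math.cos(angle) * 60, 35, Math.sin(angle) * 60);
  camera.lookAt(0, 10, 0);
  renderer.render(scene, camera);
}
animate();
```

Screen-recording this orbit gives you the rough motion pass that Runway then restyles in the later steps.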

The Magnific Structure Reference he used in Step 6 is: “editorial photo, epic castle on a plateau, intricate rocky textures and fine details, immaculate New Zealand landscape, white marble castle, high precision photography” with these settings:

  • Model: Mystic 2.5
  • Structure Reference
  • Structure Strength: 52%
  • Resolution: 2k
  • Creative Detailing: 75%
  • Engine: Magnific Sharpy

See his X post or LinkedIn post.

See an interview with Rory on AI in business.

My take: amazing!

Riffusion generates full songs effortlessly

Riffusion has just opened a public beta and it rocks!

Riffusion is the brainchild of Hayk Martiros and Seth Forsgren.

“Our goal is to make everyone into a musician and bring a future where music is interactive and personalized.”

TechCrunch reported their $4M seed funding in October 2023.

My take: damn! Not only will this create full songs, it will also create stems you can download for further modification in your DAW of choice.

Best Open Source TTS: Kokoro

There is a new open source Text to Speech generator in town called Kokoro-82M.

As far as I can determine, it’s being developed by one person, Hexgrad, based on earlier models.

Apparently, this is something you can install and run locally on your own computer.

You can try it out online here. You can also compare various open source models at the TTS Arena.

My take: note that this does not clone voices or emote (at all). Perhaps in the next version?

Generated Video and Emotions

Haydn Rushworth has just released COMPARED: 10 AI Emotions – Minimax / Hailuo.ai I2V-01-live vs KLING, VIDU, Runway.

He compares Minimax with Runway, Vidu and Kling.

His conclusions?

Runway was the most sedate whereas Kling was all over the place. Vidu was good, but Minimax was his favourite.

Tao Prompts also compares Sora, Kling, Minimax and Runway.

He concludes that Runway doesn’t tend to add much emotion at all.

My take: it appears that Minimax may be the best platform to generate video from images at the close of 2024. What will 2025 bring us?

How to Create Consistent AI Characters

Caleb Ward of Curious Refuge has released 2024’s best summary of how to Create Consistent Realistic Characters Using AI.

He suggests using Fal.AI to train a custom LoRA (fal.ai/models/fal-ai/flux-lora-fast-training) with at least 10 images of the subject. Then use this model to generate images (fal.ai/models/fal-ai/flux-lora) and increase their resolution using an up-res tool. Finally, you can now move on to animating them.
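Here is a rough, hedged sketch of what that two-step Fal.AI pipeline could look like in code. The endpoint IDs come from Caleb’s video; the client call and input field names (images_data_url, trigger_word, loras) are my assumptions about the @fal-ai/client package and should be checked against the fal.ai model pages:

```typescript
// Sketch only: train a character LoRA, then generate with it via fal.ai.
// Verify input/output field names against the current fal.ai docs.
import { fal } from "@fal-ai/client";

fal.config({ credentials: process.env.FAL_KEY ?? "" });

async function trainCharacterLora(zipOfTrainingImages: string) {
  // Step 1: train a custom LoRA on 10+ images of the character.
  const training = await fal.subscribe("fal-ai/flux-lora-fast-training", {
    input: {
      images_data_url: zipOfTrainingImages, // URL to a .zip of the photos (assumed field name)
      trigger_word: "mychar",               // hypothetical trigger word
    },
  });
  return training; // the result includes a URL to the trained LoRA weights
}

async function generateWithLora(loraUrl: string) {
  // Step 2: generate new images of the same character using that LoRA.
  return fal.subscribe("fal-ai/flux-lora", {
    input: {
      prompt: "portrait photo of mychar walking through a rainy street",
      loras: [{ path: loraUrl, scale: 1 }], // assumed field names
    },
  });
}
```

The up-res and animation steps then happen in whatever tools you prefer, as Caleb outlines.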

CyberJungle, the YouTube channel of Hamburg-based Senior IT Product Manager Cihan Unur, also posted How to Create Consistent Characters Using Kling AI.

He details how to train a LoRA on Kling using at least eleven videos of your character. Admittedly, this pipeline is a little more involved. He also suggests Freepik as another option.

My take: basically, if you can imagine it, you can now create it.

Any pose in MJ: ECU on a detail and then ZOOM OUT

Glibatree (Ben Schade) recently implored on YouTube: Do THIS to Create Amazing Poses in Midjourney!!!

The problem with a lot of image generators is that they love selfies: front-facing portraits. But what if you want a profile? Ben has a two-step work-around:

“Generate a close-up photo of your subject’s ear and then use the editor to zoom out and create the rest of the image.”

He explains:

“The reason this works is because what Midjourney needed was a pattern interrupt. Take advantage of its usual way to generate images by finding the usual way to generate an image with a more unusual focus. It’s better to choose a focus that is already often viewed from the angle we want.

  • focus on a ponytail if we want to see the back of someone’s head
  • use a receding hairline to see someone from straight above
  • focus on the back pocket of a pair of jeans if you want the…
  • I wouldn’t recommend looking up someone’s nostril (I mean it’s an angle that works but I just wouldn’t recommend it.)

The point is we can generate any of these things using extremely simple prompts and get very unusual angles to be seeing a person from. And then starting from there once we have the angle well defined we can simply zoom out and make our chosen feature less prominent by changing our prompt to something else and so in the new image the angle we wanted is extremely well defined not by tons of keywords but by the part of the image we already generated.”

This works for Expressions as well. He explains:

“If we start with a photo of just a smile or just closed eyes or just a mischievous smirk, Midjourney will spend all of its effort to create a high quality closeup version of the exact expression we wanted that now, in just one more generation, we can apply to our character by simply zooming out.”

My take: thank you, Ben, for cracking the code!