Researchers at Stanford University, the Max Planck Institute for Informatics, Princeton University, and Adobe Research have developed a technique that synthesizes new video frames from an edited interview transcript.
In other words, soon we’ll be able to alter speech in video clips simply by typing in new words:
“Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript.”
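To make the stitching idea above concrete, here is a minimal, purely illustrative sketch of the selection-and-concatenation step. All names and data structures are my own assumptions: the actual method matches subword units (phonemes/visemes), optimizes segment choice, and feeds the stitched face-model parameters to a neural renderer, none of which is shown here.

```python
# Hypothetical sketch of transcript-driven segment stitching.
# The real pipeline works on phoneme/viseme units and a 3D parametric
# face model; here each "segment" is just a word with stand-in
# per-frame parameters (floats).

from dataclasses import dataclass

@dataclass
class Segment:
    word: str      # token from the original transcript
    params: list   # per-frame face-model parameters (stand-ins)

def stitch(corpus, edited_transcript):
    """Pick corpus segments matching the edited transcript and
    concatenate their per-frame parameters. (The paper's method
    instead optimizes over subword segments and blends the seams.)"""
    index = {seg.word: seg for seg in corpus}
    frames = []
    for word in edited_transcript.split():
        seg = index.get(word)
        if seg is None:
            raise KeyError(f"no corpus material for {word!r}")
        frames.extend(seg.params)
    # In the actual pipeline these parameters would drive an
    # intermediate face render, then a recurrent generation network.
    return frames

corpus = [
    Segment("hello", [0.1, 0.2]),
    Segment("world", [0.3]),
    Segment("there", [0.4, 0.5]),
]
print(stitch(corpus, "hello there"))  # [0.1, 0.2, 0.4, 0.5]
```

The point of the sketch is only that editing text reorders annotated source material; the photorealism comes from the later rendering stages the quote describes.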
Why do this?
“Our main application is text-based editing of talking-head video. We support moving and deleting phrases, and the more challenging task of adding new unspoken words. Our approach produces photo-realistic results with good audio to video alignment and a photo-realistic mouth interior including highly detailed teeth.”
Read the full research paper.
My take: Yes, this could be handy in the editing suite. But the potential for abuse is deeply concerning. If creating deepfakes becomes as easy as typing new words, we may never be able to trust any video again. No longer will a picture be worth a thousand words; rather, one word will be worth a thousand pixels.