Generating Deepfakes With Custom Audio and Lipsync

Open source AI tools are improving at an incredible pace. I’ve been playing with a few of them and wanted to share some of the results. The goal was to generate a deepfake from an existing movie scene and replace the audio with a custom voice, with the constraint that everything runs locally on an M2 MacBook. The result is a video of a character saying something they never said, with their lips moving in sync with the new audio. The conversion process breaks down into three steps:

Faceswap - Extract and substitute faces

The first step is to swap a new face into the video. This is done with Roop. For conda users, add the following to .condarc to make sure the correct packages are installed for Apple Silicon:

channels:
 - main/osx-arm64
 - main/noarch
 - free/osx-64
 - free/noarch
 - r/osx-64
 - r/noarch
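
With the channels configured, environment setup might look something like this (the environment name and Python version are illustrative; check the Roop README for current requirements):

# Clone Roop and install its dependencies into a fresh conda environment.
git clone https://github.com/s0md3v/roop
cd roop
conda create -n roop python=3.10
conda activate roop
pip install -r requirements.txt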

yt-dlp and ffmpeg can be used to download and trim the original video, as sketched below.
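
A rough example; the URL, timestamps, and filenames are placeholders:

# Download the clip, then cut out the scene of interest.
yt-dlp -f mp4 -o source.mp4 "<video_url>"
ffmpeg -i source.mp4 -ss 00:00:10 -to 00:00:25 -c:v libx264 -c:a aac input.mp4

With input.mp4 prepared, run the faceswap model as follows: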

python run.py --gpu-vendor=apple --gpu-threads=32  -f=face.jpg -t=input.mp4 -o=faceswap.mp4
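
Here -f is the source face image, -t the target video, and -o the output path, while --gpu-threads controls parallelism. Flag names have changed between Roop releases, so verify against python run.py --help for your checkout.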

Personalized TTS - Generate custom audio from text

Next, we generate custom audio to overlay on the video. Tortoise-TTS offers a simple way to generate speech from text, and supports zero-shot voice cloning from three or four ~10-second samples of the target voice. The audio is generated with the following command:

python tortoise/do_tts.py --text "<overlay_text>" --voice <custom_voice_sample_folder> --preset fast
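
Tortoise looks up voices by folder name under tortoise/voices/, and its README recommends short WAV clips at 22,050 Hz. A hypothetical prep step for a voice called my_voice (clip filenames are illustrative):

mkdir -p tortoise/voices/my_voice
# Convert each reference recording to mono 22.05 kHz WAV.
ffmpeg -i clip1.m4a -ar 22050 -ac 1 tortoise/voices/my_voice/1.wav
ffmpeg -i clip2.m4a -ar 22050 -ac 1 tortoise/voices/my_voice/2.wav

By default do_tts.py writes its generated clips to a results/ directory; pick the best one and save it as custom_audio.wav for the next step.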

Lipsync - Sync lips with custom audio

Finally, merge the faceswapped video with the custom TTS audio to produce the deepfake. Simply replacing the audio track is not enough, since the lips would be out of sync with the new speech. Wav2Lip re-renders the mouth region to match the audio:

python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face faceswap.mp4 --audio custom_audio.wav --outfile deepfake.mp4
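
The wav2lip_gan.pth checkpoint is not bundled with the code; it is linked from the Wav2Lip repository (github.com/Rudrabha/Wav2Lip) and must be downloaded separately. For contrast, the naive approach of just muxing the new audio over the faceswapped video would look something like this, and produces visibly mismatched lips:

# Naive mux: copy the video stream, swap in the new audio track.
ffmpeg -i faceswap.mp4 -i custom_audio.wav -map 0:v:0 -map 1:a:0 -c:v copy -shortest naive.mp4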