This article is part 2 of my 2-part series about voice audio manipulation. You can find part 1 here.
In part 1 I discussed several methods of faking voice recordings. However, part 1 was just the appetizer for our main course, which is this article. In part 1 I intentionally left one particular voice manipulation method out. It is the most powerful of them all -- and to be honest, it is the scariest one, too.
AI -- Artificial Intelligence
Do you remember the scene from the movie Terminator 2 that I gave you as a hint at the end of part 1? Here is a slightly longer version of that clip (less than 3 minutes long):
So what did we see here?
Arnold Schwarzenegger played a reprogrammed terminator in that movie. A terminator is a humanoid robot that is usually sent out to terminate people in a dystopian future. The terminator portrayed by Arnold was reprogrammed and traveled back in time to save John, a young boy, from a more advanced T-1000 terminator robot, which was sent back in time to kill that same boy.
What happens in that scene is that the more advanced T-1000 has shape-shifted into the boy's foster mother. The foster father does not realize that, and after he says something that annoys the shape-shifted terminator, it terminates him... The "Arnold terminator" is with the boy in the phone booth. The shape-shifting terminator is not aware that Arnold traveled back in time as well. Arnold speaks in the boy's voice and uses a fake name for the boy's dog. If Arnold had been talking to John's real foster mother, she would have noticed that discrepancy. However, the T-1000 apparently does not know the dog's name and is unaware of the trap Arnold has laid out for it.
So much for the backstory.
Now we get to the part that is the main topic of this article:
The two terminator robots have a conversation over the phone. The T-1000 pretends to be the boy's foster mother, Arnold pretends to be the boy. How do they try to fool each other? By imitating the corresponding human voice. In fact, they do it so perfectly that nobody can tell whether they are speaking to the real human or to a robot.
The shape-shifting T-1000 is not only capable of taking on the appearance of pretty much any solid object of a certain size, it can also generate any human voice. It not only matches the tone of the voice but also imitates the little peculiarities of how a person speaks. Arnold's terminator is an earlier model and is not capable of shape-shifting, but the voice generation part was already mastered at the time of his creation (or maybe that feature was added via an over-the-air update -- who knows 🤷‍♂️).
The point is: In 1991, when the movie Terminator 2 came out, realistic computer voice generation was just as much science fiction as shape-shifting.
-- Well, not anymore...
Introducing "Two Minute Papers"
What do the terminators need in order to imitate a person's voice perfectly? A few seconds of hearing that person speak is enough. In that science-fiction movie from 1991.
Let's compare to real science from 2019.
The result: About 5 seconds of voice recording is enough to imitate a person's voice and way of speaking almost perfectly.
How crazy is that? 1991's science fiction became pretty much a reality 28 years later!
And keep in mind that this result is the worst that computer-generated voice audio will ever be. (It can only get better in the future.) And this paper is almost one year old, so there are probably even more capable AIs available that run circles around this one...
The take-away message is:
No, we cannot trust our ears.
Is there a way to detect manipulated or manufactured voice recordings?
There might be, but it has already become very hard for a human to decide what is real and what is not. Our best bet is probably to use an AI to distinguish between artificial voice output and a real person speaking.
But there is a catch: With every improvement in detection, an improvement in creation usually follows closely. The reason is that you can use a detection AI to "train" (improve) a generating AI: whatever the detector still flags as fake tells the generator what it has to fix. It's a vicious circle. In the future, it will just be even harder for humans to decide what is real and what is not... Maybe it will be impossible to tell real and fake voices apart at some point -- just like in the movie.
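To make that vicious circle a bit more tangible, here is a deliberately tiny Python sketch. It is only a cartoon of the idea and not how a real voice AI or detector is built: a single made-up number stands in for "how natural a voice sounds", and all names and values (`REAL_FEATURE`, `train_detector`, `improve_generator`, `arms_race`) are hypothetical and invented for this illustration.

```python
# Toy illustration only: one number stands in for "how natural a voice sounds".
# Every name and value here is hypothetical, not taken from any real system.
REAL_FEATURE = 1.0  # assumed value for genuine recordings


def train_detector(fake_feature: float) -> float:
    """Detector update: put the decision threshold halfway between real and fake."""
    return (REAL_FEATURE + fake_feature) / 2.0


def detector_margin(threshold: float, fake_feature: float) -> float:
    """How far the fake sits from the threshold; smaller means harder to detect."""
    return abs(threshold - fake_feature)


def improve_generator(threshold: float) -> float:
    """Generator update: use the detector's feedback and move up to its threshold."""
    return threshold


def arms_race(rounds: int, start: float = 0.0) -> list:
    """Alternate detector and generator updates; record the detector's margin."""
    fake = start
    margins = []
    for _ in range(rounds):
        threshold = train_detector(fake)          # detection improves
        margins.append(detector_margin(threshold, fake))
        fake = improve_generator(threshold)       # creation follows closely
    return margins


print(arms_race(5))  # prints [0.5, 0.25, 0.125, 0.0625, 0.03125]
```

In this toy setup the detector's margin halves in every round: each time the detector gets better, the generator uses exactly that improvement to move closer to the real thing. Real systems are far more complicated, but the back-and-forth follows the same pattern.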
But you might think
If I see the person speaking in a video, I will be able to tell whether the audio is fake or real, right? If it is fake, the audio won't match the movement of the lips!
Good idea, but have you heard of deepfakes? That is the topic for the next installment of this Internet Gold series... And what we discuss there might shock you. You might want to sit down for that one. But don't worry, not today. I will probably finish writing the article tomorrow. So stay tuned!
What do you think?
Will we ever find a reliable way to check the authenticity of videos or audio recordings? Could blockchain technology help with that (I am thinking of NFTs -- non-fungible tokens)?
Let me know in the comment section!
Speaking of comments
Have you heard about The Comment League, which fellow author @Macronald wants to establish starting November 2020? The goal is to increase meaningful discourse and reader interaction by rewarding those readers who write high-quality comments. Check it out!
Bonus Content -- Make your computer speak like you!
If you are tech-savvy, navigate to this GitHub repository where you can download an unofficial implementation of the AI which was demonstrated in the above video. You can install the software by following the steps (it is not super hard, but also not beginner friendly). Once installed, you can use a short recording of your own voice to train the AI. If everything works out, your computer will speak exactly like you. How cool is that?!
Incredible! Complete bonkers! And what? 5 seconds just, to imitate someone else's voice. Unbelievable! Really can't trust my ears and eyes.
You know what I remember this X factor audition, she sounds so amazing they thought she's using some recorded audio and just lip-synching. Sorry but that's the first thing to come in my mind while watching and reading your post. https://youtu.be/1Eti-sFr2ds