Internet Gold | Can You Trust Your Ears? Two Minute Papers (Pt 2 of 2)


This article is part 2 of my 2-part series about voice audio manipulation. You can find part 1 here.

In part 1 I discussed several methods of faking voice recordings. However, part 1 was just the appetizer for our main course -- which is this article. In part 1 I intentionally left one particular voice manipulation method out. It is the most powerful of them all -- and, to be honest, the scariest one, too.

AI -- Artificial Intelligence

Do you remember the scene from the movie Terminator 2 that I gave you as a hint at the end of part 1? Here is a slightly longer version of that clip (less than 3 minutes long):

So what did we see here?

Arnold Schwarzenegger played a reprogrammed terminator in that movie. A terminator is a humanoid robot that is usually used for terminating people in a dystopian future. The terminator portrayed by Arnold was reprogrammed and traveled back in time to save John, a young boy, from a more advanced terminator robot, the T-1000, which was sent back in time to kill that same boy.

What happens in that scene is that the more advanced T-1000 has shape-shifted into the boy's foster mother. The foster father does not realize that, and after he says something that annoys the shape-shifted terminator, it terminates him... The "Arnold terminator" is with the boy in the phone booth. The shape-shifting terminator is not aware that Arnold traveled back in time as well. Arnold speaks in the boy's voice and uses a fake name for the boy's dog. If Arnold were talking to John's real foster mother, she would notice the discrepancy. However, the T-1000 apparently does not know the dog's name and is unaware of the trap Arnold has laid for it.

So much for the backstory.

Now we get to the part that is the main topic of this article:

The two terminator robots have a conversation over the phone. The T-1000 pretends to be the boy's foster mother, Arnold pretends to be the boy. How do they try to fool each other? By imitating the corresponding human voice. In fact, they do it so perfectly that nobody can tell whether they are speaking to the real human or to a robot.

The shape-shifting T-1000 is not only capable of taking on the appearance of pretty much any solid object of a certain size, it can also generate all possible human voices. It not only matches the tone of the voice, it also imitates the little peculiarities of how a person speaks. Arnold's terminator is an earlier model and is not capable of shape-shifting, but the voice generation part was already mastered at the time of his creation (or maybe that feature was added via an over-the-air update -- who knows 🤷‍♂️).

The point is: In 1991, when the movie Terminator 2 came out, realistic computer voice generation was just as much science-fiction as shape-shifting.

-- Well, not anymore...

Introducing "Two Minute Papers"

What do the terminators need in order to imitate a person's voice perfectly? A few seconds of hearing that person speak is enough. In that science-fiction movie from 1991.

Let's compare that to real science from 2019.

The result: About 5 seconds of voice recording is enough to imitate a person's voice and way of speaking almost perfectly.

How crazy is that? 1991's science fiction became pretty much a reality 28 years later!

And keep in mind that this result is the worst that computer-generated voice audio will ever be. (It can only get better in the future.) And this paper is almost one year old, so there are probably even more capable AIs available that run circles around this one...

The take-away message is:

No, we cannot trust our ears.

Is there a way to detect manipulated or manufactured voice recordings?

There might be, but it has already become very hard for a human to decide what is real and what is not. Our best bet is probably to use an AI to distinguish between artificial voice output and a real person speaking.

But there is a catch: With every improvement in detection, an improvement in generation usually follows closely. The reason is that you can use a detection AI to "train" (improve) a generating AI. It's a vicious circle. In the future, it will only get harder for humans to decide what is real and what is not... Maybe it will be impossible to tell real and fake voices apart at some point -- just like in the movie.
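
For the tech-savvy readers, here is a minimal Python sketch of that vicious circle. It is a toy under loud assumptions and not the method from the paper: a "voice" is reduced to a single made-up number, the detector is just a threshold placed halfway between the real and the fake voice, and the names `realness` and `train_generator` are my own inventions for illustration. Every round the detector is refit, and then the generator nudges its voice in the direction that the detector rates as more real.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def realness(x: float, real_voice: float, fake_voice: float):
    """Toy detector (hypothetical): how real does sample x sound, from 0 to 1?

    The detector is refit on every call: its threshold sits halfway between
    the real voice and the current fake voice.
    """
    threshold = (real_voice + fake_voice) / 2.0
    direction = 1.0 if real_voice >= fake_voice else -1.0
    return sigmoid(direction * (x - threshold)), direction


def train_generator(real_voice=4.0, fake_voice=0.0, rounds=200, step=0.1):
    """Use the detector's verdict to improve the fake, round after round."""
    scores = []
    for _ in range(rounds):
        # 1) The detector judges the current fake.
        score, direction = realness(fake_voice, real_voice, fake_voice)
        scores.append(score)
        # 2) The generator follows the gradient of log(score), which is
        #    direction * (1 - score), so it moves toward "sounds more real".
        fake_voice += step * direction * (1.0 - score)
    return fake_voice, scores


if __name__ == "__main__":
    final_voice, scores = train_generator()
    print("fake voice after training:", round(final_voice, 3))
    print("detector score, first round:", round(scores[0], 3))
    print("detector score, last round:", round(scores[-1], 3))
```

In this toy setup the fake voice ends up right next to the real one, and the detector's score drifts toward 0.5, which means it is reduced to guessing. That is the same dynamic in miniature: the better the detector, the better the feedback for the faker.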

But you might think

If I see the person speaking in a video, I will be able to tell whether the audio is fake or real, right? If it is fake, the audio won't match the movement of the lips!

Good idea, but have you heard of deepfakes? That is the topic for the next installment of this Internet Gold series... And what we discuss there might shock you. You might want to sit down for that one. But don't worry, not today. I will probably finish writing the article tomorrow. So stay tuned!

What do you think?

Will we ever find a reliable way to check the authenticity of videos or audio recordings? Could blockchain technology help with that (I am thinking of NFTs -- non-fungible tokens)?

Let me know in the comment section!

Speaking of comments

Have you heard about The Comment League which fellow author @Macronald wants to establish starting November 2020? The goal is to increase meaningful discourse and reader interaction by rewarding those readers who write high-quality comments. Check it out! 😊

Bonus Content -- Make your computer speak like you!

If you are tech-savvy, navigate to this GitHub repository where you can download an unofficial implementation of the AI which was demonstrated in the above video. You can install the software by following the steps (it is not super hard, but also not beginner-friendly). Once installed, you can use a short recording of your own voice to train the AI. If everything works out, your computer will speak almost exactly like you. How cool is that?!


Comments

Incredible! Complete bonkers! And what? 5 seconds just, to imitate someone else's voice. Unbelievable! Really can't trust my ears and eyes.

You know what I remember this X factor audition, she sounds so amazing they thought she's using some recorded audio and just lip-synching. Sorry but that's the first thing to come in my mind while watching and reading your post. https://youtu.be/1Eti-sFr2ds

4 years ago

wow...this is a CRAZY good article. you were right! best one yet! perfect for nerds and geeks!

so much good stuff...i'm too scared to try using that AI to imitate me...Lol.

will this go up on P0x?

4 years ago

Thank you very much. 😊👍

I plan to publish all my articles on P also, but I will wait a couple of days before I copy them to give Rusty a chance to appreciate them. 😅

AI is really a super interesting, but also pretty scary topic. I will probably revisit the topic of AI again soon. There is so much to explore in this area. The only question is whether I can make it entertaining enough. I don't want the Internet Gold series to become a lecture series. 😅

4 years ago

hey, rusty disappeared again after visiting?

4 years ago

Yeah, it seems so. I was actually pretty frustrated about it. So frustrated that I wrote an article about Rusty. I will go over it once again before publishing, however. 😅

4 years ago

Ok, Rusty is not on holidays. I wonder why he ghosts me like that sometimes... 🙈

4 years ago

😥

4 years ago

yes, there's a fine balance when creating content...need to know your audience.

good call on the delay...Lol

4 years ago