There's been a lot written about the social implications of deepfakes, but less about how they actually work. Here's a thread about that. Read my article here for the full details. arstechnica.com/science/2019/1…
The goal of a deepfake is to start with a video featuring one person's face, and replace it with a different person's face, while preserving the original face's position, expression, illumination, etc.
The core of most deepfake software today is an autoencoder. That's a neural network that's been trained to take in an image and output an identical image. Training an autoencoder is easy because you know exactly what the output should look like (same as the input).
Of course an autoencoder could work by just directly copying each pixel from input to output, but that wouldn't be interesting. Instead, the autoencoder "squeezes" a face down to a compact representation (a point in what's called a latent space), then expands it again.
Limiting how much information the front end of the network (called the encoder) can pass to the back end (the decoder) forces the network to learn a compact representation for the human face—a way to summarize any position and expression with a few dozen or hundred numbers.
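Here's a minimal sketch of that bottleneck idea in PyTorch. The layer sizes, and the use of plain linear layers rather than the convolutional networks real deepfake tools use, are purely illustrative:

```python
import torch
import torch.nn as nn

IMG_DIM = 64 * 64 * 3   # a flattened 64x64 RGB face (illustrative size)
LATENT_DIM = 256        # the "few hundred numbers" bottleneck

encoder = nn.Sequential(
    nn.Linear(IMG_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, LATENT_DIM),             # squeeze down to the latent space
)
decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, IMG_DIM), nn.Sigmoid(),  # expand back out to an image
)

def reconstruct(image: torch.Tensor) -> torch.Tensor:
    """Autoencode: the output should look identical to the input."""
    return decoder(encoder(image))

# Training just penalizes any difference between input and output.
loss_fn = nn.MSELoss()
batch = torch.rand(8, IMG_DIM)               # stand-in for real face images
loss = loss_fn(reconstruct(batch), batch)
```

Because the 256-number latent vector is far smaller than the image, the network can't memorize pixels; it has to learn a summary.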
OK, how does that get you a deepfake? Well, you train two autoencoders side by side: one that's trained to recognize and reproduce the original face, the other to recognize and reproduce the swap face.
But there's a twist: You use the same encoder for both networks. When you train one network, it simultaneously modifies the encoder (but not the decoder) of the other network.
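In code, the twist is just a training loop where both reconstruction losses flow back through one shared encoder. A sketch, again with hypothetical sizes rather than any real tool's architecture:

```python
import itertools
import torch
import torch.nn as nn

IMG_DIM, LATENT_DIM = 64 * 64 * 3, 256

def make_decoder() -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
        nn.Linear(1024, IMG_DIM), nn.Sigmoid())

encoder = nn.Sequential(                       # one encoder, shared by both
    nn.Linear(IMG_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, LATENT_DIM))
decoder_a, decoder_b = make_decoder(), make_decoder()  # one decoder per person

optimizer = torch.optim.Adam(itertools.chain(
    encoder.parameters(), decoder_a.parameters(), decoder_b.parameters()))
loss_fn = nn.MSELoss()

def train_step(faces_a: torch.Tensor, faces_b: torch.Tensor) -> None:
    # Each decoder reconstructs only its own person, but both losses
    # backpropagate through the single shared encoder.
    optimizer.zero_grad()
    loss_a = loss_fn(decoder_a(encoder(faces_a)), faces_a)
    loss_b = loss_fn(decoder_b(encoder(faces_b)), faces_b)
    (loss_a + loss_b).backward()
    optimizer.step()
```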
The practical effect of this is that you get two networks whose face representations are "compatible." The encoder's latent space representation for (say) "facing to the left, smiling, with eyes open" will be the same whether the input was a person A photo or a person B photo.
And the fact that the two networks use their latent spaces in the same way means that the decoders are also compatible: you can take a latent representation intended for one decoder and feed it to the other decoder instead.
And that's how you do a deepfake. During training, you take images of person A, encode them, and then decode them with the person A decoder to see how well they match. But during deepfaking, you encode a photo of person A, then decode it with the person B decoder.
The person B decoder won't know the difference. It will draw person B using the information in the latent space. So you get an image with person B's physical appearance, but with the pose, expression, and illumination of person A.
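The swap itself is just one line of rewiring. Continuing the sketch above (reusing encoder, decoder_b, and IMG_DIM from it):

```python
import torch

with torch.no_grad():                    # inference only, no training
    frame_a = torch.rand(1, IMG_DIM)     # stand-in for a video frame of person A
    latent = encoder(frame_a)            # captures A's pose, expression, lighting
    fake = decoder_b(latent)             # ...drawn with person B's appearance
```

During training the latent vector would have gone to decoder_a; routing it to decoder_b instead is the whole trick.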
Thanks to @aurich for the illustrations in this thread. Thanks to the Faceswap team for creating open source deepfake software that even a neophyte like me can learn to use in a few hours.