Jonathan Fischoff
Nov 30 · 11 tweets · 4 min read
“Animate Anyone” was released last night for making pose-guided videos. Let’s dive in.

Paper: arxiv.org/abs/2311.17117
Project: humanaigc.github.io/animate-anyone/
🧵1/
First, some examples, because they are very good. 2/
Okay, so how did they pull this off? They made a bunch of modifications to the AnimateDiff architecture. First, their architecture overview. 3/
You input a picture of a character and then drive it with a sequence of poses.

To make sure the reference image is represented with high fidelity, they CLIP-encode it and send the embeddings to the cross-attention of the denoising U-Net, but that is not enough. 4/
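To make that concrete, here is a minimal sketch of that conditioning path using Hugging Face transformers. The checkpoint name and the choice of pooled vs. patch embeddings are my assumptions, not details from the paper.

```python
# Hedged sketch of the reference-image conditioning path: CLIP-encode the
# character image and hand the embeddings to the U-Net's cross-attention.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Checkpoint is a stand-in; the paper does not say which CLIP they use.
clip_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

def encode_reference(image: Image.Image) -> torch.Tensor:
    """Return CLIP embeddings shaped (batch, tokens, dim) for cross-attention."""
    pixels = clip_processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        out = clip_encoder(pixels)
    # Treat the pooled image embedding as a single token; whether the paper
    # uses pooled or patch tokens is an assumption on my part.
    return out.image_embeds.unsqueeze(1)  # (1, 1, 768)

# These embeddings take the place of the usual text embeddings, e.g.:
# noise_pred = unet(latents, t, encoder_hidden_states=encode_reference(ref_img)).sample
```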
They train an auxiliary “ReferenceNet” version of SD. This network is used to provide high-fidelity information about the input image to the denoising AnimateDiff version of SD.

Here is how they do it, but I’ll try to break it down. 5/
The self-attention features of the ReferenceNet are concatenated width-wise to the denoising network features. Then they perform self-attention, but the output features are too wide, so they chop off the extra width before passing them to the next block.

The performance hit is not bad. 6/
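In tensor terms, the trick looks roughly like this toy sketch (the shapes, dims, and attention module are mine, not the authors’ code):

```python
# Toy sketch of the ReferenceNet injection: concatenate reference features
# along the spatial/token axis, self-attend over the doubled sequence, then
# keep only the first half so the shape matches the next block's input.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)

def spatial_attn_with_reference(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """x, ref: (batch, h*w, channels) features from matching U-Net blocks."""
    joint = torch.cat([x, ref], dim=1)       # width-wise concat -> 2x tokens
    out, _ = attn(joint, joint, joint)       # self-attention sees both halves
    return out[:, : x.shape[1], :]           # chop off the extra (reference) half

b, hw, c = 2, 64 * 64, 320
x, ref = torch.randn(b, hw, c), torch.randn(b, hw, c)
print(spatial_attn_with_reference(x, ref).shape)  # torch.Size([2, 4096, 320])
```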
The poses are encoded using a ControlNet-style convolutional network instead of a VAE, called the “Pose Guider”. Fine, but I’m not sure why that is better than a VAE. Oh well. 7/
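A hedged sketch of what such a pose encoder could look like; the layer counts, channel widths, and strides here are my guesses:

```python
# Hedged sketch of a ControlNet-style pose encoder ("Pose Guider"): a few
# strided convs take the pose image down to latent resolution.
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    def __init__(self, pose_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pose_channels, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Final projection to latent channels, zero-initialized ControlNet-style
        # so the guider starts as a no-op and training stays stable.
        self.proj = nn.Conv2d(64, latent_channels, 3, padding=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        return self.proj(self.net(pose))  # added to the noise latent before the U-Net

pose = torch.randn(1, 3, 512, 512)
print(PoseGuider()(pose).shape)  # torch.Size([1, 4, 64, 64])
```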
Two-stage training. First, initialize both the denoising U-Net and the ReferenceNet from an SD checkpoint. The Pose Guider is Gaussian-initialized with a zeroed projection portion, as is standard for ControlNet.

The goal of stage one is spatial training, i.e. reconstruction of individual frames. 8/
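The initialization part is easy to picture with diffusers; which SD checkpoint they actually start from is not stated here, so v1-5 is just a stand-in:

```python
# Sketch of the stage-one setup: both networks start from the same SD weights;
# stage one then trains on single frames with the usual noise-prediction loss.
from diffusers import UNet2DConditionModel

checkpoint = "runwayml/stable-diffusion-v1-5"  # stand-in; the paper may use another
denoising_unet = UNet2DConditionModel.from_pretrained(checkpoint, subfolder="unet")
reference_net = UNet2DConditionModel.from_pretrained(checkpoint, subfolder="unet")
# Same initialization, but the two copies are trained as separate networks.
# The Pose Guider's final projection starts at zero (see the sketch above),
# so at step zero the model behaves like plain image-conditioned SD.
```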
Then they train the temporal attention in isolation as is standard for AnimateDiff training.

During inference they resize the poses to the size of the input character. For longer generation they use a trick from “Editable Dance Generation From Music” (arxiv.org/abs/2211.10658). 9/
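For the stage-two part, here is a hedged sketch of “train only the temporal attention”: freeze everything whose parameter name doesn’t look temporal. The name filter is an assumption about how the motion modules are named, not something from the paper.

```python
# Hedged sketch of stage-two training: spatial weights frozen, only the
# temporal-attention (motion) layers get gradients, AnimateDiff-style.
import torch

def temporal_parameters(model: torch.nn.Module):
    """Yield only temporal-attention parameters; freeze the rest."""
    for name, param in model.named_parameters():
        is_temporal = "temporal" in name      # naming convention is an assumption
        param.requires_grad = is_temporal
        if is_temporal:
            yield param

# optimizer = torch.optim.AdamW(temporal_parameters(video_unet), lr=1e-4)
```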
I’m still unclear on that final bit about long-duration animations. They are generating more than 24 frames in the examples, and they are incredibly coherent, so it obviously works well.

Hopefully the code comes out soon, but if not, the paper is detailed enough to reproduce. 10/
Digging into the EDGE paper more, I think they are using in-painting to generate longer animations. I’ve tried an in-painting-like approach with AnimateDiff to make longer animations; it doesn’t work well in general. Maybe it works here because of the extra conditioning 🤷‍♂️ 11/
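If that reading is right, “in-painting along time” would look something like generating overlapping windows and holding the shared frames fixed. This is purely my interpretation, and `generate_window` is a placeholder for the sampler:

```python
# One way to read "in-painting for longer animations": generate fixed-length
# windows that overlap, holding the overlapping frames fixed (known) while the
# model fills in the new ones -- spatial in-painting, but along the time axis.
import torch

def extend_with_overlap(generate_window, first_clip: torch.Tensor,
                        num_windows: int, overlap: int = 8) -> torch.Tensor:
    """first_clip: (batch, frames, ...); generate_window is a placeholder for a
    sampler that reproduces `known_frames` at the start of its output."""
    clips = [first_clip]
    for _ in range(num_windows - 1):
        context = clips[-1][:, -overlap:]        # frames carried over, held fixed
        new_clip = generate_window(known_frames=context)
        clips.append(new_clip[:, overlap:])      # drop the duplicated frames
    return torch.cat(clips, dim=1)               # (batch, total_frames, ...)
```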
