Jonathan Fischoff
Nov 30, 2023 · 11 tweets
“Animate Anyone” was released last night for making pose-guided videos. Let’s dive in.

Paper: arxiv.org/abs/2311.17117
Project: humanaigc.github.io/animate-anyone/
🧵1/
First, some examples, because they are very good. 2/
Okay, so how did they pull this off? They made a bunch of modifications to the AnimateDiff architecture. First, their architecture overview. 3/
You input a picture of a character and then drive it with a sequence of poses.

To make sure the reference image is represented with high fidelity, they CLIP-encode it and send the embeddings to the cross-attention of the denoising U-Net, but that is not enough. 4/
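To make that concrete, here is a minimal sketch (mine, not the authors’ code) of cross-attention where the keys and values come from CLIP image embeddings instead of text embeddings. Module names and dimensions are guesses.

```python
# Sketch: condition a denoising U-Net's cross-attention on CLIP *image*
# embeddings of the reference character. All names/dims are illustrative.
import torch
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    def __init__(self, query_dim=320, context_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=heads,
            kdim=context_dim, vdim=context_dim, batch_first=True)

    def forward(self, unet_feats, image_embeds):
        # unet_feats:   (batch, h*w, query_dim) spatial features from the U-Net
        # image_embeds: (batch, tokens, context_dim) CLIP tokens of the reference
        out, _ = self.attn(unet_feats, image_embeds, image_embeds)
        return out

unet_feats = torch.randn(1, 64 * 64, 320)
clip_image_embeds = torch.randn(1, 257, 768)  # e.g. ViT patch tokens + CLS
print(ImageCrossAttention()(unet_feats, clip_image_embeds).shape)
# torch.Size([1, 4096, 320])
```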
They train an auxiliary “ReferenceNet” copy of SD. This network provides high-fidelity information about the input image to the denoising AnimateDiff version of SD.

Here is how they do it, but I’ll try to break it down. 5/
The self-attention features of the ReferenceNet are concatenated width-wise with the denoising network’s features. Then they perform self-attention, but the output features are too wide, so they chop off the extra width before passing them to the next block.

The perf hit isn’t bad. 6/
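Roughly, I read that spatial-attention trick as something like this sketch (shapes and names are illustrative, not the paper’s implementation):

```python
# Sketch: concatenate ReferenceNet features with the denoising U-Net features
# along the token (width) axis, run ordinary self-attention, then drop the
# extra reference tokens so the output width is unchanged.
import torch
import torch.nn as nn

class ReferenceSpatialAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, denoise_feats, reference_feats):
        # denoise_feats, reference_feats: (batch, h*w, dim)
        n = denoise_feats.shape[1]
        x = torch.cat([denoise_feats, reference_feats], dim=1)  # 2x the tokens
        x, _ = self.attn(x, x, x)                               # joint self-attention
        return x[:, :n]                                         # chop off the reference half

denoise = torch.randn(2, 64 * 64, 320)
reference = torch.randn(2, 64 * 64, 320)
print(ReferenceSpatialAttention()(denoise, reference).shape)
# torch.Size([2, 4096, 320]) -- same width as before
```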
The poses are encoded with a ControlNet-style convolutional network, called the “Pose Guider”, instead of the VAE. Fine, but I’m not sure why that is better than a VAE. Oh well. 7/
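If I had to guess at the shape of such a Pose Guider, it might look like this ControlNet-style conv stack; channel counts, strides, and the additive hookup to the latent are my assumptions:

```python
# Sketch: a small strided-conv encoder that downsamples the pose image to
# latent resolution, with a zero-initialized projection so it starts as a
# no-op (ControlNet-style). Not the paper's exact configuration.
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    def __init__(self, in_channels=3, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.proj = nn.Conv2d(64, latent_channels, 3, padding=1)
        nn.init.zeros_(self.proj.weight)  # zeroed projection, as in ControlNet
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose_image, noisy_latent):
        # pose_image: (b, 3, 512, 512) -> (b, 4, 64, 64), added to the latent
        return noisy_latent + self.proj(self.encoder(pose_image))

pose = torch.randn(1, 3, 512, 512)
latent = torch.randn(1, 4, 64, 64)
print(PoseGuider()(pose, latent).shape)  # torch.Size([1, 4, 64, 64])
```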
Two-stage training. First, initialize both the denoising U-Net and the ReferenceNet from an SD checkpoint. The Pose Guider is Gaussian-initialized with a zeroed projection layer, as is standard for ControlNet.

The goal of stage one is spatial training, i.e. reconstruction of individual frames. 8/
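As a rough skeleton of that two-stage schedule (placeholder module names, not a real API):

```python
# Stage one trains the spatial pieces on single frames; stage two freezes them
# and trains only the temporal attention layers. `unet`, `reference_net`, and
# `pose_guider` are hypothetical placeholders.

def stage_one_params(unet, reference_net, pose_guider):
    # Single-frame reconstruction: no temporal layers involved yet.
    return (
        list(unet.parameters())
        + list(reference_net.parameters())
        + list(pose_guider.parameters())
    )

def stage_two_params(unet):
    # Train only the (AnimateDiff-style) temporal attention blocks.
    return [p for name, p in unet.named_parameters() if "temporal" in name]
```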
Then they train the temporal attention in isolation, as is standard for AnimateDiff training.

During inference, they resize the poses to match the size of the input character. For longer generations, they use a trick from “Editable Dance Generation From Music” (arxiv.org/abs/2211.10658). 9/
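For reference, AnimateDiff-style temporal attention is roughly: fold the spatial positions into the batch and attend over the frame axis, so each pixel location mixes information across time. A toy sketch (dims are illustrative):

```python
# Sketch of temporal self-attention over the frame axis of a video feature map.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)  # tokens = frames
        x, _ = self.attn(x, x, x)                              # attend over time only
        return x.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)

video_feats = torch.randn(1, 320, 24, 32, 32)
print(TemporalAttention()(video_feats).shape)  # torch.Size([1, 320, 24, 32, 32])
```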
I’m still unclear on that final bit about long-duration animations. They are generating more than 24 frames in the examples, and the results are incredibly coherent, so it obviously works well.

Hopefully the code comes out soon, but if not, the paper is detailed enough to reproduce. 10/
Digging into the EDGE paper more, I think they are using in-painting to generate longer animations. I’ve tried an in-painting-like approach with AnimateDiff to make longer animations; it doesn’t work well in general. Maybe it works here because of the extra conditioning 🤷‍♂️ 11/
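Purely as a guess at what such an in-painting-like extension could look like: generate overlapping chunks and pin each new chunk’s leading frames to the previous chunk’s tail. `sample_chunk` here is a hypothetical stand-in for the real sampler, not anything from the paper:

```python
# Speculative sketch of overlapping-window long-video generation.
import torch

def generate_long_video(sample_chunk, poses, chunk_len=24, overlap=8):
    # poses: (total_frames, ...) pose conditioning for the whole clip
    frames = []
    prev_tail = None                      # last `overlap` frames of the previous chunk
    step = chunk_len - overlap
    for start in range(0, poses.shape[0] - overlap, step):
        chunk_poses = poses[start : start + chunk_len]
        # known_frames pins the first `overlap` frames of this chunk to the
        # tail of the previous chunk (the in-painting-like constraint).
        chunk = sample_chunk(chunk_poses, known_frames=prev_tail)
        frames.append(chunk if prev_tail is None else chunk[overlap:])
        prev_tail = chunk[-overlap:]
    return torch.cat(frames, dim=0)

# Usage with a dummy sampler that just returns blank frames:
dummy = lambda p, known_frames=None: torch.zeros(p.shape[0], 3, 512, 512)
video = generate_long_video(dummy, poses=torch.zeros(64, 3, 512, 512))
print(video.shape[0])  # 64 frames
```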
