Jonathan Fischoff
Nov 30, 2023 · 11 tweets
“Animate Anyone” was released last night for making pose-guided videos. Let’s dive in.

Paper: arxiv.org/abs/2311.17117
Project: humanaigc.github.io/animate-anyone/
🧵1/
First, some examples, because they are very good. 2/
Okay, so how did they pull this off? They made a bunch of modifications to the AnimateDiff architecture. First, their architecture overview. 3/
You input a picture of a character and then drive it with a sequence of poses.

To make sure the reference image is represented with high fidelity, they CLIP-encode it and send the embeddings to the cross-attention of the denoising U-Net, but that is not enough. 4/
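To make that concrete, here is a minimal sketch (mine, not the authors’ code) of cross-attention where the keys and values come from CLIP image embeddings instead of text embeddings. Module names and dimensions are guesses.

```python
# Sketch: condition a denoising U-Net's cross-attention on CLIP *image*
# embeddings of the reference character. All names/dims are illustrative.
import torch
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    def __init__(self, query_dim=320, context_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=heads,
            kdim=context_dim, vdim=context_dim, batch_first=True)

    def forward(self, unet_feats, image_embeds):
        # unet_feats:   (batch, h*w, query_dim) spatial features from the U-Net
        # image_embeds: (batch, tokens, context_dim) CLIP tokens of the reference
        out, _ = self.attn(unet_feats, image_embeds, image_embeds)
        return out

unet_feats = torch.randn(1, 64 * 64, 320)
clip_image_embeds = torch.randn(1, 257, 768)  # e.g. ViT patch tokens + CLS
print(ImageCrossAttention()(unet_feats, clip_image_embeds).shape)
# torch.Size([1, 4096, 320])
```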
They train an auxiliary “ReferenceNet” copy of SD. This network provides high-fidelity information about the input image to the denoising AnimateDiff version of SD.

Here is how they do it, but I’ll try to break it down. 5/
The self-attention features of the ReferenceNet are concatenated width-wise with the denoising network’s features. Then they perform self-attention, but the output features are too wide, so they chop off the extra width before passing them to the next block.

The perf hit isn’t bad. 6/
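Roughly, I read that spatial-attention trick as something like this sketch (shapes and names are illustrative, not the paper’s implementation):

```python
# Sketch: concatenate ReferenceNet features with the denoising U-Net features
# along the token (width) axis, run ordinary self-attention, then drop the
# extra reference tokens so the output width is unchanged.
import torch
import torch.nn as nn

class ReferenceSpatialAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, denoise_feats, reference_feats):
        # denoise_feats, reference_feats: (batch, h*w, dim)
        n = denoise_feats.shape[1]
        x = torch.cat([denoise_feats, reference_feats], dim=1)  # 2x the tokens
        x, _ = self.attn(x, x, x)                               # joint self-attention
        return x[:, :n]                                         # chop off the reference half

denoise = torch.randn(2, 64 * 64, 320)
reference = torch.randn(2, 64 * 64, 320)
print(ReferenceSpatialAttention()(denoise, reference).shape)
# torch.Size([2, 4096, 320]) -- same width as before
```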
The poses are encoded with a ControlNet-style convolutional network, called the “Pose Guider”, instead of the VAE. Fine, but I’m not sure why that is better than a VAE. Oh well. 7/
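If I had to guess at the shape of such a Pose Guider, it might look like this ControlNet-style conv stack; channel counts, strides, and the additive hookup to the latent are my assumptions:

```python
# Sketch: a small strided-conv encoder that downsamples the pose image to
# latent resolution, with a zero-initialized projection so it starts as a
# no-op (ControlNet-style). Not the paper's exact configuration.
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    def __init__(self, in_channels=3, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.proj = nn.Conv2d(64, latent_channels, 3, padding=1)
        nn.init.zeros_(self.proj.weight)  # zeroed projection, as in ControlNet
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose_image, noisy_latent):
        # pose_image: (b, 3, 512, 512) -> (b, 4, 64, 64), added to the latent
        return noisy_latent + self.proj(self.encoder(pose_image))

pose = torch.randn(1, 3, 512, 512)
latent = torch.randn(1, 4, 64, 64)
print(PoseGuider()(pose, latent).shape)  # torch.Size([1, 4, 64, 64])
```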
Two-stage training. First, initialize both the denoising U-Net and the ReferenceNet from an SD checkpoint. The Pose Guider is Gaussian-initialized with a zeroed projection layer, as is standard for ControlNet.

The goal of stage one is spatial training, i.e. reconstruction of individual frames. 8/
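As a rough skeleton of that two-stage schedule (placeholder module names, not a real API):

```python
# Stage one trains the spatial pieces on single frames; stage two freezes them
# and trains only the temporal attention layers. `unet`, `reference_net`, and
# `pose_guider` are hypothetical placeholders.

def stage_one_params(unet, reference_net, pose_guider):
    # Single-frame reconstruction: no temporal layers involved yet.
    return (
        list(unet.parameters())
        + list(reference_net.parameters())
        + list(pose_guider.parameters())
    )

def stage_two_params(unet):
    # Train only the (AnimateDiff-style) temporal attention blocks.
    return [p for name, p in unet.named_parameters() if "temporal" in name]
```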
Then they train the temporal attention in isolation, as is standard for AnimateDiff training.

During inference, they resize the poses to match the size of the input character. For longer generations, they use a trick from “Editable Dance Generation From Music” (arxiv.org/abs/2211.10658). 9/
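For reference, AnimateDiff-style temporal attention is roughly: fold the spatial positions into the batch and attend over the frame axis, so each pixel location mixes information across time. A toy sketch (dims are illustrative):

```python
# Sketch of temporal self-attention over the frame axis of a video feature map.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)  # tokens = frames
        x, _ = self.attn(x, x, x)                              # attend over time only
        return x.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)

video_feats = torch.randn(1, 320, 24, 32, 32)
print(TemporalAttention()(video_feats).shape)  # torch.Size([1, 320, 24, 32, 32])
```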
I’m still unclear on that final bit about long-duration animations. They are generating more than 24 frames in the examples, and the results are incredibly coherent, so it obviously works well.

Hopefully the code comes out soon, but if not, the paper is detailed enough to reproduce. 10/
Digging into the EDGE paper more, I think they are using in-painting to generate longer animations. I’ve tried an in-painting-like approach with AnimateDiff to make longer animations; it doesn’t work well in general. Maybe it works here because of the extra conditioning 🤷‍♂️ 11/
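Purely as a guess at what such an in-painting-like extension could look like: generate overlapping chunks and pin each new chunk’s leading frames to the previous chunk’s tail. `sample_chunk` here is a hypothetical stand-in for the real sampler, not anything from the paper:

```python
# Speculative sketch of overlapping-window long-video generation.
import torch

def generate_long_video(sample_chunk, poses, chunk_len=24, overlap=8):
    # poses: (total_frames, ...) pose conditioning for the whole clip
    frames = []
    prev_tail = None                      # last `overlap` frames of the previous chunk
    step = chunk_len - overlap
    for start in range(0, poses.shape[0] - overlap, step):
        chunk_poses = poses[start : start + chunk_len]
        # known_frames pins the first `overlap` frames of this chunk to the
        # tail of the previous chunk (the in-painting-like constraint).
        chunk = sample_chunk(chunk_poses, known_frames=prev_tail)
        frames.append(chunk if prev_tail is None else chunk[overlap:])
        prev_tail = chunk[-overlap:]
    return torch.cat(frames, dim=0)

# Usage with a dummy sampler that just returns blank frames:
dummy = lambda p, known_frames=None: torch.zeros(p.shape[0], 3, 512, 512)
video = generate_long_video(dummy, poses=torch.zeros(64, 3, 512, 512))
print(video.shape[0])  # 64 frames
```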
