Zach Mueller
Sep 2, 2022
You may know that @huggingface Accelerate has big-model inference capabilities, but how does that work?

With the help of #manim, let's dig in!

Step 1:
Load an empty model into memory using @PyTorch's `meta` device, so it uses a *super* tiny amount of RAM
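A minimal sketch of what step 1 looks like in code, assuming a Transformers model purely for illustration (`init_empty_weights` is the Accelerate helper that builds the module on the `meta` device):

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build only the model's skeleton: parameters get shapes and dtypes,
# but no storage is allocated, so almost no RAM is used.
config = AutoConfig.from_pretrained("gpt2")  # model choice is just an example
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta
```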
Step 2:
Load a single copy of the model's weights into memory
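In code, step 2 is just reading the checkpoint into a state dict (the path here is hypothetical):

```python
import torch

# One full copy of the weights now lives in CPU RAM as an ordinary
# dict of tensors, keyed by parameter name.
state_dict = torch.load("checkpoint/pytorch_model.bin", map_location="cpu")
```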
Step 3:
Based on the `device_map`, for each group of parameters either store the checkpoint weights on disk using @numpy or move them to their assigned device, then reset our memory
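Accelerate can also work out the `device_map` for you from the empty model and the memory you give it; anything that doesn't fit on a device gets earmarked for CPU or disk offload. A sketch, with illustrative memory limits:

```python
from accelerate import infer_auto_device_map

# Map each group of parameters to a device given the memory available;
# groups that don't fit are assigned to "cpu" or "disk" for offloading.
device_map = infer_auto_device_map(model, max_memory={0: "6GiB", "cpu": "12GiB"})
print(device_map)  # e.g. {"transformer.wte": 0, ..., "lm_head": "cpu"}
```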
Step 4:
Load a shard of the offloaded weights onto the CPU and into the original empty model from step 1, and add hooks that change device placements at runtime
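These steps are wrapped up by Accelerate's `load_checkpoint_and_dispatch`, which loads the (possibly sharded) checkpoint into the empty model and attaches the hooks; the paths here are hypothetical:

```python
from accelerate import load_checkpoint_and_dispatch

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="checkpoint/",    # directory holding the (sharded) weights
    device_map=device_map,       # where each group of parameters should live
    offload_folder="offload/",   # destination for any disk-offloaded weights
)
```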
Step 5:
Pass an input through the model, and the weights will automatically be moved from CPU -> GPU and back again for each layer.
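From the user's side, step 5 looks like any other forward pass; the hooks from step 4 shuttle each layer's weights on and off the GPU behind the scenes. Continuing the sketch above (the tokenizer pairing is illustrative, and a GPU at index 0 is assumed):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)

# Each layer's weights are moved CPU -> GPU right before it runs, then
# moved back, so the whole model never sits on the GPU at once.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```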

You're done!
Now here's the entire process (sped up slightly)
If you're interested in learning more about Accelerate or enjoyed this tutorial, be sure to read the full tutorial (with the complete animation) in the documentation!

huggingface.co/docs/accelerat…
This was completely inspired by @_ScottCondron. manim certainly has a learning curve, but I think it came out pretty okay :)
For those who want to see the @manim_community code, it's live here: github.com/huggingface/ac…

More from @TheZachMueller

Aug 28, 2023
Excited to announce a new @huggingface space to help with one of machine learning's biggest questions:

How much vRAM does model {X} take? And most importantly: how much when using `device_map="auto"`?

huggingface.co/spaces/hf-acce…
This space uses Accelerate's big model inference machinery and `device_map="auto"` to load in a model and measure its skeleton, helping you estimate the largest layer and the total size of the model you want to load into memory! (and only using a tiny amount of RAM)
The total size and largest layer are estimated to within a few percent of the actual model size; for example, bert-base-cased is estimated to take up 413.18 MB of vRAM, and it actually takes 413.68 MB.
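The idea is roughly the following (a sketch of the approach, not the Space's actual code): build the model skeleton on the meta device and total up the parameter sizes, all without downloading the real weights.

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModel

# Skeleton only: shapes and dtypes are known, but no weights are loaded.
config = AutoConfig.from_pretrained("bert-base-cased")
with init_empty_weights():
    model = AutoModel.from_config(config)

total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"~{total_bytes / 2**20:.2f} MB")  # lands close to the ~413 MB quoted above
```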
Oct 13, 2022
Today marks an extremely exciting day for fans of #nbdev: I'm releasing a new project, "nbdev-extensions"! This PyPI package will contain features that I and others have thought of and that I've brought to life in the nbdev framework for everyone to try!

muellerzr.github.io/nbdev-extensio…

1/5
The first extension is a `new_nb` command. This will quickly generate a new blank template notebook for you to immediately dive into as you're exploring nbdev, and is fully configurable for how your notebook's content should be:

2/5
The second (and my favorite) extension is a new note annotation tool I'm calling "Code Notes". Take a code cell above, and in markdown cells below you write notes on particular sections of that code. The documentation will reflect these notes in a beautiful table:

3/5
Jul 6, 2022
New article on #python decorators is out! Specifically, it shows you how decorators are written, what they do, and what you can do with them. I even show an example of when you'd use the strange `nonlocal` 1/3
muellerzr.github.io/fastblog/pytho…
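As a taste of the sort of thing the article walks through, here's a minimal sketch (not taken from the article itself) of a decorator that needs `nonlocal` to update state captured in its closure:

```python
import functools

def count_calls(func):
    calls = 0

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        nonlocal calls          # rebind the enclosing `calls`, not a global
        calls += 1
        print(f"{func.__name__} has been called {calls} time(s)")
        return func(*args, **kwargs)

    return wrapper

@count_calls
def greet(name):
    return f"Hello, {name}!"

greet("Zach")   # greet has been called 1 time(s)
greet("world")  # greet has been called 2 time(s)
```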
Context manager sequel should be out in the next few days. This one will take a bit longer because in some cases decorators are context managers, and they also have a few more rules so it'll take some time for me to get that how I want it :) 2/3
The other aim with these two is to give you easy-to-view boilerplate examples of decorators and context managers to play with, and explain how they work.

Why? Because I've been wanting those for many months now, and could really use them myself for reference 3/3
Jul 5, 2022
Listened to everyone's feedback on the new `no_sync` wrapper in @huggingface's Accelerate and I took it to heart.

Here's our new gradient accumulation context manager, available in Accelerate dev now! A thread on design choices and the struggles 1/4🧵
The goal with Accelerate is to abstract away as little as we possibly can while letting you perform what you want on any training device (CPU, multi-GPU, etc). As a result, it came to a decision of "how can we simplify gradient accumulation, without hiding anything?" 2/4
A compromise was found: instead, we focus on removing the duplicated code that gradient accumulation would otherwise require, and we also handle the loss scaling for you. It doesn't reduce the clarity of the code, and lets it stay consistent across platforms 3/4
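The resulting usage looks roughly like this (a minimal sketch assuming `model`, `optimizer`, `dataloader`, and `loss_fn` are already defined):

```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    # Inside this context, gradient syncing and the scaling needed for
    # accumulation are handled for you; the loop body stays identical
    # to a no-accumulation training loop.
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```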
May 19, 2022
A few tips and tricks I learned about @Docker today and keeping image sizes small 🧵
Use a multi-stage build to keep the resulting image lightweight: pre-compile all of the installs in one stage, then bring only those installed files into the final image. I could save 500 MB+ in some cases by doing this
The second trick I learned (which should be an obvious one!) is to install the exact torch wheel for the hardware you're using. For example, if you're running on CPU but don't specify the CPU wheel, your Docker image can be 2 GB when in reality it only needs to be 800 MB or so!
Feb 6, 2022
Tonight we're talking about @fastdotai's `tabular_learner`, and more specifically the TabularModel 🧵
The role of the `tabular_learner` is mostly to build a `TabularModel` for your data. This tabular model is a series of embedding matrices and some batch normalization, followed by a few rounds of LinBnDrop, as shown below 2/
What makes this model different from all other models that @fastdotai has is that it splits our inputs into **two** separate groups, the categorical and continuous, meaning the model expects a tuple:

3/
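A minimal sketch, with made-up column counts and sizes, of how that tuple-style input looks when calling the model directly:

```python
import torch
from fastai.tabular.all import TabularModel

# (cardinality, embedding size) for each categorical column
emb_szs = [(10, 5), (7, 4)]
model = TabularModel(emb_szs, n_cont=3, out_sz=2, layers=[200, 100])

x_cat = torch.randint(0, 7, (64, 2))   # batch of 64, two categorical columns
x_cont = torch.randn(64, 3)            # three continuous columns
preds = model(x_cat, x_cont)           # categorical and continuous passed separately -> (64, 2)
```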