We have a new paper on learned optimizers! We used thousands of tasks (and a lot of compute 😬) to train general purpose learned optimizers that perform well on never-before-seen tasks, and can even train new versions of themselves. arxiv.org/abs/2009.11243
1/8
In the same way learned features took over computer vision, we believe ML algorithms will be replaced with learned components.
We shift away from hand-designed optimizers (SGD, Adam) to learned optimizers parameterized by neural nets and trained to optimize neural nets.
2/8
We explore a new learned optimizer architecture: a hierarchical LSTM. It has access to both training loss and validation loss of the target task, which allows for dynamic regularization. 3/8
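To make the idea concrete, here is a minimal NumPy sketch of a learned update rule that sees the gradient plus the training and validation losses. This is not the paper's hierarchical LSTM; the feature set, shapes, and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learned-optimizer step (NOT the paper's hierarchical LSTM): a tiny
# network maps per-parameter features (gradient, momentum, and the
# broadcast train/validation losses) to a weight update. All shapes and
# constants here are assumptions for illustration.

def learned_update(theta, grad, momentum, train_loss, valid_loss, W1, W2):
    feats = np.stack([
        grad,
        momentum,
        np.full_like(grad, train_loss),
        np.full_like(grad, valid_loss),   # seeing the validation loss is what
    ], axis=-1)                           # lets the optimizer regularize dynamically
    hidden = np.tanh(feats @ W1)          # (num_params, 16)
    step = hidden @ W2                    # (num_params, 1)
    return theta - 0.01 * step[..., 0]

# In the paper these weights are meta-trained across thousands of tasks;
# here they are just random.
W1 = rng.normal(scale=0.1, size=(4, 16))
W2 = rng.normal(scale=0.1, size=(16, 1))

theta = rng.normal(size=100)
grad = rng.normal(size=100)
theta = learned_update(theta, grad, 0.9 * grad,
                       train_loss=0.7, valid_loss=0.9, W1=W1, W2=W2)
```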
We find the number of tasks we train the learned optimizer on to be critical. More tasks lead to better optimizers, and we ultimately train on a dataset of ~6k tasks. 4/8
The resulting learned optimizer, which requires no hyperparameter tuning, outperforms modestly tuned hand-designed methods on the majority of our tasks. 5/8
On larger-scale tasks, these optimizers have comparable performance to learning-rate-tuned Adam/momentum despite never seeing similar tasks at outer-training time. For example, below is a small ResNet on CIFAR-10. 6/8
In my favorite experiment, we show how general these methods are by using them to train new versions of themselves!
(This is similar to self-hosting compilers -- compilers which are written in the language that they compile.) 7/8
Tired of having to manually tune optimizers? We’re excited to release VeLO, the first hparam-free, super versatile learned optimizer that outperforms hand-designed optimizers on real world problems. It was trained on thousands of TPU months of compute. 1/N arxiv.org/abs/2211.09760
VeLO is a learned optimizer. Instead of designing an update rule by hand as is commonly done (e.g. Adam, SGD), VeLO is a tiny neural network that takes in gradient values and outputs weight updates.
To train the weights of VeLO, we apply it to around 17 billion different small-scale tasks, using (approximate) gradient descent to find the weights with the lowest loss across all these tasks. This takes around 40 days with as much compute as we could get our hands on, scattered across the globe.
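Here is a toy meta-training loop that shows the shape of this procedure. It is only an illustration: the "learned optimizer" is reduced to a single gain on the gradient, the tasks are random quadratics, and the ES hyperparameters (sigma, meta_lr, pop) are assumed values; an evolution-strategies-style estimate stands in for the approximate gradient descent mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # A random quadratic task: loss(x) = 0.5 * a * x**2 with curvature a.
    return rng.uniform(0.5, 2.0)

def unroll(gain, a, steps=20):
    # Apply the "learned" update rule for `steps` iterations; return final loss.
    x = 1.0
    for _ in range(steps):
        grad = a * x
        x = x - gain * grad
    return 0.5 * a * x ** 2

gain = 0.01                            # the meta-parameter being learned
sigma, meta_lr, pop = 0.05, 0.5, 32    # assumed ES hyperparameters

for _ in range(200):
    a = sample_task()
    # Evolution-strategies-style estimate of d(meta-loss)/d(gain).
    eps = rng.normal(size=pop)
    losses = np.array([unroll(gain + sigma * e, a) for e in eps])
    meta_grad = (losses * eps).mean() / sigma
    gain = float(np.clip(gain - meta_lr * meta_grad, 0.0, 0.9))  # keep the toy stable

print("meta-trained gain:", gain)
```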
DL researchers often compute derivatives through just about everything (physics simulators, optimization procedures, renderers). Sometimes these gradients are useful; other times they are not.
We explore why.
1/7
We show that when computing a gradient through an iterative system, we need to compute terms consisting of a product of state-transition Jacobians. This product is what causes issues.
If the Jacobians' eigenvalues are > 1, gradients explode; if < 1, gradients vanish 😱
2/7
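A tiny self-contained illustration of that Jacobian product, differentiating through an unrolled linear system with JAX (an assumed toy example, not one of the paper's systems):

```python
import jax
import jax.numpy as jnp

# Unrolled iterative system x_{t+1} = a * x_t. The gradient of the final
# state w.r.t. the initial state is the product of the per-step Jacobians
# (each equal to a), so it is a**steps: it explodes for |a| > 1 and
# vanishes for |a| < 1.

def final_state(x0, a, steps=50):
    x = x0
    for _ in range(steps):
        x = a * x                 # state transition; Jacobian = a
    return x

grad_fn = jax.grad(final_state)   # gradient w.r.t. the initial state x0

print(grad_fn(1.0, 1.1))  # ~1.1**50  ≈ 117    -> exploding
print(grad_fn(1.0, 0.9))  # ~0.9**50  ≈ 0.005  -> vanishing
```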
We demonstrate exploding gradients in physics simulation, molecular dynamics, and learned optimization.
In the absence of noise, the loss surface can have high curvature, causing large gradients. While averaging over noise smooths the loss, the gradient variance still grows exponentially.
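A worked toy calculation of that variance claim (an assumed setup, not the paper's experiments): for a noisy linear unroll x_{t+1} = a_t * x_t with independent a_t ~ N(1, sigma), the sampled gradient d x_T / d x_0 is the product of the a_t. Its expectation is exactly 1 for every T, so the averaged problem looks benign, but its variance grows exponentially with the unroll length.

```python
# Exact moments of the product-of-Jacobians gradient in the toy model above.
sigma = 0.2
for T in (10, 100, 200):
    mean_grad = 1.0 ** T                       # E[prod a_t] = prod E[a_t] = 1
    var_grad = (1.0 + sigma ** 2) ** T - 1.0   # E[a^2]^T - E[a]^(2T)
    print(f"T={T:3d}  E[grad] = {mean_grad:.1f}   Var[grad] = {var_grad:10.1f}")
```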
Excited to share our new work! We introduce a dataset of tasks for learned optimizer research. As an example application of this dataset, we meta-train lists of optimizer hyperparameters that work well on a diverse set of tasks. arxiv.org/abs/2002.11887 1/4
We are releasing these lists of optimizer hyperparameters in TF, PyTorch, and Jax as a drop-in replacement for existing optimizers. Give it a try and let us know how it goes! github.com/google-researc… 2/4
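To sketch the idea of using such a list (this is not the released package's API, and the configurations below are made up): try entries from a meta-learned, ordered list of hyperparameter settings one at a time on your task and keep the best.

```python
import numpy as np

# Hypothetical meta-learned list, ordered so earlier entries tend to work
# well on more tasks. The values are placeholders, not the released ones.
HPARAM_LIST = [
    {"lr": 3e-3, "beta1": 0.9,  "beta2": 0.999},
    {"lr": 1e-3, "beta1": 0.9,  "beta2": 0.99},
    {"lr": 1e-2, "beta1": 0.95, "beta2": 0.999},
]

def train_with(hparams, steps=200):
    """Toy training run: Adam on a quadratic loss, returns the final loss."""
    rng = np.random.default_rng(0)
    theta = rng.normal(size=10)
    m = v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        grad = theta                                  # loss = 0.5 * ||theta||^2
        m = hparams["beta1"] * m + (1 - hparams["beta1"]) * grad
        v = hparams["beta2"] * v + (1 - hparams["beta2"]) * grad ** 2
        m_hat = m / (1 - hparams["beta1"] ** t)
        v_hat = v / (1 - hparams["beta2"] ** t)
        theta -= hparams["lr"] * m_hat / (np.sqrt(v_hat) + 1e-8)
    return 0.5 * np.sum(theta ** 2)

best = min(HPARAM_LIST, key=train_with)
print("best config:", best)
```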
Finally, we have open sourced the code for the tasks as well as learning curves for ~29 million models trained with different optimizers and hyperparameters. github.com/google-researc… 3/4