Tom Goldstein
May 2 · 7 tweets · 3 min read
It is widely thought that neural networks generalize because of implicit regularization of gradient descent. Today at #ICLR2023 we show new evidence to the contrary. We train with gradient-free optimizers and observe generalization competitive with SGD.
openreview.net/forum?id=QC10R…
An alternative theory of generalization is the "volume hypothesis": good minima are flat and occupy more volume than bad minima. For this reason, optimizers are more likely to land in the large/wide basins around good minima, and less likely to land in the small/sharp basins around bad minima.
One of the optimizers we test is a “guess and check” (GAC) optimizer that samples parameters uniformly from a hypercube and checks whether the loss is low. If so, optimization terminates. If not, it throws away the parameters and samples again until it finds a low loss.
The GAC optimizer is more likely to land in a big basin than in a small basin. In fact, the probability of landing in a region is *exactly* proportional to its volume. The success of GAC shows that volume alone can explain generalization without appealing to optimizer dynamics.
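To make the setup concrete, here's a minimal sketch of a guess-and-check loop (the function names, bounds, and threshold below are illustrative placeholders, not the paper's exact experimental configuration):

```python
import torch

def guess_and_check(loss_fn, n_params, threshold, bound=1.0, max_draws=1_000_000):
    """Guess-and-check (GAC) sketch: sample parameter vectors uniformly
    from the hypercube [-bound, bound]^n and accept the first draw whose
    loss falls below the threshold."""
    for _ in range(max_draws):
        theta = (2 * torch.rand(n_params) - 1) * bound  # uniform on the hypercube
        if loss_fn(theta) < threshold:
            return theta  # the chance of stopping in a basin scales with its volume
    raise RuntimeError("sampling budget exhausted without finding a low-loss point")
```

Because each draw is independent and uniform, the acceptance probability for any low-loss region is exactly its volume divided by the volume of the hypercube.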
One weakness of our experiment is that these zeroth-order optimizers only work with small datasets. Still, I hope it lends further support to the idea that loss landscape geometry plays a role in generalization.

Come see our poster (#87) Tuesday at 11:30am, or our talk (track 4, 10:50am).
PS: Other evidence that SGD regularization is not needed includes the observation that generalization happens with non-stochastic gradient descent (openreview.net/forum?id=ZBESe…), and that bad minima have very small volumes (arxiv.org/abs/1906.03291).

More from @tomgoldsteincs

Mar 13
Here's the real story of #SiliconValleyBank, as told the boring way through tedious analysis of balance sheets and SEC filings 🧵
Throughout 2021 startups were raising money from VCs and stashing it in SVB. Deposits increased from $102B to $189B. That's an 85% increase in one year. Wow!
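A quick check of that growth figure, using only the numbers quoted above:

```python
# Deposit growth over 2021, from the figures in the tweet above.
start, end = 102e9, 189e9
print(f"{(end - start) / start:.0%}")  # -> 85%
```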
Most news sources claim that SVB stashed this money in relatively safe Treasury securities. This is an important point that most of the media got wrong.
forbes.com/sites/billcone…
Feb 27
If you work for a US university, you have probably noticed the rollout of strict new policies mandating disclosures and approvals for funding, consulting, and COIs, and also threats of legal action for non-compliance. Here’s why this is happening now 🧵
Let's start at the beginning. In 2018, the DOJ implemented its new "China Policy." The stated purpose of this program was to combat perceived threats of Chinese espionage operations inside US universities.
fbi.gov/investigate/co…
In practice, the DOJ used the policy to investigate people of Chinese descent, usually without evidence of espionage. Many people were arrested and jailed with no formal charges at all.
reuters.com/world/us/trump…
Feb 8
We rack our brains making prompts for #StableDiffusion and Language Models. But a lot of prompt engineering can be done *automatically* using simple gradient-based optimization. And the cold calculating efficiency of the machine crushes human creativity.
Prompts made easy (PEZ) is a gradient optimizer for text. It can convert images into prompts for Stable Diffusion, or it can learn a hard prompt for an LLM task. The method uses ideas from the binary neural nets literature that mash up continuous and discrete optimization.
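The thread doesn't show code, but the core trick can be sketched in a few lines (a loose paraphrase of the PEZ idea; the function names and straight-through details below are illustrative, not the official implementation):

```python
import torch

def pez_step(soft_prompt, embed_table, loss_fn, lr=0.1):
    """One optimization step in the spirit of PEZ.
    soft_prompt:  (seq_len, dim) continuous embeddings being optimized
    embed_table:  (vocab, dim) frozen token-embedding matrix of the model
    loss_fn:      maps a sequence of embeddings to a scalar task loss"""
    # Project each continuous embedding onto its nearest token embedding.
    token_ids = torch.cdist(soft_prompt, embed_table).argmin(dim=1)
    hard_prompt = embed_table[token_ids].detach().requires_grad_(True)
    # Evaluate the loss at the *projected* (discrete) point...
    loss = loss_fn(hard_prompt)
    loss.backward()
    # ...but apply the gradient to the continuous variables (straight-through).
    with torch.no_grad():
        soft_prompt -= lr * hard_prompt.grad
    return token_ids, loss.item()
```

Evaluating at the discrete projection while updating the continuous copy is the binary-nets trick: the final prompt is always real tokens, yet the search still benefits from gradients.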
PEZ can even create a prompt to represent a face...as the hypothetical offspring of multiple celebrities ¯\_(ツ)_/¯
Jan 25
#OpenAI is planning to stop #ChatGPT users from making social media bots and cheating on homework by "watermarking" outputs. How well could this really work? Here's just 23 words from a 1.3B parameter watermarked LLM. We detected it with 99.999999999994% confidence. Here's how 🧵
This article, and a blog post by Scott Aaronson, suggest that OpenAI will deploy something similar to what I describe. The watermark below can be detected using an open source algorithm with no access to the language model or its API.
businessinsider.com/openai-chatgpt…
Language models generate text one token at a time. Each token is selected from a “vocabulary” with about 50K words. Before each new token is generated, we imprint the watermark by first taking the most recent token and using it to seed a random number generator (RNG).
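Here's a rough sketch of how a token-seeded watermark can bias sampling (the green-list fraction and logit boost below are illustrative values; the actual scheme differs in details such as how the seed is derived):

```python
import torch

GAMMA, DELTA = 0.5, 2.0  # green-list fraction and logit boost (illustrative)

def watermarked_logits(logits, prev_token_id, vocab_size):
    """Seed an RNG with the previous token, use it to pick a pseudo-random
    'green list' of tokens, and boost their logits so sampling favors them."""
    gen = torch.Generator().manual_seed(int(prev_token_id))
    green = torch.randperm(vocab_size, generator=gen)[: int(GAMMA * vocab_size)]
    out = logits.clone()
    out[green] += DELTA
    return out
```

A detector that knows the scheme can re-derive each green list and count how often the text lands on it. Natural text goes green only about half the time, so even a couple dozen mostly-green tokens yields the astronomically high detection confidence quoted above.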
Dec 6, 2022
How many GPUs does it take to run ChatGPT? And how expensive is it for OpenAI? Let’s find out! 🧵🤑
We don’t know the exact architecture of ChatGPT, but OpenAI has said that it is fine-tuned from a variant of GPT-3.5, so it probably has 175B parameters. That's pretty big.
How fast could it run? A 3-billion parameter model can generate a token in about 6 ms on an A100 GPU (using half precision + TensorRT + activation caching). If we scale that up to the size of ChatGPT, it should take about 350 ms for an A100 GPU to print out a single word.
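The arithmetic behind that estimate, assuming (crudely) that per-token latency scales linearly with parameter count:

```python
small_params = 3e9        # 3B-parameter reference model
ms_per_token = 6.0        # ~6 ms/token on one A100 (fp16 + TensorRT + caching)
chatgpt_params = 175e9    # assumed GPT-3.5 size
print(ms_per_token * chatgpt_params / small_params)  # -> 350.0 ms per token
```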
Nov 25, 2022
Neural algorithm synthesis is done by giving models a human-crafted programming language and millions of sample programs. Recently, my lab looked at whether neural networks can synthesize algorithms on their own without these crutches. They can, with the right architecture. 🧵
Here's an algorithmic reasoning problem where standard nets fail. We train a ResNet-18 to solve little 13x13 mazes. It accepts a 2D image of a maze and spits out a 2D image of the solution. The ResNet-18 gets 100% test accuracy on unseen mazes of the same size. But something is wrong…
If we test the same network on a larger maze it totally fails. The network memorized *what* maze solutions look like, but it didn’t learn *how* to solve mazes.

We can make the model synthesize a scalable maze-solving algorithm just by changing its architecture...
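This excerpt doesn't spell out the fix, but a plausible sketch (in the spirit of the lab's weight-tied recurrent "deep thinking" networks, with illustrative layer sizes) reuses one residual block for as many iterations as the maze requires:

```python
import torch.nn as nn

class RecurrentSolver(nn.Module):
    """Weight-tied recurrent net: one residual block is applied repeatedly,
    so test-time compute can grow with problem size without new weights."""
    def __init__(self, width=128):
        super().__init__()
        self.encode = nn.Conv2d(3, width, 3, padding=1)
        self.block = nn.Sequential(  # reused at every iteration
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1),
        )
        self.decode = nn.Conv2d(width, 2, 3, padding=1)

    def forward(self, x, iters=20):
        h = self.encode(x)
        for _ in range(iters):       # bigger mazes: just run more iterations
            h = h + self.block(h)
        return self.decode(h)
```

Because the same block is applied at every step, running extra iterations at test time costs nothing in parameters, which is what lets a scalable algorithm, rather than a memorized mapping, emerge.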