It is widely thought that neural networks generalize because of implicit regularization of gradient descent. Today at #ICLR2023 we show new evidence to the contrary. We train with gradient-free optimizers and observe generalization competitive with SGD. openreview.net/forum?id=QC10R…
An alternative theory of generalization is the "volume hypothesis": Good minima are flat, and occupy more volume than bad minima. For this reason, optimizers are more likely to land in the large/wide basins around good minima, and less likely to land in small/sharp bad minima.
One of the optimizers we test is a “guess and check” (GAC) optimizer that samples parameters uniformly from a hypercube and checks whether the loss is low. If so, optimization terminates. If not, it throws away the parameters and samples again until it finds a low loss.
The GAC optimizer is more likely to land in a big basin than in a small basin. In fact, the probability of landing in a region is *exactly* proportional to its volume. The success of GAC shows that volume alone can explain generalization without appealing to optimizer dynamics.
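A minimal sketch of what guess-and-check looks like (the hypercube bound, loss threshold, and names below are illustrative, not the paper's exact setup):

```python
import numpy as np

def guess_and_check(loss_fn, dim, bound=1.0, target_loss=0.1, max_tries=1_000_000):
    """Sample parameters uniformly from a hypercube; accept only if the loss is low."""
    rng = np.random.default_rng()
    for _ in range(max_tries):
        theta = rng.uniform(-bound, bound, size=dim)  # "guess": uniform draw from the hypercube
        if loss_fn(theta) <= target_loss:             # "check": is the training loss low enough?
            return theta                              # landed in a low-loss basin
    return None                                       # gave up (illustrative cutoff)
```

Because every point in the hypercube is equally likely, the chance of stopping inside a given basin is exactly its volume divided by the volume of the cube.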
One weakness of our experiment is that these zeroth-order optimizers only work on small datasets. Still, I hope it strengthens the case that loss landscape geometry plays a central role in generalization.
Come see our poster (#87) Tuesday at 11:30am, or our talk (track 4, 10:50am).
PS: Other evidence that SGD regularization is not needed includes the observation that generalization happens with non-stochastic gradient descent (openreview.net/forum?id=ZBESe…), and that bad minima have very small volumes (arxiv.org/abs/1906.03291).
Here's the real story of #SiliconValleyBank, as told the boring way through tedious analysis of balance sheets and SEC filings 🧵
Throughout 2021 startups were raising money from VCs and stashing it in SVB. Deposits increased from $102B to $189B. That's an 85% increase in one year. Wow!
Most news sources claim that SVB stashed this money in relatively safe Treasury securities. That's an important detail they got wrong. forbes.com/sites/billcone…
If you work for a US university, you have probably noticed the rollout of strict new policies mandating disclosures and approvals for funding, consulting, and conflicts of interest (COIs), along with threats of legal action for non-compliance. Here’s why this is happening now 🧵
Let's start at the beginning. In 2018, the DOJ launched its “China Initiative.” The stated purpose of this program was to combat perceived Chinese espionage operations inside US universities. fbi.gov/investigate/co…
In practice, the DOJ used the policy to investigate people of Chinese descent, usually without evidence of espionage. Many people were arrested and jailed with no formal charges at all. reuters.com/world/us/trump…
We rack our brains making prompts for #StableDiffusion and Language Models. But a lot of prompt engineering can be done *automatically* using simple gradient-based optimization. And the cold calculating efficiency of the machine crushes human creativity.
Prompts Made Easy (PEZ) is a gradient optimizer for text. It can convert images into prompts for Stable Diffusion, or learn a hard prompt for an LLM task. The method uses ideas from the binary neural nets literature that mash up continuous and discrete optimization.
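Here's a rough sketch of the core idea behind this kind of hard-prompt optimization (a projected, straight-through-style update; the function names and hyperparameters are illustrative, not the exact PEZ implementation):

```python
import torch

def hard_prompt_step(soft_embeds, embed_matrix, loss_fn, lr=0.1):
    """One update: project soft embeddings onto the nearest token embeddings,
    compute the gradient at the projected (discrete) point, then apply that
    gradient back to the continuous soft embeddings."""
    # Project each soft embedding to its nearest neighbor in the vocabulary.
    dists = torch.cdist(soft_embeds, embed_matrix)            # [seq_len, vocab_size]
    token_ids = dists.argmin(dim=-1)                          # discrete prompt tokens
    hard_embeds = embed_matrix[token_ids].detach().requires_grad_(True)

    loss = loss_fn(hard_embeds)                               # e.g. a CLIP or LM task loss
    loss.backward()

    # Straight-through style: gradient from the hard point, step on the soft point.
    with torch.no_grad():
        soft_embeds -= lr * hard_embeds.grad
    return token_ids, loss.item()
```

At the end of optimization, the final projection gives a prompt made of real tokens that a human can read and reuse.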
PEZ can even create a prompt to represent a face...as the hypothetical offspring of multiple celebrities ¯\_(ツ)_/¯
#OpenAI is planning to stop #ChatGPT users from making social media bots and cheating on homework by "watermarking" outputs. How well could this really work? Here's just 23 words from a 1.3B parameter watermarked LLM. We detected it with 99.999999999994% confidence. Here's how 🧵
This article, and a blog post by Scott Aaronson, suggest that OpenAI will deploy something similar to what I describe. The watermark below can be detected using an open source algorithm with no access to the language model or its API. businessinsider.com/openai-chatgpt…
Language models generate text one token at a time. Each token is selected from a “vocabulary” with about 50K words. Before each new token is generated, we imprint the watermark by first taking the most recent token and using it to seed a random number generator (RNG).
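Here's a minimal sketch of how a greenlist-style watermark of this kind can work end to end (the gamma/delta values, the GPT-2-sized vocabulary, and the plain z-test below are illustrative assumptions, not necessarily the released implementation):

```python
import torch

def watermark_logits(logits, prev_token_id, gamma=0.5, delta=2.0):
    """Bias sampling toward a pseudo-random "greenlist": the previous token
    seeds an RNG, the RNG picks a green subset of the vocabulary, and the
    green logits get a small boost delta."""
    vocab_size = logits.shape[-1]
    g = torch.Generator().manual_seed(int(prev_token_id))      # seed RNG with the last token
    perm = torch.randperm(vocab_size, generator=g)
    greenlist = perm[: int(gamma * vocab_size)]                 # pseudo-random green subset
    logits[greenlist] += delta                                  # nudge sampling toward green
    return logits

def detect(token_ids, vocab_size=50257, gamma=0.5):
    """Count how many tokens are green; a one-proportion z-test flags the watermark.
    Needs only the tokenizer and the seeding rule, not the model itself."""
    hits = 0
    for prev, tok in zip(token_ids[:-1], token_ids[1:]):
        g = torch.Generator().manual_seed(int(prev))
        perm = torch.randperm(vocab_size, generator=g)
        if tok in perm[: int(gamma * vocab_size)]:
            hits += 1
    n = len(token_ids) - 1
    return (hits - gamma * n) / (n * gamma * (1 - gamma)) ** 0.5
```

Unwatermarked text hits the greenlist about half the time, so even a short watermarked passage produces a z-score far outside what chance can explain.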
How many GPUs does it take to run ChatGPT? And how expensive is it for OpenAI? Let’s find out! 🧵🤑
We don’t know the exact architecture of ChatGPT, but OpenAI has said that it is fine-tuned from a variant of GPT-3.5, so it probably has 175B parameters. That's pretty big.
How fast could it run? A 3-billion parameter model can generate a token in about 6ms on an A100 GPU (using half precision + TensorRT + activation caching). If we scale that up to the size of ChatGPT, it should take about 350ms for an A100 GPU to print out a single word.
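The scaling arithmetic behind that estimate (a rough sketch that assumes per-token latency grows linearly with parameter count):

```python
# Back-of-the-envelope check: scale measured latency by the ratio of model sizes.
small_params, small_ms_per_token = 3e9, 6                    # ~6ms/token for a 3B model on one A100
chatgpt_params = 175e9                                       # assumed GPT-3.5-scale model
print(small_ms_per_token * chatgpt_params / small_params)    # ≈ 350 ms per token
```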
Neural algorithm synthesis is typically done by giving models a human-crafted programming language and millions of sample programs. Recently, my lab looked at whether neural networks can synthesize algorithms on their own without these crutches. They can, with the right architecture. 🧵
Here's an algorithmic reasoning problem where standard nets fail. We train resnet18 to solve little 13x13 mazes. It accepts a 2D image of a maze and spits out a 2D image of the solution. Resnet18 gets 100% test acc on unseen mazes of the same size. But something is wrong…
If we test the same network on a larger maze it totally fails. The network memorized *what* maze solutions look like, but it didn’t learn *how* to solve mazes.
We can make the model synthesize a scalable maze-solving algorithm just by changing its architecture...
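For a flavor of what that kind of architectural change can look like, here's a sketch of a weight-tied recurrent conv net that can simply run for more iterations on bigger mazes (the channel counts, layer choices, and iteration budget are illustrative, not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class RecurrentSolver(nn.Module):
    """One residual block applied repeatedly with shared weights, so a harder
    or larger maze can get more iterations at test time."""
    def __init__(self, channels=64):
        super().__init__()
        self.encode = nn.Conv2d(3, channels, 3, padding=1)
        self.block = nn.Sequential(                            # the *same* weights every iteration
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.decode = nn.Conv2d(channels, 2, 3, padding=1)     # per-pixel on/off-path logits

    def forward(self, maze, iterations=20):
        h = self.encode(maze)
        for _ in range(iterations):                            # dial up iterations for bigger mazes
            h = h + self.block(h)
        return self.decode(h)
```

Because the network is fully convolutional and reuses one block, nothing ties it to a 13x13 input; scaling to a bigger maze just means encoding a bigger image and letting it iterate longer.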