Excited to go to the NeurIPS conference tomorrow! It's the annual gala of AI. Many revolutionary ideas debuted here, like AlexNet, the Transformer, & GPT-3.
I read *all 15* Outstanding Papers and can’t wait to share my thoughts with you all. Here's your front-row seat to the festival: 🧵
For each paper, I’ll give a TLDR and a note on why I think it’s significant. I may also link any interesting blogs and websites that dive in with greater depth. Original authors are welcome to chime in and expand the discussion or correct any mistakes! Tweet index is by paper.
Training Compute-Optimal Large Language Models. Hoffmann et al, @DeepMind. TLDR: introduces a new 70B LM called “Chinchilla”🐭 that outperforms much bigger LMs (GPT-3, Gopher). To be compute-optimal, model size and training data must be scaled in equal proportion. 1.1/
Chinchilla’s discoveries are profound. It shows that most LLMs are severely starved of data and under-trained. Given the new scaling law, even if you pump a quadrillion parameters into a model (GPT-4 urban myth), the gains will not compensate for 4x more training tokens 😱 1.2/
Is this why OpenAI created the “Whisper” speech recognition system, so they can feed GPT-4 with another trillion text tokens harvested from YouTube audio? I guess we’ll find out soon!
Chinchilla paper: openreview.net/forum?id=iBBcR…
Fantastic blog post: lesswrong.com/posts/6Fpvch8R…
1.3/
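The scaling arithmetic behind 1.2 is easy to sketch. A back-of-envelope reading (the C ≈ 6·N·D FLOPs rule and the roughly 20-tokens-per-parameter optimum are rounded rules of thumb, not exact constants from the paper):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split under C = 6*N*D with D = k*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget: ~70B params trained on ~1.4T tokens
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n/1e9:.0f}B params, {d/1e12:.1f}T tokens")  # 70B params, 1.4T tokens
```

The square root is the whole story: with a fixed compute budget, parameters and tokens should grow together, so a model 10x bigger also wants roughly 10x more data.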
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Saharia et al, @GoogleAI. TLDR: “Imagen” is a large text-to-image and super-resolution diffusion model that generates beautiful photorealistic images. Beats DALLE-2 (May 2022) in human ratings. 2.1/
The biggest advancement over DALLE-2 is the use of a much stronger text encoder (T5-XXL) trained on an enormous text corpus. Though DALLE-2’s CLIP text encoder is pixel-aware, it is not as good as T5 at language understanding, which gives Imagen better image-text alignment. 2.2/
I’m still looking forward to a public portal to play with Imagen myself!
Paper: openreview.net/forum?id=08Yk-… Website with lots of fancy images: imagen.research.google
2.3/
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. Deitke et al, @allen_ai. TLDR: ProcTHOR is a simulator that procedurally generates a large variety of interactive, customizable, and physics-enabled houses for training embodied agents. Huge open asset library! 3.1/
Like Chinchilla, embodied agent research also needs a ton of diverse data to scale. An agent generates its own experience data via interaction & exploration, so its abilities are upper-bounded by the simulator's complexity. ProcTHOR offers a scalable way to enrich that experience. 3.2/
@allen_ai created a number of sims for household robotics before, such as AI2THOR and ManipulaTHOR. I also want to highlight the BEHAVIOR project from @drfeifei’s lab, an initiative to massively scale up the number of household tasks; and the Habitat platform from @MetaAI 3.3/
Paper: openreview.net/forum?id=4-bV1…
Website with cool interactive demos: procthor.allenai.org
3.4/
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. Fan et al, @nvidia. I’m the lead author of this paper. We propose a 3-ingredient recipe for building “embodied GPT-3” that can perceive and act in an infinite world. Intro thread👇 4.1/
We open-source *everything*: simulation suite, database, algorithm code, pretrained models, and even annotation tools!
Website: minedojo.org
Paper: neurips.cc/virtual/2022/p…
Arxiv: arxiv.org/abs/2206.08853
Code, models, tools: github.com/MineDojo
4.2/
LAION-5B: An open large-scale dataset for training next generation image-text models. Schuhmann et al, @laion_ai. TLDR: an open dataset of 5.85 billion CLIP-filtered image-text pairs to democratize multimodal foundation models. It’s by far the largest such public dataset! 5.1/
Yes, this is the dataset that gave birth to Stable Diffusion @StabilityAI! SD is an intelligent compressor of all human art, and a magic lamp that wakes the yearning artist inside all of us. SD recently hit V2 and got smarter thanks to LAION-5B! Thread from @hardmaru👇 5.2/
Blog: laion.ai/blog/laion-5b/
Paper: openreview.net/forum?id=M3Y74…
Stable Diffusion V2 release, very exciting stuff: stability.ai/blog/stable-di…
5.3/
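The "CLIP-filtered" part of the TLDR boils down to a similarity cutoff. A toy sketch with stand-in embedding vectors (the real pipeline runs actual CLIP image/text encoders over web-scraped pairs, and the 0.28 cosine threshold here is an assumption on my part):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_filter(pairs, threshold=0.28):
    # keep only (image_emb, text_emb, url) pairs whose embeddings agree
    return [p for p in pairs if cosine(p[0], p[1]) >= threshold]

# stand-in embeddings: one well-aligned pair, one mismatched pair
good = (np.ones(4), np.ones(4), "https://example.com/cat.jpg")
bad = (np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]), "https://example.com/noise.jpg")
print([p[2] for p in clip_filter([good, bad])])  # ['https://example.com/cat.jpg']
```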
Beyond neural scaling laws: beating power law scaling via data pruning. Sorscher et al, @StanfordAILab @MetaAI. TLDR: it is possible to vastly outperform the neural scaling law by carefully choosing the training examples instead of mindlessly collecting more data at random. 6.1/
The scaling law is actually highly unsustainable: a tiny drop in test error may require 10x more compute and energy. This paper lays out recipes, grounded in statistical mechanics theory, to achieve *exponential* scaling instead. Carefully curating a small subset goes a long way! 6.2/
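Data pruning itself is just score-based subset selection. A toy sketch (random scores stand in for a real difficulty metric such as a classification margin or prototype distance, which is where the paper's actual analysis lives):

```python
import numpy as np

def prune_by_score(X, y, scores, keep_frac, keep="hard"):
    """Keep the top (or bottom) keep_frac of examples ranked by difficulty."""
    k = int(len(X) * keep_frac)
    order = np.argsort(scores)                # ascending: easy -> hard
    idx = order[-k:] if keep == "hard" else order[:k]
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 2, size=1000)
scores = rng.random(1000)                     # stand-in difficulty scores
Xp, yp = prune_by_score(X, y, scores, keep_frac=0.3)
print(Xp.shape)  # (300, 8)
```

One of the paper's punchlines is that the right `keep` flag depends on scale: with abundant data you should keep the *hard* examples, but in the scarce-data regime keeping the easy ones works better.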
Gradient Descent: The Ultimate Optimizer. Chandra et al, @MIT_CSAIL @MetaAI. TLDR: don’t want to tune hyperparameters (learning rate, momentum)? SGD can become hyper-SGD and tune those for you! How about hyper-hyperparameters then? Then hyper-hyper-SGD 🤯! Can be ∞ stacked! 7.1/
Introduces a wicked smart way to repurpose existing auto-diff engines to compute hypergradients, even for complex ones like Adam’s momentum beta params! It’s so simple and elegant that hyper-SGD can *recursively* call itself to become ever more robust to initialization. 7.2/
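The mechanism is easy to demystify on a 1-D quadratic. A hand-derived sketch (I write out the d(loss)/d(lr) chain rule manually; the paper's actual contribution is getting an autodiff engine to produce it automatically, and stacking the trick recursively):

```python
def hyper_sgd(w=5.0, lr=0.01, hyper_lr=1e-3, steps=100):
    """Minimize f(w) = w^2 while SGD also tunes its own learning rate."""
    for _ in range(steps):
        g = 2.0 * w                   # df/dw
        w_next = w - lr * g           # the usual SGD step
        g_lr = 2.0 * w_next * (-g)    # d f(w_next)/d lr via the chain rule
        lr -= hyper_lr * g_lr         # hyper-step: gradient descent on lr
        w = w_next
    return w, lr

w, lr = hyper_sgd()
print(w, lr)  # w ends up near 0; lr has adapted upward from 0.01 on its own
```

With the fixed initial lr of 0.01, 100 steps barely dent the loss; letting the hyper-step grow lr makes the same budget converge. That is the "robust to initialization" point in 7.2.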
Using natural language and program abstractions to instill human inductive biases in machines. Kumar et al, @Princeton. TLDR: training agents to predict representations from natural language descriptions and induced programs will guide them towards more human-like behaviors. 8.1/
This paper shows that language and programs can serve as repositories of abstract human prior knowledge. An agent can harvest these inductive biases in a meta-RL setting. Very interesting human studies and comparisons with synthetic data devoid of human priors. 8.2/
Elucidating the Design Space of Diffusion-Based Generative Models. Karras et al, @nvidia. TLDR: presents a surgical analysis of different components in the training pipeline of diffusion models, and derives new techniques to improve generation results significantly. 9.1/
The paper recommends many best practices: for example, a new sampling procedure that greatly reduces the number of sampling steps during synthesis, an improved distribution of noise levels during training, and other useful tricks like non-leaking augmentation. 9.2/
A Neural Corpus Indexer for Document Retrieval. Wang et al, @MSFTResearch. TLDR: Neural Corpus Indexer is a new seq2seq model that generates relevant document identifiers directly for a specific query, and significantly improves information retrieval performance. 10.1/
Traditional retrieval systems are based on vector embeddings of docs and nearest neighbor search. This work demos an end-to-end differentiable model that greatly simplifies the search pipeline, and has the potential to unify retrieval, ranking, and QA in a single framework. 10.2/
On-Demand Sampling: Learning Optimally from Multiple Distributions. Haghtalab et al, @berkeley_ai. TLDR: this paper designs sample-efficient algorithms for learning from multiple distributions, with strong theoretical complexity improvements over prior work. 11.1/
Multi-distribution learning has important applications in ML fairness (e.g. socio-economically diverse populations), federated learning, and multi-agent collaboration. The optimal algorithm should sample on demand, since the distributions may be imbalanced or overlapping. 11.2/
A very theory-dense paper that shows optimality across different multi-distribution paradigms: openreview.net/forum?id=FR289…
11.3/
Is Out-of-Distribution Detection Learnable? Fang et al, many affiliations. TLDR: unfortunately, OOD detection is impossible to learn under some conditions - but the good news is these conditions may not hold in some practical scenarios! 12.1/
Our familiar supervised learning assumes that the test data is in-distribution, but the real world is messy and gives us OOD data at deployment. When is OOD detectable, and under what conditions? This paper uses PAC learning theory to answer these questions rigorously. 12.2/
Riemannian Score-Based Generative Modelling. De Bortoli et al, PSL University Paris, @UniofOxford. TLDR: introduces Riemannian Score-based Generative Models (RSGMs), a class of generative models extending SGMs to Riemannian manifolds (as opposed to data in Euclidean space). 13.1/
Diffusion models have taken generative AI by storm, but most assume a flat manifold. In domains like robotics, geoscience, or protein folding, the data is better described on Riemannian manifolds. Thanks to this paper, we may soon have Stable Diffusion for climate science! 13.2/
High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Arous et al, @nyuniversity @berkeley_ai @UWaterloo. TLDR: this paper investigates the scaling limits of stochastic gradient descent with constant stepsize in the high-dimensional regime. 14.1/
The core contribution is to develop a unified approach to the scaling limits of SGD in high-dimensions with constant learning rate that allows us to understand a broad range of estimation tasks. 14.2/
Gradient Estimation with Discrete Stein Operators. Shi et al, @StanfordAILab @Tsinghua_Uni @DeepMind @MSFTResearch. TLDR: introduces a variance reduction technique based on Stein operators for discrete distributions; greatly improves the quality of gradient estimation. 15.1/
Discrete variables make a neural network non-differentiable. A common workaround for gradient estimation is REINFORCE, but it suffers from high variance. This paper develops a high-performance method, “RODEO”, that augments REINFORCE with control variates built from Stein operators. 15.2/
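The variance problem (and the control-variate fix) is easy to demo on a Bernoulli toy. A sketch where a simple mean baseline stands in for the paper's Stein-operator control variates, just to make the variance effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_samples(theta, f, n=10000, baseline=0.0):
    """Per-sample REINFORCE estimates of d/d(theta) E_{x~Bern(sigmoid(theta))}[f(x)]."""
    p = 1.0 / (1.0 + np.exp(-theta))        # sigmoid
    x = (rng.random(n) < p).astype(float)   # Bernoulli samples
    score = x - p                           # d log p(x) / d theta
    return (f(x) - baseline) * score

f = lambda x: (x - 0.45) ** 2
plain = reinforce_samples(0.0, f)
with_cv = reinforce_samples(0.0, f, baseline=(f(0.0) + f(1.0)) / 2)
print(plain.var() > with_cv.var())  # True: the baseline slashes the variance
```

Both estimators are unbiased for the same gradient; only the spread differs. RODEO's contribution is constructing much better control variates than a constant baseline, tailored to discrete distributions via Stein operators.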
Lastly, 🥁 drum roll 🥁 - the Test of Time Award goes to … AlexNet! No surprise there. AlexNet is the reason I have a job today, the bread & butter that keeps a field fed🍞. All hail Alex!
No one explains the deep learning trend better than @karpathy:
16/
That’s the end of our whirlwind tour of all 15 Outstanding Papers. Thanks for staying with me through this mega-thread! I plan to write more deep dives into the latest AI tsunami, so you’re welcome to follow 🙌! Happy NeurIPS Festival🥂!
END/🧵

Thread by Jim (Linxi) Fan (@DrJimFan).