Excited to go to the NeurIPS conference tomorrow! It's an annual gala for AI. Many revolutionary ideas debuted here, like AlexNet, the Transformer, & GPT-3.
I read *all 15* Outstanding Papers and can’t wait to share my thoughts with you all. Here's your front row seat to the festival:🧵
For each paper, I’ll give a TLDR and a note on why I think it’s significant. I may also link interesting blogs and websites that dive deeper. Original authors are welcome to chime in to expand the discussion or correct any mistakes! Tweets are numbered by paper.
Training Compute-Optimal Large Language Models. Hoffmann et al, @DeepMind. TLDR: introduces a new 70B LM called "Chinchilla”🐭 that outperforms much bigger LMs (GPT-3, Gopher). To be compute-optimal, model size and training data must be scaled equally. 1.1/
Chinchilla’s findings are profound. It shows that most LLMs are severely starved of data and under-trained. Under the new scaling law, even if you pump a quadrillion parameters into a model (the GPT-4 urban myth), the gains won't match what you'd get from 4x more training tokens 😱 1.2/
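To make that rule of thumb concrete, here's a back-of-envelope sketch (my own simplification, not the authors' code) of the compute-optimal recipe: training compute C ≈ 6·N·D FLOPs, with parameters N and tokens D scaled in equal proportion, roughly D ≈ 20·N:

```python
# Back-of-envelope Chinchilla arithmetic (my simplification, not the paper's code).
# Assumes C ≈ 6 * N * D and the roughly-20-tokens-per-parameter rule of thumb.
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    # Solve C = 6 * N * D with D = 20 * N  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Gopher-scale budget (~5.76e23 FLOPs) -> roughly a 70B model on ~1.4T tokens,
# which is exactly the regime Chinchilla was trained in.
n, d = chinchilla_optimal(5.76e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```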
Is this why OpenAI created the “Whisper” speech recognition system, so they can feed GPT-4 with another trillion text tokens harvested from YouTube audio? I guess we’ll find out soon!
Chinchilla paper: openreview.net/forum?id=iBBcR…
Fantastic blog post: lesswrong.com/posts/6Fpvch8R…
1.3/
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Saharia et al, @GoogleAI. TLDR: “Imagen” is a large text-to-image and super-resolution diffusion model that generates beautiful photorealistic images. Beats DALL·E 2 (May 2022) in human ratings. 2.1/
The biggest advance over DALL·E 2 is a much stronger text encoder (frozen T5-XXL) trained on an enormous text corpus. DALL·E 2's CLIP text encoder is trained jointly with images, but it lags behind T5 in pure language understanding. The result is better image-text alignment. 2.2/
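For intuition, here's a rough sketch of the cascade as I understand it (names are placeholders, not a real API): a frozen T5-XXL encoder conditions a 64×64 base diffusion model, followed by 64→256 and 256→1024 super-resolution diffusion stages.

```python
# Hedged sketch of the Imagen cascade (placeholder names, not Google's code).
def imagen_generate(prompt, t5_encoder, base_64, sr_256, sr_1024):
    text_emb = t5_encoder(prompt)                       # frozen T5-XXL, text-only pre-training
    img_64 = base_64.sample(cond=text_emb)              # 64x64 text-to-image diffusion
    img_256 = sr_256.sample(img_64, cond=text_emb)      # 64 -> 256 super-resolution diffusion
    img_1024 = sr_1024.sample(img_256, cond=text_emb)   # 256 -> 1024 super-resolution diffusion
    return img_1024
```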
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. Deitke et al, @allen_ai. TLDR: ProcTHOR is a simulator that procedurally generates a large variety of interactive, customizable, and physics-enabled houses for training embodied agents. Huge open asset library! 3.1/
Like Chinchilla, embodied agent research also needs a ton of diverse data to scale. An agent generates its own experience data via interaction & exploration, so its abilities are upper-bounded by the simulator's complexity. ProcTHOR offers a scalable way to enrich that experience. 3.2/
@allen_ai created a number of sims for household robotics before, such as AI2THOR and ManipulaTHOR. I also want to highlight the BEHAVIOR project from @drfeifei’s lab, an initiative to massively scale up the number of household tasks; and the Habitat platform from @MetaAI 3.3/
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. Fan et al, @nvidia. I’m the lead author of this paper. We propose a 3-ingredient recipe for building “embodied GPT-3” that can perceive and act in an infinite world. Intro thread👇 4.1/
LAION-5B: An open large-scale dataset for training next generation image-text models. Schuhmann et al, @laion_ai. TLDR: an open dataset of 5.85 billion CLIP-filtered image-text pairs to democratize multimodal foundation models. It’s by far the largest such public dataset! 5.1/
Yes, this is the dataset that gave birth to Stable Diffusion @StabilityAI! SD is an intelligent compressor of all human art, and a magic lamp that wakes the yearning artist inside all of us. SD has recently hit V2 and gotten smarter thanks to LAION-5B! Thread from @hardmaru👇 5.2/
Beyond neural scaling laws: beating power law scaling via data pruning. Sorscher et al, @StanfordAILab @MetaAI. TLDR: it is possible to vastly outperform the neural scaling law by carefully choosing the training examples instead of mindlessly collecting more data at random. 6.1/
The scaling law is actually highly unsustainable - a tiny drop in test error may require 10x more compute and energy. This paper lays out recipes to achieve *exponential* scaling instead through statistical mechanics theory. Carefully curating a small subset goes a long way! 6.2/
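A minimal sketch of the flavor of pruning studied in the paper (my simplification; the paper uses k-means in a self-supervised embedding space): score each example by how prototypical it is, then keep only a fraction, preferring hard examples when data is abundant and easy ones when it's scarce.

```python
import numpy as np

# Toy data-pruning sketch (my simplification, not the authors' code).
# Difficulty = distance to a single centroid; the paper uses distances to
# k-means centroids of self-supervised embeddings.
def prune_by_difficulty(embeddings: np.ndarray, keep_frac: float, keep_hard: bool = True):
    centroid = embeddings.mean(axis=0)
    difficulty = np.linalg.norm(embeddings - centroid, axis=1)
    order = np.argsort(difficulty)                # easy -> hard
    k = int(len(embeddings) * keep_frac)
    return order[-k:] if keep_hard else order[:k]

rng = np.random.default_rng(0)
subset_idx = prune_by_difficulty(rng.normal(size=(10_000, 128)), keep_frac=0.3)
```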
Gradient Descent: The Ultimate Optimizer. Chandra et al, @MIT_CSAIL @MetaAI. TLDR: don’t want to tune hyperparameters (learning rate, momentum)? SGD can become hyper-SGD and tune those for you! How about hyper-hyperparameters then? Then hyper-hyper-SGD 🤯! Can be ∞ stacked! 7.1/
Introduces a wicked smart way to repurpose existing auto-diff engines to compute hypergradients, even for complex ones like the momentum beta params of Adam! It’s so simple and elegant that HyperSGD can *recursively* call itself to be ever more robust to initialization. 7.2/
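Here's a toy re-derivation of the idea in PyTorch (my sketch, not the authors' implementation): keep the learning rate inside the autodiff graph through the weight update, so the next loss yields a hypergradient that tunes it. Stack the same trick on `hyper_lr` and you get hyper-hyper-SGD.

```python
import torch

# One level of "hyper-SGD" on a toy quadratic (my simplification).
w = torch.randn(10, requires_grad=True)
lr = torch.tensor(0.01, requires_grad=True)       # hyperparameter, now learnable
hyper_lr = 1e-4                                   # hyper-hyperparameter (could itself be learned)

def loss_fn(w):
    return (w ** 2).sum()

for step in range(100):
    g, = torch.autograd.grad(loss_fn(w), w)
    w_next = w - lr * g.detach()                  # the update is differentiable in lr
    hyper_g, = torch.autograd.grad(loss_fn(w_next), lr)
    with torch.no_grad():
        lr -= hyper_lr * hyper_g                  # gradient step on the learning rate itself
    w = w_next.detach().requires_grad_(True)

print(float(loss_fn(w)), float(lr))
```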
Using natural language and program abstractions to instill human inductive biases in machines. Kumar et al, @Princeton. TLDR: training agents to predict representations from natural language descriptions and induced programs will guide them towards more human-like behaviors. 8.1/
This paper shows that language and programs can be used as repositories of abstract prior knowledge of humans. An agent can harvest these inductive biases in a meta-RL setting. Very interesting human studies and comparison with synthetic data devoid of human prior. 8.2/
Elucidating the Design Space of Diffusion-Based Generative Models. Karras et al, @nvidia. TLDR: presents a surgical analysis of different components in the training pipeline of diffusion models, and derives new techniques to improve generation results significantly. 9.1/
There are many best practices recommended by the paper. For example, a new sampling procedure that greatly reduces the number of sampling steps during synthesis, an improved distribution of noise levels during training, and other useful tricks like non-leaking augmentation. 9.2/
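One concrete ingredient, to the best of my reading: the paper's noise-level schedule for sampling, which spaces the sigmas so a few dozen steps suffice (rho = 7 and sigma in [0.002, 80] are defaults reported in the paper).

```python
import numpy as np

# Karras et al. sampling noise schedule (my transcription of the paper's formula).
def karras_sigmas(n_steps: int, sigma_min: float = 0.002, sigma_max: float = 80.0, rho: float = 7.0):
    ramp = np.linspace(0, 1, n_steps)
    sigmas = (sigma_max ** (1 / rho) + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)                 # final step lands exactly on sigma = 0

print(karras_sigmas(18))                          # 18 steps is in the range the paper uses
```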
A Neural Corpus Indexer for Document Retrieval. Wang et al, @MSFTResearch. TLDR: Neural Corpus Indexer is a new seq2seq model that generates relevant document identifiers directly for a specific query, and significantly improves information retrieval performance. 10.1/
Traditional retrieval systems are based on vector embeddings of docs and nearest neighbor search. This work demos an end-to-end differentiable model that greatly simplifies the search pipeline, and has the potential to unify retrieval, ranking, and QA in a single framework. 10.2/
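A toy illustration of the core idea (mine, not the authors' code): give each document a structured identifier, and let the seq2seq decoder generate it token by token, constrained so every partial prefix stays on a path to a real document.

```python
# Toy trie-constrained decoding over structured doc IDs (hypothetical identifiers).
DOC_IDS = {"3-1-4": "doc_A", "3-1-5": "doc_B", "2-7-1": "doc_C"}

def valid_next_tokens(prefix: list[str]) -> set[str]:
    """Tokens that keep the partial identifier on a real document's path."""
    out = set()
    for docid in DOC_IDS:
        parts = docid.split("-")
        if parts[:len(prefix)] == prefix and len(parts) > len(prefix):
            out.add(parts[len(prefix)])
    return out

# A real system scores these continuations with the trained decoder; here we
# just show the constrained expansion.
print(valid_next_tokens(["3"]))       # {'1'}
print(valid_next_tokens(["3", "1"]))  # {'4', '5'}
```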
On-Demand Sampling: Learning Optimally from Multiple Distributions. Haghtalab et al, @berkeley_ai. TLDR: this paper designs optimally sample-efficient algorithms for learning from multiple data distributions, with strong theoretical complexity improvements over prior work. 11.1/
Multi-distribution learning has important applications in ML fairness (e.g. socio-economically diverse populations), federated learning, and multi-agent collaboration. The optimal algorithm should sample on demand, since the distributions may be imbalanced or overlapping. 11.2/
A very theory-dense paper that shows optimality across different multi-distribution paradigms: openreview.net/forum?id=FR289…
11.3/
Is Out-of-Distribution Detection Learnable? Fang et al, many affiliations. TLDR: unfortunately, OOD detection is impossible to learn under some conditions - but the good news is these conditions may not hold in some practical scenarios! 12.1/
Our familiar supervised learning assumes that the test data is in-distribution, but the real world is messy and serves up OOD data at deployment. When is OOD detectable, and under what conditions? This paper uses PAC learning theory to answer these questions rigorously. 12.2/
Riemannian Score-Based Generative Modelling. De Bortoli et al, PSL University Paris, @UniofOxford. TLDR: introduces Riemannian Score-based Generative Models (RSGMs), a class of generative models extending SGMs to Riemannian manifolds (as opposed to data in flat Euclidean space). 13.1/
Diffusion models have taken generative AI by storm, but most assume a flat manifold. In domains like robotics, geoscience, or protein folding, the data is better described on Riemannian manifolds. Thanks to this paper, we may soon have Stable Diffusion for climate science! 13.2/
High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Arous et al, @nyuniversity @berkeley_ai @UWaterloo. TLDR: this paper investigates the scaling limits of stochastic gradient descent with constant stepsize in the high-dimensional regime. 14.1/
The core contribution is to develop a unified approach to the scaling limits of SGD in high-dimensions with constant learning rate that allows us to understand a broad range of estimation tasks. 14.2/
Gradient Estimation with Discrete Stein Operators. Shi et al, @StanfordAILab @Tsinghua_Uni @DeepMind @MSFTResearch. TLDR: introduces a variance reduction technique based on Stein operators for discrete distributions; greatly improves the quality of gradient estimation. 15.1/
Discrete variables make a neural network non-differentiable. A common workaround to estimate gradients is REINFORCE, but it suffers from high variance. This paper develops a high-performance method “RODEO” that augments REINFORCE with control variates derived from Stein operators. 15.2/
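For background, a tiny numerical illustration (mine, not RODEO itself) of why control variates matter: the REINFORCE estimator stays unbiased when you subtract a baseline, but its variance can drop by an order of magnitude.

```python
import numpy as np

# REINFORCE for d/d(theta) E_x[f(x)], x ~ Bernoulli(theta), with/without a baseline.
rng = np.random.default_rng(0)
theta = 0.3
f = lambda x: x ** 2 + 0.1                        # black-box objective of a discrete sample

def reinforce(n: int, baseline: float = 0.0):
    x = rng.binomial(1, theta, size=n)
    score = x / theta - (1 - x) / (1 - theta)     # d log p(x) / d theta
    return (f(x) - baseline) * score              # per-sample gradient estimates

naive = reinforce(100_000)
with_cv = reinforce(100_000, baseline=0.6)        # b = average of f(0) and f(1)
print(naive.mean(), naive.var())                  # mean ≈ 1.0 (true gradient), var ≈ 3
print(with_cv.mean(), with_cv.var())              # mean ≈ 1.0, var ≈ 0.2
```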
Lastly, 🥁 drum roll 🥁 - the Test of Time Award goes to … AlexNet! No surprise there. AlexNet is the reason I have a job today, the bread & butter that keeps a field fed🍞. All hail Alex!
No one explains the deep learning trend better than @karpathy:
That’s the end of our whirlwind tour of all 15 Outstanding Papers. Thanks for staying with me through this mega-thread! I plan to write more deep dives into the latest AI tsunami, so feel free to follow 🙌! Happy NeurIPS Festival🥂!
END/🧵
GPT-3 is powerful but blind. The future of Foundation Models will be embodied agents that proactively take actions, endlessly explore the world, and continuously self-improve. What does it take? In our NeurIPS Outstanding Paper “MineDojo”, we provide a blueprint for this future:🧵
We argue that there are 3 main ingredients for generalist agents to emerge. First, an open-ended environment that allows an unlimited variety of tasks and goals. Earth is one example, as it is rich enough to forge an ever-expanding tree of life forms and behaviors. What else? 2/
Second, a large-scale knowledge base that teaches an AI not only *how* to do things, but also *what* are the useful things to do. GPT-3 learns from web text alone, but can we give our agent much richer data, such as video walkthroughs, multimedia tutorials, and free-form wiki? 3/
Today, a 120B model called “Galactica” was open-sourced by @paperswithcode. It’s capable of writing math notation, citations, code, chemical formulas, DNA sequences, etc. Here’s why I think Galactica is a huge milestone in open foundation models, scientific automation, and responsible AI: 🧵
Large language models have personalities. They are shaped not by the architecture, but by the training data. Models like GPT-3 and OPT are trained on text scraped from the internet at large, which unfortunately contains lots of irrelevant, misinformed, or toxic content. 2/🧵
In contrast, scientific texts like academic papers are mostly immune from these data plagues. They contain analytical text with a neutral tone, knowledge backed by evidence, and are written by people who wish to inform rather than inflame. A dataset born in the ivory tower. 3/🧵
We trained a transformer called VIMA that ingests *multimodal* prompts and outputs controls for a robot arm. A single agent is able to solve visual goal reaching, one-shot imitation from video, novel concept grounding, visual constraint satisfaction, etc. Strong scaling with model capacity and data!🧵
We envision that a generalist robot agent should have an intuitive and expressive interface for task specification, but text alone is not enough. We introduce a novel multimodal prompting framework that converts a wide spectrum of robotic tasks into one sequence modeling problem.
Our VIMA model (reads “v-eye-ma”) consists of a pre-trained T5 to encode multimodal prompts, and a transformer decoder to predict robot arm commands autoregressively. The decoder has alternating self- and cross-attention layers conditioned on the prompt.
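Here's a rough PyTorch sketch of that decoder pattern (my reconstruction, not the released VIMA code): each block self-attends over the robot's trajectory tokens, then cross-attends to the T5-encoded multimodal prompt before the feed-forward layer.

```python
import torch
import torch.nn as nn

# Alternating self-/cross-attention decoder block, conditioned on prompt tokens.
class PromptConditionedBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, traj_tokens, prompt_tokens, causal_mask):
        x = traj_tokens
        x = x + self.self_attn(x, x, x, attn_mask=causal_mask, need_weights=False)[0]
        x = x + self.cross_attn(x, prompt_tokens, prompt_tokens, need_weights=False)[0]
        return x + self.ffn(x)

block = PromptConditionedBlock()
traj = torch.randn(1, 16, 256)     # past observation/action tokens
prompt = torch.randn(1, 32, 256)   # T5-encoded multimodal prompt
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)   # causal mask
out = block(traj, prompt, mask)    # -> (1, 16, 256), fed to an action head
```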