The paper distills much of my thinking over the last 5-10 years about promising directions in AI.
It is basically what I'm planning to work on, and what I'm hoping to inspire others to work on, over the next decade. 2/N
Most people don't talk publicly about their research plans.
But I'm going beyond the spirit of Open Research by publishing ideas *before* the corresponding research is completed. 3/N
Topics addressed:
- An integrated, DL-based, modular, cognitive architecture.
- Using a world model and intrinsic cost for planning.
- Joint-Embedding Predictive Architecture (JEPA) as an architecture for world models that can handle uncertainty. 4/N
- Training JEPAs using non-contrastive Self-Supervised Learning.
- Hierarchical JEPA for prediction at multiple time scales.
- H-JEPAs can be used for hierarchical planning in which higher levels set objectives for lower levels.
- A configurable world model that can be tailored to the task at hand. 6/N
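To make the joint-embedding predictive idea concrete, here is a tiny numpy sketch, not from the paper: two hypothetical encoders map inputs x and y to embeddings, a predictor maps (s_x, z) to a guess of s_y, and the energy is the prediction error in embedding space. All names, shapes, and the grid search over the latent z are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and random weights (not from the paper).
D_IN, D_EMB, D_Z = 8, 4, 2
W_enc_x = rng.normal(size=(D_EMB, D_IN))
W_enc_y = rng.normal(size=(D_EMB, D_IN))
W_pred = rng.normal(size=(D_EMB, D_EMB + D_Z))

def energy(x, y, z):
    """Prediction error between predicted and actual y-embedding."""
    s_x = np.tanh(W_enc_x @ x)                         # encode x
    s_y = np.tanh(W_enc_y @ y)                         # encode y
    s_y_hat = np.tanh(W_pred @ np.concatenate([s_x, z]))  # predict s_y from (s_x, z)
    return float(np.sum((s_y_hat - s_y) ** 2))

# The latent z absorbs what is unpredictable about y: minimizing the
# energy over z (here by crude grid search) picks the best explanation.
x, y = rng.normal(size=D_IN), rng.normal(size=D_IN)
zs = [np.array([a, b]) for a in np.linspace(-1, 1, 9)
                       for b in np.linspace(-1, 1, 9)]
best_z = min(zs, key=lambda z: energy(x, y, z))
```

The key design choice this illustrates: prediction happens in representation space, not pixel space, so the model never has to reconstruct irrelevant detail of y.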
I express some of my opinions on the best path forward towards AI:
- scaling is necessary but not sufficient
- reward is not enough. Learning world models by observation-based SSL and the use of (differentiable) intrinsic objectives are required for sample-efficient learning.
7/N
- reasoning and planning come down to inference: finding a sequence of actions and latent variables that minimizes a (differentiable) objective. This is an answer to the question of how to make reasoning compatible with gradient-based learning.
8/N
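A toy sketch of that idea, in pure Python and entirely my own construction: with a known, differentiable world model and cost, an action sequence can be found by plain gradient descent. The linear dynamics, quadratic cost, and all function names here are illustrative assumptions.

```python
def rollout(x0, actions):
    """Apply the toy world model x_{t+1} = x_t + a_t to get the final state."""
    x = x0
    for a in actions:
        x = x + a
    return x

def plan(x0, goal, horizon=5, steps=200, lr=0.05, lam=0.1):
    """Gradient descent on the action sequence.

    Cost = lam * sum(a_t^2) + (x_T - goal)^2.  Because dx_T/da_t = 1
    under these dynamics, the gradient w.r.t. each action is
    2*lam*a_t + 2*(x_T - goal), so we can update all actions at once.
    """
    actions = [0.0] * horizon
    for _ in range(steps):
        x_T = rollout(x0, actions)
        g_final = 2.0 * (x_T - goal)
        actions = [a - lr * (2.0 * lam * a + g_final) for a in actions]
    return actions

acts = plan(x0=0.0, goal=1.0)
final = rollout(0.0, acts)   # ends up close to the goal of 1.0
```

Planning here is literally inference by optimization: the "reasoning" is the inner gradient-descent loop over actions, with no discrete search or symbol manipulation.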
- In that setting, explicit mechanisms for symbol manipulation are probably unnecessary
9/N
Many of the ideas in this proposal are not new and not mine.
But I've tried to integrate them into a coherent architecture.
I probably missed a lot of relevant references and would appreciate any literature pointers.
10/N
I have communicated about the content of this paper over the last few months:
- Blog post: ai.facebook.com/blog/yann-lecu…
- Talk hosted by Baidu:
11/N
About the raging debate regarding the significance of recent progress in AI, it may be useful to (re)state a few obvious facts:
(0) there is no such thing as AGI. Reaching "Human Level AI" may be a useful goal, but even humans are specialized.
1/N
(1) the research community is making *some* progress towards HLAI
(2) scaling up helps. It's necessary but not sufficient, because...
(3) we are still missing some fundamental concepts
2/N
(4) some of those new concepts are possibly "around the corner" (e.g. generalized self-supervised learning)
(5) but we don't know how many such new concepts are needed. We just see the most obvious ones.
(6) hence, we can't predict how long it's going to take to reach HLAI.
3/N
Researchers in speech recognition, computer vision, and natural language processing in the 2000s were obsessed with accurate representations of uncertainty.
1/N
This led to a flurry of work on probabilistic generative models such as Hidden Markov Models in speech, Markov random fields and constellation models in vision, and probabilistic topic models in NLP, e.g. with latent Dirichlet allocation.
2/N
There were debates at computer vision workshops about "generative models vs discriminative models". There were heroic-yet-futile attempts to build object recognition systems with non-parametric Bayesian methods.
3/N
"VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning" by Adrien Bardes, Jean Ponce, & Yann LeCun.
Accepted at ICLR 2022.
OpenReview/ICLR (camera-ready version + reviews): openreview.net/forum?id=xm6YD…
1/N
A simple method for Self-Supervised Learning of Joint-Embedding Architectures.
Basic idea: take 2 semantically similar inputs & train 2 networks to produce representations of those inputs that are (1) maximally informative about the input & (2) easy to predict from each other.
2/N
Objective function:
1. Variance: a hinge loss maintains the variance of each output component above a threshold (over the batch).
2. Invariance: make the two embeddings close to each other.
3. Covariance: decorrelate pairs of components of each embedding (over the batch).
3/N
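The three terms can be sketched in a few lines of numpy. This is my own illustrative rendering, not the reference implementation; the coefficient values and the epsilon inside the square root are assumptions chosen for the example.

```python
import numpy as np

def vicreg_loss(z1, z2, gamma=1.0, lam=25.0, mu=25.0, nu=1.0):
    """Sketch of the three VICReg terms on two batches of embeddings
    z1, z2 of shape (batch, dim).  Coefficients are illustrative."""
    n, d = z1.shape
    # Invariance: embeddings of the two views of a pair should match.
    inv = np.mean((z1 - z2) ** 2)
    # Variance: hinge keeps the std of each component above gamma,
    # preventing collapse to a constant output.
    var = 0.0
    for z in (z1, z2):
        std = np.sqrt(z.var(axis=0) + 1e-4)
        var += np.mean(np.maximum(0.0, gamma - std))
    # Covariance: penalize off-diagonal entries of the covariance
    # matrix so components carry decorrelated information.
    cov = 0.0
    for z in (z1, z2):
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (n - 1)
        cov += (np.sum(c ** 2) - np.sum(np.diag(c) ** 2)) / d
    return lam * inv + mu * var + nu * cov
```

Note that the variance and covariance terms act over the batch dimension while the invariance term acts per pair, which is what removes the need for negative samples.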
ConvNeXt: the debate heats up between ConvNets and Transformers for vision!
Very nice work from FAIR+BAIR colleagues showing that with the right combination of methods, ConvNets are better than Transformers for vision.
87.1% top-1 ImageNet-1k arxiv.org/abs/2201.03545 1/N
Some of the helpful tricks make complete sense: larger kernels, layer norm, fat layer inside residual blocks, one stage of non-linearity per residual block, separate downsampling layers.... 2/N
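The shape of such a block can be sketched in numpy. This is a deliberately naive, loop-based rendering of the pattern described above (depthwise conv, layer norm, inverted bottleneck, single nonlinearity, residual), not the actual model code; all weight shapes and the GELU approximation are assumptions for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the channel axis (last)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv(x, k):
    """Naive 7x7 depthwise convolution with 'same' padding.
    x: (H, W, C); k: (7, 7, C), one filter per channel."""
    H, W, C = x.shape
    p = k.shape[0] // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k.shape[0], j:j + k.shape[1], :]
            out[i, j] = np.sum(patch * k, axis=(0, 1))
    return out

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def convnext_block(x, k_dw, w1, w2):
    """Large-kernel depthwise conv -> LayerNorm -> 1x1 expand (4x)
    -> GELU -> 1x1 project -> residual; one nonlinearity per block."""
    y = depthwise_conv(x, k_dw)   # large 7x7 kernel
    y = layer_norm(y)             # LayerNorm instead of BatchNorm
    y = y @ w1                    # pointwise expansion, C -> 4C ("fat" layer)
    y = gelu(y)                   # the single nonlinearity in the block
    y = y @ w2                    # pointwise projection, 4C -> C
    return x + y                  # residual connection

rng = np.random.default_rng(0)
C = 4
x = rng.normal(size=(8, 8, C))
out = convnext_block(x,
                     rng.normal(size=(7, 7, C)) * 0.1,
                     rng.normal(size=(C, 4 * C)) * 0.1,
                     rng.normal(size=(4 * C, C)) * 0.1)
```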
"Learning in High Dimension Always Amounts to Extrapolation"
by Randall Balestriero, Jerome Pesenti, and Yann LeCun. arxiv.org/abs/2110.09485
Thread 1/N
1. Given a function on d-dimensional vectors known solely by its value on N training samples, interpolation is defined as estimating the value of the function on a new sample inside the convex hull of the N vectors.
2/N
2. under mild assumptions, a new sample has a very low probability of being inside the convex hull of training samples, *unless* N grows exponentially with d.
3/N
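A quick experiment, my own illustration rather than the paper's, makes the effect visible without any linear programming: the axis-aligned bounding box of the training samples contains their convex hull, so a point outside the box is necessarily outside the hull, and even this weaker test almost never passes in high dimension.

```python
import random

def inside_bbox(point, samples):
    """A point outside the bounding box of the samples is necessarily
    outside their convex hull (the box contains the hull)."""
    d = len(point)
    return all(min(s[j] for s in samples) <= point[j] <= max(s[j] for s in samples)
               for j in range(d))

def frac_inside(d, n_train=100, trials=300, seed=0):
    """Monte Carlo estimate of P(new uniform sample falls inside the
    bounding box of n_train uniform samples in [0,1]^d)."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        train = [[random.random() for _ in range(d)] for _ in range(n_train)]
        x = [random.random() for _ in range(d)]
        hits += inside_bbox(x, train)
    return hits / trials

# Per dimension, P(inside) = 1 - 2/(n_train + 1), so the overall
# probability is roughly (1 - 2/101)**d: ~0.98 at d=1, ~2% at d=200.
print(frac_inside(1), frac_inside(200))
```

With n_train fixed, keeping the probability high as d grows requires N to grow exponentially, which is the point of the paper's result.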
There were two patents on ConvNets: one for ConvNets with strided convolution, and one for ConvNets with separate pooling layers.
They were filed in 1989 and 1990 and allowed in 1990 and 1991. 1/N
We started working with a development group that built OCR systems based on the ConvNet. Shortly thereafter, AT&T acquired NCR, which was building check imagers/sorters for banks. Images were sent to humans for transcription of the amount. Obviously, they wanted to automate that.
2/N
A complete check reading system was eventually built that was reliable enough to be deployed.
Commercial deployment in banks started in 1995.
The system could read about half the checks (machine printed or handwritten) and sent the other half to human operators.
3/N