François Chollet
Feb 17 · 6 tweets
The "aha" moment when I realized that curve-fitting was the wrong paradigm for achieving generalizable modeling of problems spaces that involve symbolic reasoning was in early 2016.

I was trying every possible way to get an LSTM/GRU-based model to classify first-order logic statements, and each new attempt showed a bit more clearly than the last that my models were completely unable to learn to perform actual first-order logic -- despite the fact that this ability was definitely part of the representable function space. Instead, the models would inevitably latch onto statistical keyword associations to make their predictions.

It has been fascinating to see this observation echo again and again over the past 8 years.
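For readers who want a concrete picture of the setup described above, here is a minimal sketch of that kind of model (not the original 2016 code; the vocabulary size, sequence length, and binary valid/invalid labels are assumptions for illustration):

```python
import keras
from keras import layers

# Assumed setup: first-order logic statements tokenized into integer sequences,
# with a binary label to predict (e.g. entailed vs. not entailed).
VOCAB_SIZE = 1000   # assumed symbol/keyword vocabulary size
MAX_LEN = 64        # assumed maximum statement length in tokens

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(inputs)
x = layers.LSTM(256)(x)                          # or layers.GRU(256)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_tokens, train_labels, validation_data=(val_tokens, val_labels))
```

A model like this can fit the training set, but nothing in the training process forces it to learn the logic itself rather than label-correlated keywords -- which is exactly the observation above.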
From 2013 to 2016 I was actually quite convinced that RNNs could be trained to learn any program. After all, they're Turing-complete (or at least some of them are) and they learn a highly compressed model of the input:output mapping they're trained on (rather than mere pointwise associations). Surely they could perform symbolic program synthesis in some continuous latent program space?

Nope. They do in fact learn mere pointwise associations, and they are completely useless for program synthesis. The problem isn't with what the function space can represent -- the problem is the learning process. It's SGD.
Ironically, Transformers are even worse in that regard -- mostly due to their strongly interpolative architecture prior: multi-head attention literally hardcodes sample interpolation in latent space. There's also the fact that recurrence is a really helpful prior for symbolic programs, and Transformers give it up.
Not saying that Transformers are worse than RNNs, mind you -- Transformers are *the best* at *what deep learning does* (generalizing via interpolation), specifically *because* of their strongly interpolative architecture prior (MHA). They are, however, worse at learning symbolic programs (which RNNs also largely fail at anyway).
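To see what "hardcodes sample interpolation in latent space" means concretely: dot-product attention outputs convex combinations of the value vectors. A minimal single-head sketch in plain NumPy (simplified from real multi-head attention):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    """q: (T_q, d), k: (T_k, d), v: (T_k, d_v)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores)          # each row is non-negative and sums to 1
    # Every output vector is a convex combination -- an interpolation -- of
    # the value vectors; attention can only mix what is already in `v`.
    return weights @ v

q, k, v = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
out = dot_product_attention(q, k, v)
print(out.shape)   # (4, 8)
```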
As pointed out by @sirbayes, this paper has a formal investigation into the observation from the first tweet -- that "my models were completely unable to learn to perform actual first-order logic -- despite the fact that this ability was definitely part of the representable function space. Instead, the models would inevitably latch onto statistical keyword associations to make their predictions"

The paper concludes: "while models can attain near-perfect test accuracy on training distributions, they fail catastrophically on other distributions; we demonstrate that they have learned to exploit statistical features rather than to emulate the correct reasoning function"

Basically, SGD will always latch onto statistical correlations as a shortcut for fitting the training distribution, which prevents it from finding the generalizable form of the target program (the one that would operate outside of the training distribution), despite that program being part of the search space. starai.cs.ucla.edu/papers/ZhangIJ…
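Here is a toy illustration of that failure mode (my construction, not the paper's setup): the true rule is the parity of 8 "content" bits, which the network below can represent, but the training distribution also contains a spurious "keyword" feature that agrees with the label 95% of the time. With a modest training budget, SGD will typically settle for the keyword, so accuracy looks fine in-distribution and collapses to chance once the correlation is broken.

```python
import numpy as np
import keras
from keras import layers

rng = np.random.default_rng(0)

def make_data(n, shortcut_corr):
    # True target: parity of 8 "content" bits -- a simple symbolic rule.
    content = rng.integers(0, 2, size=(n, 8))
    y = content.sum(axis=1) % 2
    # Spurious "keyword" feature that agrees with the label with
    # probability `shortcut_corr` under this distribution.
    agree = rng.random(n) < shortcut_corr
    keyword = np.where(agree, y, 1 - y)
    return np.column_stack([content, keyword]).astype("float32"), y

x_train, y_train = make_data(20_000, shortcut_corr=0.95)
x_iid, y_iid = make_data(5_000, shortcut_corr=0.95)   # same distribution
x_ood, y_ood = make_data(5_000, shortcut_corr=0.50)   # correlation broken

model = keras.Sequential([
    keras.Input(shape=(9,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)

print("in-distribution accuracy:", model.evaluate(x_iid, y_iid, verbose=0)[1])      # ~0.95
print("out-of-distribution accuracy:", model.evaluate(x_ood, y_ood, verbose=0)[1])  # ~0.5
```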
There are essentially two main options to remedy this:

1. Find ways to perform active inference, so that the model adapts its learned program in contact with a new data distribution at test time. This would likely lead to some meaningful progress, but it isn't the ultimate solution -- more of an incremental improvement.

2. Change the training mechanism to something more robust than SGD, such as the MDL principle. This would pretty much require moving away from deep learning (curve fitting) altogether and embracing discrete program search instead (which I have advocated for many years as a way to tackle reasoning problems...)
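For a rough sense of what option 2 can look like in practice, here is a minimal sketch of discrete program search with an MDL-style preference for short programs (a hypothetical toy DSL made up for illustration, not a proposal for the real thing):

```python
from itertools import product

# Hypothetical toy DSL: unary integer functions built by composing primitives.
PRIMITIVES = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "neg":    lambda x: -x,
}

def run(program, x):
    for name in program:          # apply primitives left to right
        x = PRIMITIVES[name](x)
    return x

def mdl_search(examples, max_len=4):
    """Return the shortest program consistent with the (input, output) examples.

    Enumerating by increasing length is a crude stand-in for the MDL principle:
    among programs that explain the data exactly, prefer the shortest description.
    """
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

# f(x) = (2x + 1)^2, described by three examples:
examples = [(1, 9), (2, 25), (3, 49)]
print(mdl_search(examples))  # ('double', 'inc', 'square')
```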

More from @fchollet

May 14
It's amazing to me that the year is 2024 and some people still equate task-specific skill and intelligence. There is *no* specific task that cannot be solved *without* intelligence -- all you need is a sufficiently complete description of the task (removing all test-time novelty and uncertainty), and you can achieve arbitrary levels of skill while entirely bypassing the problem of intelligence. In the limit, even a simple hashtable can be superhuman at anything.
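To make the hashtable remark concrete, here is a minimal sketch with a made-up toy task: once the task description is complete enough to enumerate every case, a lookup table reaches perfect skill with zero intelligence involved.

```python
# Hypothetical toy "task": multiply two integers in the range [0, 99].
# A complete task description lets us precompute every answer up front;
# the resulting table has perfect skill and no ability to handle anything
# outside its keys.
lookup = {(a, b): a * b for a in range(100) for b in range(100)}

def solve(a, b):
    return lookup[(a, b)]      # no reasoning at test time, just retrieval

print(solve(37, 52))           # 1924
print(solve(99, 99))           # 9801
# solve(123, 4) would raise KeyError: novelty is exactly what the table can't handle.
```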
The "AI" of today still has near-zero (though not exactly zero) intelligence, despite achieving superhuman skill at many tasks.

Here's one thing that AI won't be able to do within five years (if you extrapolate from the excruciatingly slow progress of the past 15 years): acquiring new skills as efficiently as humans, using the same data. The ARC benchmark is an attempt at measuring roughly that.
The point of general intelligence is to make it possible to deal with novelty and uncertainty, which is what our lives are made of. Intelligence is the ability to improvise and adapt in the face of situations you weren't prepared for (either by your evolutionary history or by your past experience) -- to efficiently acquire skills at novel tasks, on the fly.
Apr 28
Many of the people who are concerned with falling birthrates aren't willing to consider the set of policies that would address the problem -- aggressive tax breaks for families, free daycare, free education, free healthcare, and building more/denser housing to slash the price of homes.

Most people want children, but can't afford them.
I've always found it striking how very rich couples (50M+ net worth) all tend to have over 3 children (and often many more). And how young women always say they want children -- yet in practice they delay family building because they are forced to focus on financial stability and therefore career. When money is no object, families have 3+ children.
For middle incomes (below 1M/year) fertility goes down as income goes up, because *the cost of raising children increases with income* due to *opportunity cost*. If you make $150k and stand to eventually grow to $300k, you are losing a lot of money by quitting your job to raise children (on top of the prohibitive cost of raising children -- which also goes up as your incomes and thus standards go up). You are thus *more* likely to postpone having children.

Starting at 1M/year, fertility rates rise again. And couples that make 5+M/year get to have the number of children they actually want -- which is almost always more than 3, and quite often 5+.
Mar 31
Memorization (which ML has solely focused on) is not intelligence. And because any task that does not involve significant novelty and uncertainty can be solved via memorization, *skill* is never a sign of intelligence, no matter the task.
Intelligence is found in the ability to pick up new skills quickly & efficiently -- at tasks you weren't prepared for. To improvise, adapt and learn.
Here's a paper you can read about it.

It introduced a formal definition of intelligence, as well as a benchmark to capture that definition in practical terms. Although it was developed before the rise of LLMs, current state-of-the-art LLMs such as Gemini Ultra, Claude 3, or GPT-4 are not able to score higher than a few percent on that benchmark. arxiv.org/abs/1911.01547
Mar 13
We benchmarked a range of popular models (SegmentAnything, BERT, StableDiffusion, Gemma, Mistral) with all Keras 3 backends (JAX/TF/PT). Key findings:

1. There's no "best" backend. The fastest backend often depends on your specific model architecture.

2. Keras 3 with the right backend is consistently a lot faster than reference PT (compiled) implementations. Often by 150%+.

3. Keras 3 models are fast without requiring any custom performance optimizations. It's all "stock" code.

4. Keras 3 is faster than Keras 2.

Details here: keras.io/getting_starte…
Finding 1: the fastest backend for a given model typically alternates between XLA-compiled JAX and XLA-compiled TF. Plus, you might want to debug/prototype in PT before training/inferencing with JAX or TF.

The ability to write framework-agnostic models and pick your backend later is a game-changer.
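A minimal sketch of what that looks like in practice (the model here is a placeholder; the point is that the backend is chosen via an environment variable before Keras is imported, and the model code itself doesn't change):

```python
import os
# Pick the backend before importing Keras: "jax", "tensorflow", or "torch".
os.environ["KERAS_BACKEND"] = "jax"

import keras
from keras import layers

# The same model definition runs unchanged on any of the three backends.
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, ...)  # prototype on torch, train on jax/tf, etc.
```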
Finding 2: Keras 3 with the best-performing backend outperforms reference native PT implementations (compiled) for all models we tried.

Notably, 5 out of 10 tasks demonstrate speedups exceeding 100%, with a maximum speedup of 340%.

If you're not leveraging this advantage for any large model training run, you're wasting GPU time -- and thus throwing away money.
Mar 12
It doesn't take a whole lot of pondering to figure out that the thesis "humans only seem smart because they're 'trained' on huge amounts of 'data' via their visual system (almost like LLMs!)" doesn't hold any water.

For instance -- congenitally blind people are not less intelligent. Vision isn't fundamental to what makes us human. A rich learning environment is still a rich learning environment when apprehended through restricted sensorimotor modalities.
Humans span an incredibly wide range of sensorimotor affordances. Some are blind, some are deaf, some don't have hands. They might grow up in radically different environments -- some with just three other humans around them, some with thousands. Some with libraries of books, some without any writing.

In the end, though, it doesn't make a huge difference -- all of them become fully-fledged, intelligent humans. Because no matter what, they're all extracting information from the world at a roughly constant rate: the intrinsic rate at which the brain processes information. Which is an infinitesimal fraction of the bandwidth of the human sensorimotor feed.

If your senses are missing something, you'll just redirect your fixed-rate attention to something else, and won't be much poorer for it.
That's also why the influence of genes on fluid intelligence is overwhelmingly greater than that of the environment. If "training data" was so important, you'd expect environment and education to be critical to intelligence. They aren't. Twins raised in vastly different situations end up about as smart.
Feb 21
Thread: quick API overview of Gemma, the new open-source LLM by Google.

First, let's make sure you have the latest Keras and KerasNLP installed, and let's set up your Kaggle credentials, so you can download the assets from Kaggle.
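The code from the original screenshot isn't recoverable here; the following is a minimal sketch of that step. The package names and Kaggle environment variables are the standard ones, but treat the details as assumptions and check the Keras docs for current instructions.

```python
# In a shell, grab the latest releases first:
#   pip install -U keras keras-nlp

import os

# Kaggle credentials so KerasNLP can download the Gemma assets from Kaggle.
# Create an API token at kaggle.com/settings and use your own values here.
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"   # placeholder
os.environ["KAGGLE_KEY"] = "your_kaggle_api_key"         # placeholder
```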
Next, let's instantiate the model and generate some text. You have access to 2 different sizes, 2B & 7B, and 2 different versions per size: base & instruction-tuned.

The first call will download the weights.
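The screenshot is likewise missing; here is a minimal sketch of this step with KerasNLP (the preset names follow the pattern used at the Gemma launch -- `gemma_2b_en`, `gemma_7b_en`, and the `gemma_instruct_*_en` variants -- but verify against the current KerasNLP docs):

```python
import keras_nlp

# Instantiate the 2B base model from its Kaggle preset; the first call
# downloads the weights.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Generate some text from a prompt.
print(gemma_lm.generate("The fall of the Roman Empire was caused by", max_length=64))
```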
I generally recommend running inference in float16 or bfloat16 (depending on the hardware you're using). You can either globally configure the dtype policy in Keras (do it before creating the model), or pass the `dtype` argument to your model.

Note that operations like softmax will use float32 regardless, for stability.
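A sketch of the two options described above, assuming Keras 3's `keras.config` API (argument names worth double-checking against the docs):

```python
import keras
import keras_nlp

# Option 1: set the global dtype policy *before* creating the model.
keras.config.set_dtype_policy("bfloat16")
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Option 2: pass the dtype directly when instantiating the model.
# gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en", dtype="float16")
```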
