François Chollet
Feb 17, 2024
The "aha" moment when I realized that curve-fitting was the wrong paradigm for achieving generalizable modeling of problem spaces that involve symbolic reasoning came in early 2016.

I was trying every possible way to get an LSTM/GRU-based model to classify first-order logic statements, and each new attempt showed a bit more clearly than the last that my models were completely unable to learn to perform actual first-order logic -- despite the fact that this ability was definitely part of the representable function space. Instead, the models would inevitably latch onto statistical keyword associations to make their predictions.
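For concreteness, here's a minimal numpy sketch of that kind of setup -- a GRU-style recurrence reading a tokenized first-order logic statement, with a sigmoid head on the final state. The vocabulary, dimensions, and random weights are all illustrative placeholders, not the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tokenization of a first-order logic statement (hypothetical vocabulary).
vocab = {"forall": 0, "exists": 1, "x": 2, "P": 3, "(": 4, ")": 5, "->": 6, "Q": 7}
statement = ["forall", "x", "P", "(", "x", ")", "->", "Q", "(", "x", ")"]

d_emb, d_hid = 8, 16
E = rng.standard_normal((len(vocab), d_emb)) * 0.1          # token embeddings
Wz, Uz = rng.standard_normal((d_hid, d_emb)) * 0.1, rng.standard_normal((d_hid, d_hid)) * 0.1
Wr, Ur = rng.standard_normal((d_hid, d_emb)) * 0.1, rng.standard_normal((d_hid, d_hid)) * 0.1
Wh, Uh = rng.standard_normal((d_hid, d_emb)) * 0.1, rng.standard_normal((d_hid, d_hid)) * 0.1
w_out = rng.standard_normal(d_hid) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.zeros(d_hid)
for tok in statement:                      # GRU recurrence over the tokens
    x = E[vocab[tok]]
    z = sigmoid(Wz @ x + Uz @ h)           # update gate
    r = sigmoid(Wr @ x + Ur @ h)           # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    h = (1 - z) * h + z * h_tilde

p = sigmoid(w_out @ h)                     # e.g. "is this statement valid?"
```

The recurrence can in principle represent stateful, program-like computation over the tokens -- that's exactly why the failure to learn it is a statement about the training process rather than the architecture's expressiveness.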

It has been fascinating to see this observation echo again and again over the past 8 years.
From 2013 to 2016 I was actually quite convinced that RNNs could be trained to learn any program. After all, they're Turing-complete (or at least some of them are) and they learn a highly compressed model of the input:output mapping they're trained on (rather than mere pointwise associations). Surely they could perform symbolic program synthesis in some continuous latent program space?

Nope. They do in fact learn mere pointwise associations, and they're completely useless for program synthesis. The problem isn't what the function space can represent -- the problem is the learning process. It's SGD.
Ironically, Transformers are even worse in that regard -- mostly due to their strongly interpolative architecture prior. Multi-head attention literally hardcodes sample interpolation in latent space. There's also the fact that recurrence is a really helpful prior for symbolic programs, and Transformers lack it.
Not saying that Transformers are worse than RNNs, mind you -- Transformers are *the best* at *what deep learning does* (generalizing via interpolation), specifically *because* of their strongly interpolative architecture prior (MHA). They are, however, worse at learning symbolic programs (which RNNs also largely fail at anyway).
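The interpolation point can be made concrete. In scaled dot-product attention (the core of MHA), the softmax weights are nonnegative and sum to 1, so every output row is a convex combination of the value rows -- it can never leave their convex hull. A toy numpy check, with arbitrary random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                              # toy sequence length and width
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Scaled dot-product attention.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax rows: nonneg, sum to 1
out = weights @ V

# Each output row is a weighted average of value rows, so per dimension it
# stays within the range spanned by the values: interpolation, never
# extrapolation.
assert np.all(out.max(axis=0) <= V.max(axis=0) + 1e-9)
assert np.all(out.min(axis=0) >= V.min(axis=0) - 1e-9)
```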
As pointed out by @sirbayes, this paper formally investigates the observation from the first tweet -- that "my models were completely unable to learn to perform actual first-order logic -- despite the fact that this ability was definitely part of the representable function space. Instead, the models would inevitably latch onto statistical keyword associations to make their predictions"

The paper concludes: "while models can attain near-perfect test accuracy on training distributions, they fail catastrophically on other distributions; we demonstrate that they have learned to exploit statistical features rather than to emulate the correct reasoning function"

Basically, SGD will always latch onto statistical correlations as a shortcut for fitting the training distribution, which prevents it from finding the generalizable form of the target program (the one that would operate outside of the training distribution), despite that program being part of the search space.

starai.cs.ucla.edu/papers/ZhangIJ…
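A toy illustration of this shortcut effect -- synthetic data rather than the paper's setup: train a linear model on data where one "keyword" feature happens to correlate perfectly with the label, then evaluate on a distribution where that correlation is reversed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "statement" is 10 signed token features; feature 0 is a spurious
# keyword that perfectly tracks the label in the training distribution only.
def make_data(n, keyword_matches_label):
    y = rng.integers(0, 2, size=n)
    X = rng.choice([-1.0, 1.0], size=(n, 10))
    keyword = 2.0 * y - 1.0
    X[:, 0] = keyword if keyword_matches_label else -keyword
    return X, y.astype(float)

X_train, y_train = make_data(1000, keyword_matches_label=True)
X_test, y_test = make_data(1000, keyword_matches_label=False)  # correlation reversed

# Logistic regression trained by full-batch gradient descent.
w = np.zeros(10)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

train_acc = np.mean((X_train @ w > 0) == (y_train == 1))
test_acc = np.mean((X_test @ w > 0) == (y_test == 1))
# Near-perfect train accuracy from reading the keyword; catastrophic
# failure once the spurious correlation no longer holds.
```

Gradient descent drives the keyword weight up because it is the single most predictive direction on the training set -- exactly the "statistical features rather than the correct reasoning function" failure mode the paper describes, in miniature.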
There are essentially two main options to remedy this:

1. Find ways to perform active inference, so that the model adapts its learned program in contact with a new data distribution at test time. This would likely lead to some meaningful progress, but it isn't the ultimate solution -- more of an incremental improvement.

2. Change the training mechanism to something more robust than SGD, such as the MDL principle. This would pretty much require moving away from deep learning (curve fitting) altogether and embracing discrete program search instead (which I have advocated for many years as a way to tackle reasoning problems...)
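As an illustration of what discrete program search looks like in miniature (a made-up toy DSL, not a real synthesis system): enumerate compositions of primitives until one reproduces every input:output example exactly.

```python
from itertools import product

# A tiny illustrative DSL: programs are short pipelines of list primitives.
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": sorted,
    "double": lambda xs: [2 * x for x in xs],
    "drop_first": lambda xs: xs[1:],
}

def run(program, xs):
    for name in program:
        xs = list(PRIMITIVES[name](xs))
    return xs

def synthesize(examples, max_len=3):
    # Brute-force enumeration, shortest programs first.
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
program = synthesize(examples)   # -> ("sort", "double")
```

Unlike curve fitting, the found program is an exact, discrete object: it fits the examples perfectly or not at all, and whatever it does it keeps doing on inputs far outside the "training distribution". The cost is combinatorial search -- hence the appeal of using deep learning to guide it.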


More from @fchollet

Mar 25
ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. ARC-AGI-3 will be beaten when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time.

We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions.

Meanwhile, all frontier AI reasoning models score under 1% at this time.
You can go play some of the environments yourself - 25 of them are now public: arcprize.org
You can also enter the ARC-AGI-3 competition on Kaggle. Your AI agents will be tested on two separate private test sets of 55 environments.

kaggle.com/competitions/a…
Feb 21
Cloning any random piece of SaaS is something that could already be done before agentic coding, and the economics of it haven't changed meaningfully. Before, writing the clone would cost 0.5-1% of the valuation of the legacy SaaS company. Now it might be 0.1%. It doesn't make a difference -- if you can pull it off profitably today you could also have done it profitably in the past.

The code is a very small part of the process of making such a clone successful, and the reason legacy software often has bad UX is not that code was expensive to write.
Circa 2012 you had a lot of devs "cloning Twitter" as a weekend project. Reproducing the UI and features of any app was never difficult and was never particularly valuable.
Last I checked, Twitter is still around, despite having been cloned 10,000 times before. And IMO most legacy SaaS has even greater stickiness than a social network (which does have tremendous stickiness)
Aug 21, 2025
People ask me, "didn't you say before ChatGPT that deep learning had hit a wall and there would be no more progress?"

I have never said this. I was saying the opposite (that scaling DL would deliver). You might be thinking of Gary Marcus.

My pre-ChatGPT position (below) was that scaling up DL would keep delivering better and better results, and *also* that it wasn't the way to AGI (as I defined it: human-level skill acquisition efficiency).

This was a deeply unpopular position at the time (neither AI skeptic nor AGI-via-DL-scaling prophet). It is now completely mainstream.
People also ask, "didn't you say in 2023 that LLMs could not reason?"

I have also never said this. I am on the record across many channels (Twitter, podcasts...) saying that "can LLMs reason?" was not a relevant question, just semantics, and that the more interesting question was, "could they adapt to novel tasks beyond what they had been trained on?" -- and that the answer was no.

Also correct in retrospect, and a mainstream position today.
I have been consistently bullish on deep learning since 2013, back when deep learning was maybe a couple thousand people.

I have also been consistently bullish on scaling DL -- not as a way to achieve AGI, but as a way to create more useful models.
Mar 24, 2025
Today, we're releasing ARC-AGI-2. It's an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with.

It keeps the same format as ARC-AGI-1, while significantly increasing the signal strength it provides about a system's actual fluid intelligence. Expect more novelty, less redundancy, and deeper levels of concept recombination. There's a lot more focus on probing abilities that are still missing from frontier reasoning systems, like on-the-fly symbol interpretation, multi-step compositional reasoning, and context-dependent rules.

ARC-AGI-2 is fully human-calibrated. We tested these tasks with 400 people in live sessions, and we only kept tasks that could reliably be solved by multiple people. Each eval set (public, private, semi-private) has the exact same human difficulty -- average people in our test sample achieve 60% with no prior training, and a panel of 10 people achieve 100%.
ARC-AGI-2 dataset: github.com/arcprize/ARC-A…

Full details on the release: arcprize.org/blog/announcin…
In addition to the ARC-AGI-2 release, we're launching the ARC Prize 2025 competition, with a $700,000 grand prize for getting to 85%, as well as many other progress prizes. It will be live on Kaggle this week.

We're also reopening our public leaderboard for continuous benchmark of commercial frontier models (and any approach built on top of them). Any model that uses less than $10,000 of retail compute cost to solve the 120 tasks of the semi-private test set is eligible.
Jan 15, 2025
I'm joining forces with @mikeknoop to start Ndea (@ndeainc), a new AI lab.

Our focus: deep learning-guided program synthesis. We're betting on a different path to build AI capable of true invention, adaptation, and innovation.
Read about our goals here: ndea.com
We're really excited about our current research direction. We believe we have a small but real chance of achieving a breakthrough -- creating AI that can learn at least as efficiently as people, and that can keep improving over time with no bottlenecks in sight.
Jan 15, 2025
People scaled LLMs by ~10,000x from 2019 to 2024, and their scores on ARC stayed near 0 (e.g. GPT-4o at ~5%). Meanwhile a very crude program search approach could score >20% with hardly any compute.

Then OpenAI started adding test-time CoT search. ARC scores immediately shot up.
It's not about scale. It's about working on the right ideas.

Like deep learning-guided CoT synthesis, or program synthesis -- via search.
10,000x scale up: still flat at 0

Add CoT search, similar model scale: boom