François Chollet
Feb 17 · 6 tweets
The "aha" moment when I realized that curve-fitting was the wrong paradigm for achieving generalizable modeling of problems spaces that involve symbolic reasoning was in early 2016.

I was trying every possible way to get an LSTM/GRU-based model to classify first-order logic statements, and each new attempt showed a bit more clearly than the last that my models were completely unable to learn to perform actual first-order logic -- despite the fact that this ability was definitely part of the representable function space. Instead, the models would inevitably latch onto statistical keyword associations to make their predictions.

It has been fascinating to see this observation echo again and again over the past 8 years.
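For readers who want a concrete picture of the setup described above, here is a minimal sketch of that kind of model (not the original 2016 code; the vocabulary size, sequence length, and binary valid/invalid labels are assumptions for illustration):

```python
import keras
from keras import layers

# Assumed setup: first-order logic statements tokenized into integer sequences,
# with a binary label to predict (e.g. entailed vs. not entailed).
VOCAB_SIZE = 1000   # assumed symbol/keyword vocabulary size
MAX_LEN = 64        # assumed maximum statement length in tokens

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(inputs)
x = layers.LSTM(256)(x)                          # or layers.GRU(256)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_tokens, train_labels, validation_data=(val_tokens, val_labels))
```

A model like this can fit the training set, but nothing in the training process forces it to learn the logic itself rather than label-correlated keywords -- which is exactly the observation above.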
From 2013 to 2016 I was actually quite convinced that RNNs could be trained to learn any program. After all, they're Turing-complete (or at least some of them are) and they learn a highly compressed model of the input:output mapping they're trained on (rather than mere pointwise associations). Surely they could perform symbolic program synthesis in some continuous latent program space?

Nope. They do in fact learn mere pointwise associations, and they are completely useless for program synthesis. The problem isn't with what the function space can represent -- the problem is the learning process. It's SGD.
Ironically, Transformers are even worse in that regard -- mostly due to their strongly interpolative architecture prior: multi-head attention literally hardcodes sample interpolation in latent space. There's also the fact that recurrence is a really helpful prior for symbolic programs, and Transformers give it up.
Not saying that Transformers are worse than RNNs, mind you -- Transformers are *the best* at *what deep learning does* (generalizing via interpolation), specifically *because* of their strongly interpolative architecture prior (MHA). They are, however, worse at learning symbolic programs (which RNNs also largely fail at anyway).
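To see what "hardcodes sample interpolation in latent space" means concretely: dot-product attention outputs convex combinations of the value vectors. A minimal single-head sketch in plain NumPy (simplified from real multi-head attention):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    """q: (T_q, d), k: (T_k, d), v: (T_k, d_v)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores)          # each row is non-negative and sums to 1
    # Every output vector is a convex combination -- an interpolation -- of
    # the value vectors; attention can only mix what is already in `v`.
    return weights @ v

q, k, v = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
out = dot_product_attention(q, k, v)
print(out.shape)   # (4, 8)
```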
As pointed out by @sirbayes, this paper has a formal investigation into the observation from the first tweet -- that "my models were completely unable to learn to perform actual first-order logic -- despite the fact that this ability was definitely part of the representable function space. Instead, the models would inevitably latch onto statistical keyword associations to make their predictions"

The paper concludes: "while models can attain near-perfect test accuracy on training distributions, they fail catastrophically on other distributions; we demonstrate that they have learned to exploit statistical features rather than to emulate the correct reasoning function"

Basically, SGD will always latch onto statistical correlations as a shortcut for fitting the training distribution, which prevents it from finding the generalizable form of the target program (the one that would operate outside of the training distribution), despite that program being part of the search space. starai.cs.ucla.edu/papers/ZhangIJ…
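Here is a toy illustration of that failure mode (my construction, not the paper's setup): the true rule is the parity of 8 "content" bits, which the network below can represent, but the training distribution also contains a spurious "keyword" feature that agrees with the label 95% of the time. With a modest training budget, SGD will typically settle for the keyword, so accuracy looks fine in-distribution and collapses to chance once the correlation is broken.

```python
import numpy as np
import keras
from keras import layers

rng = np.random.default_rng(0)

def make_data(n, shortcut_corr):
    # True target: parity of 8 "content" bits -- a simple symbolic rule.
    content = rng.integers(0, 2, size=(n, 8))
    y = content.sum(axis=1) % 2
    # Spurious "keyword" feature that agrees with the label with
    # probability `shortcut_corr` under this distribution.
    agree = rng.random(n) < shortcut_corr
    keyword = np.where(agree, y, 1 - y)
    return np.column_stack([content, keyword]).astype("float32"), y

x_train, y_train = make_data(20_000, shortcut_corr=0.95)
x_iid, y_iid = make_data(5_000, shortcut_corr=0.95)   # same distribution
x_ood, y_ood = make_data(5_000, shortcut_corr=0.50)   # correlation broken

model = keras.Sequential([
    keras.Input(shape=(9,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)

print("in-distribution accuracy:", model.evaluate(x_iid, y_iid, verbose=0)[1])      # ~0.95
print("out-of-distribution accuracy:", model.evaluate(x_ood, y_ood, verbose=0)[1])  # ~0.5
```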
There are essentially two main options to remedy this:

1. Find ways to perform active inference, so that the model adapts its learned program in contact with a new data distribution at test time. This would likely lead to some meaningful progress, but it isn't the ultimate solution -- more of an incremental improvement.

2. Change the training mechanism to something more robust than SGD, such as the MDL principle. This would pretty much require moving away from deep learning (curve fitting) altogether and embracing discrete program search instead (which I have advocated for many years as a way to tackle reasoning problems...)
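For a rough sense of what option 2 can look like in practice, here is a minimal sketch of discrete program search with an MDL-style preference for short programs (a hypothetical toy DSL made up for illustration, not a proposal for the real thing):

```python
from itertools import product

# Hypothetical toy DSL: unary integer functions built by composing primitives.
PRIMITIVES = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "neg":    lambda x: -x,
}

def run(program, x):
    for name in program:          # apply primitives left to right
        x = PRIMITIVES[name](x)
    return x

def mdl_search(examples, max_len=4):
    """Return the shortest program consistent with the (input, output) examples.

    Enumerating by increasing length is a crude stand-in for the MDL principle:
    among programs that explain the data exactly, prefer the shortest description.
    """
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

# f(x) = (2x + 1)^2, described by three examples:
examples = [(1, 9), (2, 25), (3, 49)]
print(mdl_search(examples))  # ('double', 'inc', 'square')
```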

More from @fchollet

May 14
It's amazing to me that the year is 2024 and some people still equate task-specific skill and intelligence. There is *no* specific task that cannot be solved *without* intelligence -- all you need is a sufficiently complete description of the task (removing all test-time novelty and uncertainty), and you can achieve arbitrary levels of skill while entirely bypassing the problem of intelligence. In the limit, even a simple hashtable can be superhuman at anything.
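To make the hashtable remark concrete, here is a minimal sketch with a made-up toy task: once the task description is complete enough to enumerate every case, a lookup table reaches perfect skill with zero intelligence involved.

```python
# Hypothetical toy "task": multiply two integers in the range [0, 99].
# A complete task description lets us precompute every answer up front;
# the resulting table has perfect skill and no ability to handle anything
# outside its keys.
lookup = {(a, b): a * b for a in range(100) for b in range(100)}

def solve(a, b):
    return lookup[(a, b)]      # no reasoning at test time, just retrieval

print(solve(37, 52))           # 1924
print(solve(99, 99))           # 9801
# solve(123, 4) would raise KeyError: novelty is exactly what the table can't handle.
```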
The "AI" of today still has near-zero (though not exactly zero) intelligence, despite achieving superhuman skill at many tasks.

Here's one thing that AI won't be able to do within five years (if you extrapolate from the excruciatingly slow progress of the past 15 years): acquiring new skills as efficiently as humans, using the same data. The ARC benchmark is an attempt at measuring roughly that.
The point of general intelligence is to make it possible to deal with novelty and uncertainty, which is what our lives are made of. Intelligence is the ability to improvise and adapt in the face of situations you weren't prepared for (either by your evolutionary history or by your past experience) -- to efficiently acquire skills at novel tasks, on the fly.
Apr 28
Many of the people who are concerned with falling birthrates aren't willing to consider the set of policies that would address the problem -- aggressive tax breaks for families, free daycare, free education, free healthcare, and building more/denser housing to slash the price of homes.

Most people want children, but can't afford them.
I've always found it striking how very rich couples (50M+ net worth) all tend to have over 3 children (and often many more). And how young women always say they want children -- yet in practice they delay family building because they are forced to focus on financial stability and therefore career. When money is no object, families have 3+ children.
For middle incomes (below 1M/year) fertility goes down as income goes up, because *the cost of raising children increases with income* due to *opportunity cost*. If you make $150k and stand to eventually grow to $300k, you are losing a lot of money by quitting your job to raise children (on top of the prohibitive cost of raising children -- which also goes up as your incomes and thus standards go up). You are thus *more* likely to postpone having children.

Starting at 1M/year, fertility rates rise again. And couples that make 5+M/year get to have the number of children they actually want -- which is almost always more than 3, and quite often 5+.
Mar 31
Memorization (which ML has solely focused on) is not intelligence. And because any task that does not involve significant novelty and uncertainty can be solved via memorization, *skill* is never a sign of intelligence, no matter the task.
Intelligence is found in the ability to pick up new skills quickly & efficiently -- at tasks you weren't prepared for. To improvise, adapt and learn.
Here's a paper you can read about it.

It introduced a formal definition of intelligence, as well as a benchmark to capture that definition in practical terms. Although it was developed before the rise of LLMs, current state-of-the-art LLMs such as Gemini Ultra, Claude 3, or GPT-4 are not able to score higher than a few percent on that benchmark. arxiv.org/abs/1911.01547
Mar 13
We benchmarked a range of popular models (SegmentAnything, BERT, StableDiffusion, Gemma, Mistral) with all Keras 3 backends (JAX/TF/PT). Key findings:

1. There's no "best" backend. The fastest backend often depends on your specific model architecture.

2. Keras 3 with the right backend is consistently a lot faster than reference PT (compiled) implementations. Often by 150%+.

3. Keras 3 models are fast without requiring any custom performance optimizations. It's all "stock" code.

4. Keras 3 is faster than Keras 2.

Details here: keras.io/getting_starte…
Finding 1: the fastest backend for a given model typically alternates between XLA-compiled JAX and XLA-compiled TF. Plus, you might want to debug/prototype in PT before training/inferencing with JAX or TF.

The ability to write framework-agnostic models and pick your backend later is a game-changer.
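A minimal sketch of what that looks like in practice (the model here is a placeholder; the point is that the backend is chosen via an environment variable before Keras is imported, and the model code itself doesn't change):

```python
import os
# Pick the backend before importing Keras: "jax", "tensorflow", or "torch".
os.environ["KERAS_BACKEND"] = "jax"

import keras
from keras import layers

# The same model definition runs unchanged on any of the three backends.
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, ...)  # prototype on torch, train on jax/tf, etc.
```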
Finding 2: Keras 3 with the best-performing backend outperforms reference native PT implementations (compiled) for all models we tried.

Notably, 5 out of 10 tasks demonstrate speedups exceeding 100%, with a maximum speedup of 340%.

If you're not leveraging this advantage for any large model training run, you're wasting GPU time -- and thus throwing away money.
Mar 12
It doesn't take a whole lot of pondering to figure out that the thesis "humans only seem smart because they're 'trained' on huge amounts of 'data' via their visual system (almost like LLMs!)" doesn't hold any water.

For instance -- congenitally blind people are not less intelligent. Vision isn't fundamental to what makes us human. A rich learning environment is still a rich learning environment when apprehended through restricted sensorimotor modalities.
Humans span an incredibly wide range of sensorimotor affordances. Some are blind, some are deaf, some don't have hands. They might grow up in radically different environments -- some with just three other humans around them, some with thousands. Some with libraries of books, some without any writing.

In the end, though, it doesn't make a huge difference -- all of them become fully-fledged, intelligent humans. Because no matter what, they're all extracting information from the world at a roughly constant rate: the intrinsic rate at which the brain processes information. Which is an infinitesimal fraction of the bandwidth of the human sensorimotor feed.

If your senses are missing something, you'll just redirect your fixed-rate attention to something else, and won't be much poorer for it.
That's also why the influence of genes on fluid intelligence is overwhelmingly greater than that of the environment. If "training data" was so important, you'd expect environment and education to be critical to intelligence. They aren't. Twins raised in vastly different situations end up about as smart.
Feb 21
Thread: quick API overview of Gemma, the new open-source LLM by Google.

First, let's make sure you have the latest Keras and KerasNLP installed, and let's set up your Kaggle credentials, so you can download the assets from Kaggle.
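The code from the original screenshot isn't recoverable here; the following is a minimal sketch of that step. The package names and Kaggle environment variables are the standard ones, but treat the details as assumptions and check the Keras docs for current instructions.

```python
# In a shell, grab the latest releases first:
#   pip install -U keras keras-nlp

import os

# Kaggle credentials so KerasNLP can download the Gemma assets from Kaggle.
# Create an API token at kaggle.com/settings and use your own values here.
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"   # placeholder
os.environ["KAGGLE_KEY"] = "your_kaggle_api_key"         # placeholder
```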
Next, let's instantiate the model and generate some text. You have access to 2 different sizes, 2B & 7B, and 2 different versions per size: base & instruction-tuned.

The first call will download the weights.
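The screenshot is likewise missing; here is a minimal sketch of this step with KerasNLP (the preset names follow the pattern used at the Gemma launch -- `gemma_2b_en`, `gemma_7b_en`, and the `gemma_instruct_*_en` variants -- but verify against the current KerasNLP docs):

```python
import keras_nlp

# Instantiate the 2B base model from its Kaggle preset; the first call
# downloads the weights.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Generate some text from a prompt.
print(gemma_lm.generate("The fall of the Roman Empire was caused by", max_length=64))
```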
I generally recommend running inference in float16 or bfloat16 (depending on the hardware you're using). You can either globally configure the dtype policy in Keras (do it before creating the model), or pass the `dtype` argument to your model.

Note that operations like softmax will use float32 regardless, for stability.
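A sketch of the two options described above, assuming Keras 3's `keras.config` API (argument names worth double-checking against the docs):

```python
import keras
import keras_nlp

# Option 1: set the global dtype policy *before* creating the model.
keras.config.set_dtype_policy("bfloat16")
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Option 2: pass the dtype directly when instantiating the model.
# gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en", dtype="float16")
```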
