Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)
AI researcher & teacher @SCAI_ASU. Works on Human-Aware AI. Former President of @RealAAAI; Chair of @AAAS Sec T. Here to tweach #AI. YouTube Ch: https://t.co/4beUPOmMW6
Sep 23, 2024 10 tweets 5 min read
A research note describing our evaluation of the planning capabilities of o1 🍓 is now on @arxiv (thanks to @karthikv792 & @kayastechly). As promised, here is a summary (..although you should read the whole thing..) 🧵 1/ arxiv.org/abs/2409.13373

Following OpenAI's own statements, as well as our own understanding of what o1 is doing 👇, we treat o1 as an LRM that is fundamentally different from all the LLMs that preceded it (e.g., the RL on CoT moves, and the costly inference stage). 2/

Sep 12, 2024 4 tweets 5 min read
My (pure) speculation about what OpenAI o1 might be doing

[Caveat: I don't know anything more about the internal workings of o1 than the handful of lines about what they are actually doing in that blog post--and on the face of it, it is not more informative than "It uses Python er.. RL".. But here is what I told my students as one possible way it might be working]

There are two things mentioned in the writeup--RL and "private CoT". So imagine you are trying to transplant a "generalized AlphaGo"--let's call it GPTGo--onto the underlying LLM token-prediction substrate.

To do this, you need to know

(1) What are the GPTGo moves? (For AlphaGo, we had Go moves.) What would be the right moves when the task is just "expand the prompt"..?

(2) Where is it getting its external success/failure signal from? For AlphaGo, we had simulators/verifiers giving the success/failure signal. The most interesting question in glomming the self-play idea onto a general AI agent is where it gets this signal. (See e.g. x.com/rao2z/status/1… )

My guess is that the moves are auto-generated CoTs (thus the moves have a very high branching factor). Let's assume--for simplification--that we have a CoT-generating LLM that generates these CoTs conditioned on the prompt.

The success signal comes from training data with correct answers. When the expanded prompt seems to contain the correct answer (presumably LLM-judged?), it is a success; if not, a failure.

The RL task is: given the original problem prompt, generate and select a CoT, and use it to continue extending the prompt (possibly generating subgoal CoTs after every few stages). Get the final success/failure signal for the example (for which you do have the answer).

Loop on a gazillion training examples with answers, multiple times per example. [The training examples with answers can come either from benchmarks, or from synthetic data with problems and their solutions--using external solvers; see x.com/rao2z/status/1…]

Let RL do its thing to figure out credit/blame assignment for the CoTs that were used in that example. Incorporate this RL backup signal into the CoT generator's weights (?).
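To make the guess concrete, here is a deliberately toy sketch of the loop I am imagining--canned string "moves" standing in for auto-generated CoTs, and vanilla REINFORCE standing in for whatever RL machinery they actually use. All the names here (MOVES, train_episode, the containment reward) are illustrative inventions, not anything from the o1 report:

```python
import math
import random

# Toy "move generator": a softmax over canned CoT snippets.
MOVES = ["take the last letter of each word",
         "concatenate those letters",
         "the answer is 'eg'",
         "count the vowels instead"]
logits = [0.0] * len(MOVES)

def softmax(ls):
    exps = [math.exp(x) for x in ls]
    z = sum(exps)
    return [e / z for e in exps]

def sample_move():
    return random.choices(range(len(MOVES)), weights=softmax(logits))[0]

def train_episode(problem, gold, steps=3, lr=0.3):
    prompt, chosen = problem, []
    for _ in range(steps):
        m = sample_move()
        chosen.append(m)
        prompt += " " + MOVES[m]          # a "move" = extend the prompt with a CoT
    r = 1.0 if gold in prompt else 0.0    # success: expanded prompt contains the answer
    # REINFORCE-style credit/blame assignment over the moves used this episode
    for m in chosen:
        probs = softmax(logits)
        for i in range(len(logits)):
            logits[i] += lr * r * ((1.0 if i == m else 0.0) - probs[i])
    return r

random.seed(0)
for _ in range(500):   # "loop on a gazillion training examples with answers"
    train_episode("last letters of 'machine learning'?", "'eg'")
print(MOVES[max(range(len(MOVES)), key=lambda i: logits[i])])  # reward-bearing move wins
```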

During the inference stage, you can basically do rollouts (a la the original AlphaGo) to further improve the effectiveness of the moves ("internal CoTs"). The deeper the rollouts, the longer the inference time.

My guess is that what o1 is printing as a summary is just a summary of the "winning path" (according to it)--rather than the full rollout tree.
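In toy code (again, pure guesswork on my part--the generator and judge below are made-up stand-ins), the inference-time story would be something like:

```python
import random

def rollout_infer(prompt, generate_cot, judge, n=8):
    # Sample n full CoT continuations ("rollouts"); keep only the
    # best-scoring path. More rollouts -> better moves, longer wait.
    candidates = [generate_cot(prompt) for _ in range(n)]
    return max(candidates, key=judge)   # the "winning path" that gets summarized

# Toy stand-ins so this runs:
cots = ["try 7... no", "21*2 = 42, so answer=42", "unsure"]
print(rollout_infer("what is 21*2?",
                    lambda p: random.choice(cots),
                    lambda c: 1.0 if "answer=42" in c else 0.0))
```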

===
Assuming I am on the right track here in guessing what o1 is doing, a couple of corollaries:

1. This can at least be better than just fine-tuning on the synthetic data (again see x.com/rao2z/status/1…)--we are getting more leverage out of the data by learning move (auto-CoT) generators. [Think behavior cloning vs. RL..]

2. There will still not be any guarantees that the answers provided are "correct"--they may be probabilistically a little more correct (subject to the training data). If you want guarantees, you will still need some sort of LLM-Modulo approach even on top of this (c.f. arxiv.org/abs/2402.01817).

3. It is certainly not clear that anyone will be willing to wait for long periods of time during inference (it is already painful to wait 10 sec for a 10-word last-letter concatenation!). See x.com/rao2z/status/1…

The kind of people who would wait for longer periods would certainly want guarantees--and there are deep-and-narrow System 2's aplenty that can be used for many such cases.

4. There is a bit of a Ship of Theseus feel to calling o1 an LLM--considering how far it is from the other LLM models (all of which essentially have teacher-forced training and sub-real-time next-token prediction). That said, this is certainly an interesting way to build a generalized System 2-ish component on top of LLM substrates--but without guarantees. I think we will need to understand how this would combine with other efforts to get System 2 behavior--including LLM-Modulo (arxiv.org/abs/2402.01817), which gives guarantees for specific classes; see the sketch below.
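For concreteness, here is a bare-bones sketch of the LLM-Modulo shape--the generate and verify callables are toy stand-ins, and the real framework in the paper has more moving parts (critics, reformulators, etc.):

```python
def llm_modulo(problem, generate, verify, budget=10):
    # Candidates come from the LLM; guarantees come from the sound
    # external verifier. generate/verify are hypothetical stand-ins.
    feedback = ""
    for _ in range(budget):
        candidate = generate(problem, feedback)
        ok, critique = verify(candidate)
        if ok:
            return candidate      # verified -> guaranteed for this class
        feedback = critique       # back-prompt with the verifier's critique
    return None                   # budget exhausted, no guarantee offered

# Toy stand-ins so the loop runs:
guesses = iter(["3+4=8", "3+4=7"])
print(llm_modulo("3+4?", lambda p, f: next(guesses),
                 lambda c: (c.endswith("=7"), "arithmetic error")))
```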

to be contd..

Once you are an approximate reasoner, you might develop the "don't tell me how to solve the problem; I already have a way I use to solve the problem" complex..👇

Oct 21, 2023 14 tweets 5 min read
Can LLMs really self-critique (and iteratively improve) their solutions, as claimed in the literature?🤔

Two new papers from our group investigate (and call into question) these claims in reasoning (arxiv.org/abs/2310.12397) and planning (arxiv.org/abs/2310.08118) tasks. 🧵 1/

One paper, led by @kayastechly (w/ @mattdmarq), evaluated the claims over a suite of graph coloring problems. The setup has GPT-4 guess a valid coloring in both standalone and self-critiquing modes. There is an external sound verifier outside the self-critiquing loop. 2/
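A sound verifier for this task is a few lines of code, which is exactly what makes graph coloring a clean test bed; my own illustrative version (not the paper's code):

```python
def verify_coloring(edges, coloring, k):
    # Soundly check a proposed coloring: at most k colors, no monochromatic edge.
    if len(set(coloring.values())) > k:
        return False, "too many colors"
    for u, v in edges:
        if coloring[u] == coloring[v]:
            return False, f"vertices {u} and {v} share color {coloring[u]}"
    return True, "valid coloring"

# A triangle needs 3 colors, so this 2-color guess fails on some edge:
print(verify_coloring([(0, 1), (1, 2), (0, 2)], {0: "r", 1: "g", 2: "r"}, 3))
```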
Jul 19, 2023 8 tweets 3 min read
It is hilarious that LLMs are making traditional symbolic #AI relevant, and in the process merrily exposing the ignorance of the post-AlexNet yung'uns who skipped their Intro #AI's to do MORE LAYERS, only to find themselves busy with ersatz natural science with LLMs. 🧵 1/

Without background in combinatorial search and logical inference, you are susceptible to mistaking brute-force search (or forest-of-jumbled-thoughts prompting) for something to be proud of, instead of seeing it for its "Rube Goldberg" silliness.. 2/

Jun 8, 2023 8 tweets 3 min read
[Paradoxes of Approximate Omniscience:] 🧵We all know, by now, that our intuitions *suck* at high dimensions.

We haven't yet come to grips with the fact that our intuitions about *approximate omniscience* suck too!

(And this explains some of our puzzlement at LLMs)
1/
Re: high-D, we have all been surprised by the core-less apples. George Dantzig famously explained the surprising efficiency of the (worst-case exponential) Simplex algorithm with a pithy "One's intuition in higher dimensional space is not worth a damn!" 2/
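A one-line computation shows just how core-less high-D apples are: the fraction of a d-dimensional ball's volume lying in the outer 5% shell is 1 - 0.95^d, so virtually everything is peel once d is large:

```python
# Fraction of a d-dimensional ball's volume within 5% of its surface: 1 - 0.95**d
for d in (3, 10, 100, 1000):
    print(d, 1 - 0.95 ** d)   # d=3: ~0.14; d=100: ~0.994; d=1000: ~1.0
```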

May 15, 2023 10 tweets 3 min read
Planning & LLMs: A(nother) 🧵

Making plans in the world involves (1) discovering actions (and their precondition/effect causal dependencies), and (2) sequencing an appropriate subset of available/discovered actions to achieve the agent's goals. 1/

The former requires *broad knowledge* about actions available in the world and their individual effects, while the latter requires deep drilling-down over a given set of actions to ensure that all goals are supported (causal chaining) without any undesirable interactions. 2/
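The drilling-down part is what a plan validator checks mechanically. A minimal sketch with STRIPS-style add/delete effects (the names and the one-block domain are mine, for illustration):

```python
def validate_plan(state, goal, plan, actions):
    # Simulate the plan: every step's preconditions must hold when it runs
    # (causal chaining), and the goal must hold at the end.
    state = set(state)
    for name in plan:
        pre, add, dele = actions[name]
        if not pre <= state:
            return False, f"{name}: unmet preconditions {pre - state}"
        state = (state - dele) | add
    return goal <= state, "goal check"

actions = {  # illustrative one-block Blocksworld fragment
    "pickup(A)": ({"clear(A)", "handempty"}, {"holding(A)"},
                  {"clear(A)", "handempty", "ontable(A)"}),
    "putdown(A)": ({"holding(A)"}, {"ontable(A)", "clear(A)", "handempty"},
                   {"holding(A)"}),
}
print(validate_plan({"clear(A)", "ontable(A)", "handempty"},
                    {"ontable(A)"}, ["pickup(A)", "putdown(A)"], actions))
```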
Apr 20, 2023 12 tweets 5 min read
🧵Been reading several recent arXiv entries claiming planning capabilities of #LLM's. This area is so full of anthropomorphisms--"Chain of Thought Prompting", "Inner Monolog"--that it cries out for a cleansing read of Drew's #AI meets Natural Stupidity 1/
One popular line claims that while LLM's may give wrong plans, they can improve with the right prompting (which, in one case, is claimed to even induce an "inner monolog", all Westworld host-like).

The prompts in the appendices, however, seem to suggest a Clever Hans effect in action. 2/
Apr 20, 2023 5 tweets 3 min read
So @TheEconomist tells me now that #LLMs can do planning and reasoning after all. Obviously our own dismal experience of their planning performance (c.f. the 🧵 at ) must be a clear outlier.. 🙄 Thank goodness I pay big bucks for my subscription.. 1/

Interestingly, I was just telling someone today how several of the papers on "LLMs for Task Planning by Prompting" are rife with the Clever Hans effect (c.f. en.wikipedia.org/wiki/Clever_Ha… ). I guess I will have to do a thread.. 2/
Apr 5, 2023 10 tweets 4 min read
Afraid of #GPT4 going rogue and killing y'all? Worry not. Planning has got your back. You can ask it to solve any simple few step classical planning problem and snuff that "AGI spark" well and good.

Let me explain.. 🧵 1/

Almost a year back, intrigued by the breathless "LLMs are Zero-Shot Reasoners" papers, we tested their ability to autonomously come up with simple plans given domain models. The results were *pretty bleak.*👇 2/
Dec 23, 2022 5 tweets 2 min read
In bemoaning how things are getting worse every day, we often tend to forget that the state of the world is becoming monotonically more observable. 1/

It may not be so much that there is monotonically increasing suffering in this world, but that it is monotonically more observable--we can be aware of it, if we choose to. 2/
Jul 29, 2022 10 tweets 4 min read
The impressive deep pattern-recognition abilities of #DNN's such as #LLM's are sometimes confused for reasoning abilities.

I can learn to guess, with high accuracy, whether a SAT instance is satisfiable or not, but this is not the same as knowing how to solve SAT. Let me explain. 1/

Suppose you train a learner with a large number of Boolean 3-SAT instances labeled with whether or not they are satisfiable. There is no reason to doubt that a modern #DNN-based learner will manage to learn deep features corresponding to the γ ratio--#clauses/#variables.. 2/
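To see why such a shallow feature gets you surprisingly far: random 3-SAT instances well below the ~4.27 clauses-to-variables phase transition are almost all satisfiable, and those well above it almost all unsatisfiable--so thresholding γ alone guesses accurately without doing anything resembling solving. An illustrative sketch:

```python
def gamma(clauses, num_vars):
    # The shallow feature: gamma = #clauses / #variables
    return len(clauses) / num_vars

def guess_sat(clauses, num_vars, threshold=4.27):
    # Threshold at the (empirical) random 3-SAT phase transition:
    # no search, no proof certificate, just one statistic.
    return gamma(clauses, num_vars) < threshold

# 20 variables, 40 clauses -> gamma = 2.0: almost surely SAT if random
print(guess_sat([(1, -2, 3)] * 40, 20))   # True -- a guess, not a solution
```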
Jul 11, 2022 14 tweets 5 min read
There seems to be an almost willful confusion about the need and role for explainability of #AI systems on #AI twitter.

Contrary to the often polarizing positions, it is neither the case that we always need explanations nor is it the case that we never need explanations. 🧵 1/

We look for explanations of high-level decisions of (what for us are) explicit-knowledge tasks, and where contestability and collaboration are important.

We rarely look for explanations of tacit knowledge/low level control decisions. 2/
Jun 22, 2022 7 tweets 4 min read
Intrigued by the profusion of 'em "#LLM's are Zero-shot <XXX>'s" papers, we set out to see how good LLMs are at planning and reasoning about change.

tldr; off-the-shelf #GPT3 is pretty bad at these..

👉arxiv.org/abs/2206.10498

(w/ @karthikv792 @sarath_ssreedh & @_aolmo_) 1/

Our benchmark tasks (prompts) are posed in the context of common "toy domains" used in automated planning, and are small enough to not involve any huge combinatorics. In particular, they should be accessible to lay humans. 2/
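For a flavor of the setup, here is an illustrative rendering of a tiny Blocksworld instance as a lay-readable prompt (my paraphrase--the actual prompt formats are in the paper):

```python
def blocksworld_prompt(stacks, goal_stacks):
    # Render a tiny Blocksworld instance as a natural-language planning prompt.
    def describe(ss):
        return "; ".join(f"block {s[0]} is on the table"
                         + "".join(f", {b} is on {a}" for a, b in zip(s, s[1:]))
                         for s in ss)
    return (f"Initial state: {describe(stacks)}. "
            f"Goal: {describe(goal_stacks)}. "
            "Using pick-up, put-down, stack and unstack, give a plan.")

print(blocksworld_prompt([("A", "B"), ("C",)], [("C", "B", "A")]))
```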