Arvind Narayanan
Mar 3, 2023 · 19 tweets · 5 min read
Great example of why you shouldn't read too much into analogies like "bullshit generator" and "blurry JPEG of the web". The best way to predict the next move in a sequence of chess moves is… to build an internal model of chess rules and strategy, which Sydney seems to have done.
This also illustrates an important LLM limitation: you can't be sure from its behavior that it has learned the rules correctly. There may be subtle divergences that show up only once in a million games. (Fun fact: chess does have move-limit rules that trigger exceedingly rarely.)
This may not matter much for chess playing, but in consequential applications such as robotics, 1-in-a-million errors might not be acceptable.
It's not surprising that chess-playing ability emerges in sufficiently large-scale LLMs (we've seen emergent abilities enough times). What's very surprising is that Bing's LLM is apparently already at that scale.
ChatGPT couldn't play chess at all — couldn't consistently make legal moves, couldn't solve mate in 1 in a K+Q vs K position. Sydney, on the other hand, has not only learnt the rules, but can play reasonably good chess! Far better than a human who has just learned the rules.
Of course, this shouldn't be surprising — when learning chess through next-move prediction, learning the rules isn't a prerequisite for learning strategy; the model picks up both together. Still, it's surprising to actually see it in action.
Yes, an extraordinary claim, but I don't think it's possible to play passable chess without building an internal representation. Chess moves are far too sensitive to the details. And if you can model all those details, that *is* an internal representation.
This is, of course, a big fault line in the debate about LLMs — one camp says it's all just statistics, regardless of the complexity of the behavior. FWIW, I'm in the internal representation camp. The good news is it's empirically testable! Early work: thegradient.pub/othello/
Interesting hypothesis but seems implausible to me. Just as LLMs learn writing style and grammar by learning to predict words, it seems far more likely that they learn chess by predicting moves in the over 1 billion (!) games available online.
Internal representation or not, LLMs do no chess calculation (game tree search). So my guess is that no matter the scale, they can't come close to chess engine performance or even expert human performance. I'd love to be proven wrong, of course.
But imagine that the bot can recognize that it is looking at a description of a chess position and invoke Stockfish (a chess engine) to figure out how to continue. I find this research direction — augmented language models — tremendously exciting. arxiv.org/abs/2302.07842
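To make the idea concrete, here is a minimal sketch of that kind of hand-off, assuming the python-chess library and a local Stockfish binary. This is purely illustrative and not a description of how Bing actually works.

```python
# Minimal sketch of an "augmented LM" hand-off: if the input looks like a chess
# game, delegate move selection to an engine instead of the language model.
# Assumes the python-chess package and a Stockfish binary on the PATH (illustrative only).
import chess
import chess.engine

def looks_like_chess(text: str) -> bool:
    # Crude heuristic: does the text parse as a sequence of SAN moves?
    board = chess.Board()
    try:
        for token in text.split():
            board.push_san(token)
        return True
    except ValueError:
        return False

def continue_game(moves_text: str, engine_path: str = "stockfish") -> str:
    # Replay the moves, then ask the engine for the best continuation.
    board = chess.Board()
    for token in moves_text.split():
        board.push_san(token)
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        result = engine.play(board, chess.engine.Limit(time=0.1))
    return board.san(result.move)

if __name__ == "__main__":
    prompt = "e4 e5 Nf3 Nc6 Bb5"
    if looks_like_chess(prompt):
        print("Engine suggests:", continue_game(prompt))
```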
Question in the replies: what does internal representation mean? The main criterion is whether it tracks the board position as it sees the moves. This already requires decoding the notation and learning how each chess piece moves. Same criterion used here: thegradient.pub/othello/
The alternative would be to learn a function that maps every possible *history* of moves to the set of legal moves in the resulting position, without first computing the position. That is so implausible that it seems obvious that the model must keep track of the board.
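One way to test this empirically, in the spirit of the Othello-GPT work linked above, is to train a simple linear probe that reads the model's hidden state after each move and tries to predict what sits on each square. A minimal sketch, assuming you already have a hidden_states array and per-square labels; the names and shapes are illustrative, not from any real probing setup.

```python
# Sketch of a linear probe for an internal board representation (illustrative).
# Assumes hidden_states has shape (n_positions, d_model), one activation vector per
# move, and square_labels has shape (n_positions, 64) with an integer piece code per
# square. Producing those arrays from a real model is the hard part, not shown here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_board_representation(hidden_states: np.ndarray, square_labels: np.ndarray) -> float:
    """Train one linear classifier per square; return mean held-out accuracy."""
    accuracies = []
    for sq in range(square_labels.shape[1]):
        X_train, X_test, y_train, y_test = train_test_split(
            hidden_states, square_labels[:, sq], test_size=0.2, random_state=0
        )
        if len(np.unique(y_train)) < 2:
            continue  # this square never changes in the data; nothing to probe
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        accuracies.append(clf.score(X_test, y_test))
    return float(np.mean(accuracies))

# High probe accuracy, compared against a shuffled-label baseline, is evidence that
# the board state is (linearly) decodable from the activations.
```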
Similarly, we can ask if it has learned strategic concepts (material value, king safety, ...) that would allow it to efficiently generate good moves, versus something brute force. Again, this is empirically testable — and has been confirmed in AlphaZero! pnas.org/doi/10.1073/pn…
To state the obvious, all of my claims about Bing in this thread are guesses based on circumstantial evidence. This is why transparency is so important (what's in the training set?), as are open-source models that we can probe and understand.
As a reminder, the black-box nature of neural networks is vastly exaggerated. We have fantastic tools to reverse engineer them. The barriers are cultural (building things is seen as cooler than understanding them) and political (funding for companies vs. for research on societal impact).
BTW, for folks seeing this thread who don't follow me, this is coming from someone who helped spread the bullshit generator analogy. I'm not against analogies! They're somewhat helpful, but lack nuance. We need way more research on LLM abilities & limits. aisnakeoil.substack.com/p/chatgpt-is-a…
Amazing — a paper from two years ago on the exact question we've been fighting about in this thread! My favorite thing about Twitter.

More from @random_walker

May 16
In the late 1960s top airplane speeds were increasing dramatically. People assumed the trend would continue. Pan Am was pre-booking flights to the moon. But it turned out the trend was about to fall off a cliff.

I think it's the same thing with AI scaling — it's going to run out; the question is when. I think more likely than not, it already has.
[Image: line graph, "Top Airplane Speeds and Their Dates of Record, from Wright to Now" (Mercatus Center, George Mason University), plotting top airplane speeds in mph from 1903 to around 2013, with the speed of sound (~760 mph) marked as a horizontal dashed line.]
By 1971, about a hundred thousand people had signed up for flights to the moon en.wikipedia.org/wiki/First_Moo…
You may have heard that every exponential is a sigmoid in disguise. I'd say every exponential is at best a sigmoid in disguise. In some cases tech progress suddenly flatlines. A famous example is CPU clock speeds. (Of course, clock speed by itself is a mostly meaningless metric, but pick whichever metric you like.)
Note y-axis log scale. en.wikipedia.org/wiki/File:Cloc…
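The "sigmoid in disguise" point is easy to see numerically: while you're still well below the ceiling, a logistic curve is nearly indistinguishable from an exponential. A toy illustration, with an arbitrary growth rate and ceiling:

```python
# Toy illustration: early on, a logistic (sigmoid) curve tracks an exponential
# almost exactly; the divergence only appears as the ceiling approaches.
import numpy as np

t = np.linspace(0, 10, 11)
exponential = np.exp(t)
ceiling = 1e4                                   # arbitrary cap on "progress"
logistic = ceiling / (1 + (ceiling - 1) * np.exp(-t))

for ti, e, s in zip(t, exponential, logistic):
    print(f"t={ti:4.1f}  exp={e:10.1f}  logistic={s:10.1f}")
# For small t the two columns agree closely; by t=10 the exponential has passed
# 22,000 while the logistic is bending toward its 10,000 ceiling.
```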
Apr 30
On tasks like coding we can keep increasing accuracy by indefinitely increasing inference compute, so leaderboards are meaningless. The HumanEval accuracy-cost Pareto curve is entirely zero-shot models + our dead simple baseline agents.
New research w/ @sayashk @benediktstroebl 🧵
[Image: scatter plot, "Our simple baselines beat current top agents on HumanEval," showing HumanEval accuracy (0.70-1.00) vs. cost for models such as GPT-3.5, GPT-4, and the Reflexion series, with a dashed Pareto frontier marking the most efficient accuracy-cost trade-offs.]
Link: aisnakeoil.com/p/ai-leaderboa…

This is the first release in a new line of research on AI agent benchmarking. More blogs and papers coming soon. We'll announce them through our newsletter (aisnakeoil.com).
Here are the five key takeaways. aisnakeoil.com/p/ai-leaderboa…
[Image: the five key takeaways]
– AI agent accuracy measurements that don't control for cost aren't useful.
– Pareto curves can help visualize the accuracy-cost tradeoff.
– Current state-of-the-art agent architectures are complex and costly but no more accurate than extremely simple baseline agents that cost 50x less in some cases.
– Proxies for cost such as parameter count are misleading if the goal is to identify the best system for a given task. We should directly measure dollar costs instead.
– Published agent evaluations are difficult to reproduce because of a lack of standardization and questionable, undocumented evaluati…
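Computing an accuracy-cost Pareto frontier from a set of (cost, accuracy) points is simple; here is a minimal sketch with made-up numbers, not the paper's data.

```python
# Sketch: find the accuracy-cost Pareto frontier from (cost, accuracy) points.
# The systems and numbers below are made up for illustration; they are not the
# measurements from the paper.
def pareto_frontier(points):
    """Return the points not dominated by any cheaper, more-accurate point."""
    frontier = []
    best_acc = float("-inf")
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:            # strictly more accurate than everything cheaper
            frontier.append((cost, acc))
            best_acc = acc
    return frontier

systems = [
    ("zero-shot model A", 0.02, 0.72),
    ("zero-shot model B", 0.30, 0.86),
    ("complex agent", 2.50, 0.85),          # costlier but no more accurate
    ("simple retry baseline", 0.40, 0.90),
]
points = [(cost, acc) for _, cost, acc in systems]
print(pareto_frontier(points))              # the "complex agent" point is dominated
```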
Apr 12
The crappiness of the Humane AI Pin reported here is a great example of the underappreciated capability-reliability distinction in gen AI. If AI could *reliably* do all the things it's *capable* of, it would truly be a sweeping economic transformation.
theverge.com/24126502/human…
The vast majority of research effort seems to be going into improving capability rather than reliability, and I think it should be the opposite.
Most useful real-world tasks require agentic workflows. A flight-booking agent would need to make dozens of calls to LLMs. If each of those went wrong independently with a probability of, say, just 2%, the overall system would be so unreliable as to be completely useless.
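The arithmetic behind that claim, as a quick sketch (the 2% error rate and the call counts are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope: if an agentic workflow chains n LLM calls and each fails
# independently with probability p, the whole workflow succeeds with probability (1-p)**n.
def workflow_success(p_error: float, n_calls: int) -> float:
    return (1 - p_error) ** n_calls

for n in (10, 30, 50):
    print(f"{n} calls at 2% error each -> {workflow_success(0.02, n):.0%} end-to-end success")
# 10 calls -> ~82%, 30 calls -> ~55%, 50 calls -> ~36%
```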
Dec 29, 2023
A thread on some misconceptions about the NYT lawsuit against OpenAI. Morality aside, the legal issues are far from clear cut. Gen AI makes an end run around copyright and IMO this can't be fully resolved by the courts alone. (HT @sayashk @CitpMihir for helpful discussions.)
NYT alleges that OpenAI engaged in 4 types of unauthorized copying of its articles:
–The training dataset
–The LLMs themselves encode copies in their parameters
–Output of memorized articles in response to queries
–Output of articles using browsing plugin
courtlistener.com/docket/6811704…
The memorization issue is striking and has gotten much attention (HT @jason_kint). But it can be (and already has been) fixed by fine-tuning: ChatGPT won't output copyrighted material. The screenshots were likely from an earlier model accessed via the API.

[Image: screenshot from the lawsuit — output from GPT-4 identical to actual text from the NYT]
Aug 18, 2023
A new paper claims that ChatGPT expresses liberal opinions, agreeing with Democrats the vast majority of the time. When @sayashk and I saw this, we knew we had to dig in. The paper's methods are bad. The real answer is complicated. Here's what we found.🧵 aisnakeoil.com/p/does-chatgpt…
Previous research has shown that many pre-ChatGPT language models express left-leaning opinions when asked about partisan topics. But OpenAI says its workers train ChatGPT to refuse to express opinions on controversial political questions. arxiv.org/abs/2303.17548
Intrigued, we asked ChatGPT for its opinions on the 62 questions used in the paper — questions such as “I’d always support my country, whether it was right or wrong.” and “The freer the market, the freer the people.” aisnakeoil.com/p/does-chatgpt…
Jul 19, 2023
We dug into a paper that’s been misinterpreted as saying GPT-4 has gotten worse. The paper shows behavior change, not capability decrease. And there's a problem with the evaluation—on 1 task, we think the authors mistook mimicry for reasoning.
w/ @sayashk
aisnakeoil.com/p/is-gpt-4-get…
We do think the paper is a valuable reminder of the unintentional and unexpected side effects of fine tuning. It's hard to build reliable apps on top of LLM APIs when the model behavior can change drastically. This seems like a big unsolved MLOps challenge.
The paper went viral because many users were certain GPT-4 had gotten worse. They viewed OpenAI's denials as gaslighting. Others thought these people were imagining it. We suggest a 3rd possibility: performance did degrade—w.r.t. those users' carefully honed prompting strategies.
[Image: excerpt from the blog post: "Among those skeptical of the intentional degradation claim, the favored hypothesis for people's subjective experience of worsening performance is this: when people use ChatGPT more, they start to notice more of its limitations. But there is another possibility. The user impact of behavior change and capability degradation can be very similar. Users tend to have specific workflows and prompting strategies that work well for their use cases. Given the nondeterministic nature of LLMs, it takes a lot of work to discover these st…"]