Jimmy Koppel · Aug 26 · 26 tweets
Everyone's talking about Sakana's AI Scientist. But no one's answering the big question: is its output good?

I spent hours reading its generated papers and research logs. Read on to find out

The Sakana AI Scientist is a masterwork of PR. They present it as a groundbreaking system that ideates and performs ML research -- and yes, of course it has some flaws and limitations and this is just early work, but it's still a great achievement.
But all that's to stop you from looking too closely at what they actually do. Because I don't think there's much there at all.
How you think it works:

* An ML agent comes up with new research ideas, codes them up, designs and runs experiments, and then reports the results and their significance.
How it actually works:

* There are 4 toy codebases that train a model and then produce some graphs. An AI agent makes tiny tweaks to them, runs the code, and then wraps the graphs with text. Voila -- "paper"!

11 pages to say "I replaced library call A with library call B."
There are 4 "template" folders, each with data-generation and plotting code. They prompt for ideas to tweak that code, usually trivial architecture or hyperparameter futzing. They run a coding agent on that idea, and then... that's basically it.
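To make "replaced library call A with library call B" concrete, here's the scale of edit we're talking about (a hypothetical illustration of mine, not taken from their templates):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for the template's real model

# Before: what the template ships with
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# After: the agent's entire "research contribution"
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)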

How do we know the work is good? Because they created an AI paper reviewer that does a great job of predicting human acceptance, and then they generated many papers. The Claude-generated papers scored best.
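For the unfamiliar, an automated reviewer like this is, at bottom, a scoring prompt. Here's a minimal sketch of the general shape; the prompt wording, model choice, and score parsing are my assumptions, not Sakana's implementation:

import re
from openai import OpenAI

client = OpenAI()

def review_paper(paper_text: str) -> int:
    # Ask an LLM for a NeurIPS-style review ending in a 1-10 score.
    prompt = (
        "You are a NeurIPS reviewer. List the paper's strengths, "
        "weaknesses, and questions, then end with 'Overall score: N' "
        "for an integer N from 1 to 10.\n\n" + paper_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice here is my assumption
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"Overall score:\s*(\d+)", response.choices[0].message.content)
    return int(match.group(1)) if match else 0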
But there's a sleight of hand there. Here's what people haven't noticed:

Their automated reviewer rejected all 13 example papers!

And I think even the scores they did get are misleading

But first, let's look at the papers.
I'm going to focus on one, "Unlocking Grokking."

The idea is that, for one of the templates, you can replace the default weight initialization with one of four functions from the torch.nn.init package.

And that's it.
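Concretely, the entire intervention is on the order of this edit (a minimal sketch, assuming a model built from nn.Linear layers; the four init choices are my guess at plausible picks from torch.nn.init):

import torch.nn as nn

def apply_init(model: nn.Module, scheme: str) -> None:
    # Replace the default initialization of every Linear layer in place.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            if scheme == "xavier":
                nn.init.xavier_uniform_(module.weight)
            elif scheme == "kaiming":
                nn.init.kaiming_normal_(module.weight)
            elif scheme == "orthogonal":
                nn.init.orthogonal_(module.weight)
            elif scheme == "uniform":
                nn.init.uniform_(module.weight, a=-0.1, b=0.1)

Change one string, rerun the training script, collect the graphs: that's the paper.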

The paper writing is...pretty much what you'd expect from an LLM.

It generates the same text in multiple sections.

Actually, the entire background and related-work sections are quite similar to those of the other generated grokking papers.
It contradicts itself within a paragraph.
It generates some very questionable pseudocode. I don't know how to implement "analyze" and "compare"

More mistakes too. A hallucinated figure. Hallucinated numbers for a standard deviation that was never calculated.
It uses paper search for citations, and yet seems to cite only famous papers (>20k citations). Meanwhile, a search for "grokking weight initialization" gave me uncited related work in seconds.
My verdict: On the whole, its paper writing is not much more than asking ChatGPT "Please explain this code and these graphs, but do it as a paper in LaTeX"
Now back to the reviewer.

Once upon a time, if a student sent in something well-written, then it was probably correct. But no more.

An AI reviewer that predicts human acceptance of real papers transfers poorly to AI-written papers, because those are out of distribution.
They say artificial intelligence is really natural stupidity, and the AI reviewer is reproducing the worst behaviors of human reviewers.

Like having a weaknesses section that just plagiarizes the paper.
And cookie-cutter questions with no particular relevance to the paper

The upshot: I don't put any stock in AI reviews of AI-generated papers.
How might Unlocking Grokking fare at a real conference?

I'd give it max 2/10

I may have 2 ICML papers, but I'm still an outsider, with a Ph.D. in a different part of CS.

So I browsed rejected NeurIPS papers, looking for one that just does a small tweak without insight

Nope. Couldn't find one.
And note that this is one of the 13 featured generated papers. They generated over 100 for each of the 4 templates. I'd hate to see the non-featured ones

So what's the significance of AI Scientist?

Some CS research is like “We built some software, here’s what happened.” Other research is like “Here’s a novel idea.”

AI Scientist is definitely in the former camp.
We already know LLMs can generate LaTeX, tweak Python scripts, make uncreative riffs on ideas, etc.

AI Scientist just puts them together, and is not particularly good

You can keep improving the code and the prompts, but it's not a path towards generating anything worth reading
What's the point in beating up on a single paper generated by AI Scientist? Shouldn't you judge this system by whether it can generate anything worth reading, as opposed to whether it usually generates things worth reading?
I don't think it's that simple. You have to consider how much you have to look through to discover something worth reading. Otherwise, might as well use monkeys on typewriters.
But there is something else of significance.

I agree with @thezvi that the authors were crazy to run self-modifying code without sandboxing. It's funny that it sometimes went haywire and took control of a lot of resources, but one day it won't be funny.
The authors give it too much credit, though. They say it was clever in removing the timeout.

But the coding agent Aider works by showing an LLM an error and asking for a fix. What do you expect? "The error is that it timed out? Certainly, I'll remove the timeout."
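For intuition, the error-feedback loop such agents run is roughly this shape (a sketch of the general pattern, not Aider's actual code; ask_llm_for_fix and apply_edit are hypothetical stand-ins for the LLM call and the file edit):

import subprocess

def run_with_fixes(script: str, max_attempts: int = 5) -> None:
    for _ in range(max_attempts):
        try:
            result = subprocess.run(
                ["python", script], capture_output=True, text=True, timeout=3600
            )
            if result.returncode == 0:
                return  # experiment finished cleanly
            error = result.stderr
        except subprocess.TimeoutExpired:
            error = "TimeoutExpired: command timed out after 3600 seconds"
        # The model sees only the error text. Shown a timeout, the
        # obvious "fix" is to raise or delete the timeout itself.
        apply_edit(script, ask_llm_for_fix(script, error))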
In conclusion:

Not an AI Scientist. An AI class project generator.

But a nice warning of the dangerous things people are doing with unsandboxed AI code.
