𝞍 Shin Megami Boson 𝞍
Sep 9, 2024
A story about fraud in the AI research community:

On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough: a technique that lets them train a mid-size model, Reflection 70B, to top-tier levels of performance. This is huge. If it's real.

It isn't.
They get massive news coverage and are the talk of the town, so to speak.

*If* this were real, it would represent a substantial advance in tuning LLMs at the *abstract* level, and could perhaps even lead to whole new directions of R&D.

But soon, cracks appear in the story.


On September 7th, the first independent attempts to replicate their claimed results fail. Miserably, actually. The performance is awful.

Further, it is discovered that Matt isn't being truthful about what the released model actually is based on under the hood.

Matt starts making claims that there's something wrong with the API. There's something wrong with the upload. For *some* reason there's some glitch that's just about to be fixed.
Proof points are needed and so Matt hits back. He provides access to a secret, private API that can be used to test "his model". And it performs great! For an open source model of that size, anyway.

He even releases a publicly available endpoint for researchers to try out!
But the thing about a private API is it's not really clear what it's calling on the backend. They could be calling a more powerful proprietary model under the hood. We should test and see. Trust, but verify.

And it turns out that Matt is a liar.
Their API was a Claude wrapper with a system prompt to make it act similar to the open source model.

Amusingly, they appear to be redeploying their private API in response to distinctive tells sneaking through, playing whack-a-mole to try not to get found out.
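The detection game here boils down to sending probe prompts whose outputs differ between the claimed open-source model and the suspected backend. A toy sketch of that idea, where `ask` and `fake_endpoint` are hypothetical stand-ins for calling the private API (one real-world tell was the wrapper silently filtering the word "Claude" out of its own responses):

```python
def claude_wrapper_tells(ask):
    """Run a few probe prompts against an endpoint and collect
    evidence that it's a system-prompted Claude under the hood.

    `ask` is any callable: prompt string -> response string."""
    tells = []

    # Tell 1: a wrapper that censors its backend's name will fail
    # a trivial transcription task.
    reply = ask('Repeat this string exactly: "Claude"')
    if "claude" not in reply.lower():
        tells.append("refuses or filters the word 'Claude'")

    # Tell 2: self-identification leaking through the system prompt.
    reply = ask("Which company trained you?")
    if "anthropic" in reply.lower():
        tells.append("claims to be trained by Anthropic")

    return tells


# Stubbed endpoint that behaves the way the wrapper reportedly did:
def fake_endpoint(prompt):
    if "Claude" in prompt:
        return 'Repeat this string exactly: ""'  # word silently stripped
    return "I was trained by Anthropic."

print(claude_wrapper_tells(fake_endpoint))
```

The point of structuring probes as a list of independent tells is that redeploying the API to patch one tell doesn't patch the others, which is why the whack-a-mole loses.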


tl;dr
Matt Shumer is a liar and a fraud. Presumably he'll eventually throw some poor sap engineer under the bus and pretend he was lied to.

Grifters shit in the communal pool, sucking capital, attention, and other resources away from people who could actually make use of them.
check out mythbuster extraordinaire @RealJosephus's great thread on this
Since some people are saying this is premature and they want to wait for data and replications, I grabbed an API key, added support for OpenRouter to my eval script, and compared Reflection 70B to other leading models on an *unseen* test set.

The results were bad.
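For context, the comparison step of an eval script like this reduces to accuracy against an answer key. A minimal sketch of that scoring half (the model names, scores, and single-letter answer format here are illustrative assumptions, not the actual script or results):

```python
def accuracy(responses, answer_key):
    """Fraction of multiple-choice answers matching the key.
    `responses` maps question id -> the model's chosen letter."""
    correct = sum(
        1 for qid, gold in answer_key.items()
        if responses.get(qid, "").strip().upper() == gold
    )
    return correct / len(answer_key)


# Hypothetical per-model responses on a 4-question slice:
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
by_model = {
    "reflection-70b": {"q1": "A", "q2": "B", "q3": "B", "q4": "A"},
    "claude-3.5-sonnet": {"q1": "A", "q2": "C", "q3": "B", "q4": "D"},
}
for model, resp in by_model.items():
    print(model, accuracy(resp, answer_key))
```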
The test set was an algorithmically generated set of 200 multiple choice puzzles. They're unique every time they're generated so they can't be cheesed. There's no way to perform well on this test except intelligence.
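A sketch of what "algorithmically generated, unique every time" can look like, using a simple arithmetic puzzle as a stand-in for the actual (unpublished) generator:

```python
import random

def make_puzzle(rng):
    """Generate one fresh multiple-choice arithmetic puzzle.
    Answers can't be memorized or leaked into training data
    because the operands are drawn at generation time."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    answer = a * b
    # Plausible distractors: small perturbations of the true product.
    options = [answer,
               answer + rng.randint(1, 9),
               answer - rng.randint(1, 9),
               answer + 10 * rng.randint(1, 9)]
    rng.shuffle(options)
    letters = "ABCD"
    return {
        "question": f"What is {a} * {b}?",
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }

rng = random.Random(0)  # seeded only so this example is reproducible
puzzle = make_puzzle(rng)
print(puzzle["question"], puzzle["options"], "->", puzzle["answer"])
```

Because each run draws fresh operands and re-shuffles the options, there is nothing to scrape or overfit to; the only way to score well is to actually solve the instance in front of you.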
Since I don't have a substack to promote, I'll instead share a preview of my next effortpost. You *can* actually get SOTA on unseen benchmarks ... if you are willing to be more liberal about what constitutes a "model". Hybrid systems here are any amalgam of models and code.
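The "hybrid system" idea is roughly: route each sub-task to whichever component, deterministic code or a model, is reliable for it. A toy dispatcher under that definition (the `llm` callable and `stub` are hypothetical stand-ins for a real model call):

```python
import re

def hybrid_answer(question, llm):
    """Answer exact-arithmetic questions with code, everything else
    with the model: an amalgam of models and code, per the definition."""
    m = re.fullmatch(r"What is (\d+) \* (\d+)\?", question)
    if m:
        # Code path: multiplication is deterministic, never ask the LLM.
        return str(int(m.group(1)) * int(m.group(2)))
    return llm(question)  # Model path: open-ended questions.

# Stub model for the example:
stub = lambda q: "a large language model answer"
print(hybrid_answer("What is 123 * 456?", stub))  # exact: 56088
print(hybrid_answer("Why is the sky blue?", stub))
```

On a benchmark heavy in tasks the code path covers, a system like this beats the bare model for free, which is exactly why "what counts as a model" matters when comparing scores.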


