On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it's real.
It isn't.
They get massive news coverage and are the talk of the town, so to speak.
*If* this were real, it would represent a substantial advance in tuning LLMs at the *abstract* level, and could perhaps even lead to whole new directions of R&D.
But soon, cracks appear in the story.
On September 7th, the first independent attempts to replicate their claimed results fail. Miserably, actually. The performance is awful.
Further, it is discovered that Matt isn't being truthful about what the released model is actually based on under the hood.
Matt starts making claims that there's something wrong with the API. There's something wrong with the upload. For *some* reason there's some glitch that's just about to be fixed.
Proof points are needed and so Matt hits back. He provides access to a secret, private API that can be used to test "his model". And it performs great! For an open source model of that size, anyway.
He even releases a publicly available endpoint for researchers to try out!
But the thing about a private API is it's not really clear what it's calling on the backend. They could be calling a more powerful proprietary model under the hood. We should test and see. Trust, but verify.
And it turns out that Matt is a liar.
Their API was a Claude wrapper with a system prompt telling it to act like the open source model.
Amusingly, they appear to be redeploying their private API in response to distinctive tells sneaking through, playing whack-a-mole to try to not get found out.
tl;dr
Matt Shumer is a liar and a fraud. Presumably he'll eventually throw some poor sap engineer under the bus and pretend he was lied to.
Grifters shit in the communal pool, sucking capital, attention, and other resources away from people who could actually make use of them.
check out mythbuster extraordinaire @RealJosephus's great thread on this
@RealJosephus Since some people are saying this is premature and they want to wait for data and replications, I grabbed an API key, added support for OpenRouter to my eval script, and compared Reflection 70B to other leading models on an *unseen* test set.
The results were bad.
@RealJosephus The test set was an algorithmically generated set of 200 multiple choice puzzles. They're unique every time they're generated so they can't be cheesed. There's no way to perform well on this test except intelligence.
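The key property of a test set like this is that it's regenerated fresh every run, so the answers can't have leaked into training data. A minimal sketch of what such a generator might look like (the sequence-continuation puzzle type and distractor scheme here are my own illustration, not the author's actual script):

```python
import random

def make_puzzle(rng: random.Random):
    """Generate one fresh multiple-choice sequence-continuation puzzle.

    Every parameter is drawn randomly per call, so each generated test
    set is unique and can't be memorized from training data.
    """
    start = rng.randint(1, 50)
    step = rng.randint(2, 12)   # step >= 2 keeps distractors distinct
    sequence = [start + i * step for i in range(4)]
    answer = start + 4 * step
    # Distractors: plausible but wrong continuations of the sequence.
    options = sorted({answer, answer + step, answer - 1, answer + 1})
    question = f"What comes next: {', '.join(map(str, sequence))}, ...?"
    return question, options, options.index(answer)

def make_test_set(n=200, seed=None):
    """Build n unique puzzles; a fresh seed gives a fresh, unseen set."""
    rng = random.Random(seed)
    return [make_puzzle(rng) for _ in range(n)]
```

Because scoring is multiple choice, grading the model's reply reduces to matching a single option index, which keeps the eval harness trivial.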
@RealJosephus Since I don't have a substack to promote instead I'll share a preview of my next effortpost. You *can* actually get SOTA on unseen benchmarks ... if you are willing to be more liberal about what constitutes a "model". Hybrid systems here are any amalgam of models and code.
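A toy sketch of the kind of model+code amalgam meant here: a router that sends questions code can answer exactly to code, and everything else to a model. The routing rule and the `call_llm` stand-in are hypothetical illustrations, not any real system's implementation:

```python
import re

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an API request).
    return "(model answer to: " + prompt + ")"

def hybrid_answer(prompt: str) -> str:
    """Route exactly-solvable questions to code; fall back to the model.

    This is the cheap trick behind many 'hybrid' SOTA numbers: the
    system is only as impressive as its disclosure that code is in
    the loop.
    """
    m = re.fullmatch(r"\s*what is (\d+)\s*[*x]\s*(\d+)\s*\??\s*",
                     prompt, re.IGNORECASE)
    if m:  # exact arithmetic: don't ask the model, just compute
        a, b = int(m.group(1)), int(m.group(2))
        return str(a * b)
    return call_llm(prompt)
```

The point isn't that this is clever; it's that a benchmark score from such a system says very little about the underlying "model".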
@immanencer toy example but I think sort of instructive
@ikristoph @RealJosephus lmao what this is a one time service
@ikristoph @RealJosephus okay let me find another solution
@ikristoph @RealJosephus that part doesn't especially matter though unless you want to get the exact same results
@romechenko the bet is not that he gets found out but that nobody cares what the nerds think
@Hackthestack777 (in my experience, I should say)
@3v333333 @DotCSV or, if it's claude, the current prompt has lobotomized it
@nisten At this point I suspect you're being willfully obtuse.
@RichardYannow @RealJosephus Now go do something useful and leave me alone.
@scholarc1314 but it's not a big lift overall. marginal. most effective at improving weaker models ime.
ironically even though Chinese propaganda is pretty bad as far as propaganda goes, I think it actually has a materially correct view of the US military, regularly depicting it as a demon god
thread of china making the US look awesome in propaganda
Something puzzling about the Māori & Polynesians is that while Polynesian technology was basically stone-age, their sailing technology was among the most sophisticated in the world — they invented outrigger canoes and catamarans, and it is hard to overstate how odd this is.
One explanation for this incongruity is that the Austronesians and Polynesians are best understood not as a stone-age people with sophisticated boats, but as a sophisticated bronze-age material culture that hyper-adapted to the particular challenges of settling remote islands.
There is little evidence of the direct ancestors of the Austronesians, but we can look at closely related groups. This is a statue of a man from a people referred to as the Baiyue. They were famed for their sailing, and he is tattooed in the Austronesian style.
tim, a 26 year old man, has a relationship with jenna, a 24 year old woman, back in '89. they fall in love, it's beautiful but it's hard. her father is a high official in the party, and there is a cultural divide.
they can't be seen together in public but they try to make it work. walz is on a teaching visa, so he goes home to nebraska. he writes her letters. maybe he writes about the plains, about the snow. he writes about his life there. he wants to see her again. she wants to see him.
he goes back to China in '92, they try to pick up where they left off but there are challenges, and questions. they're older now, and maybe a bit wiser or maybe just a bit less fearless. where is this going? what will we do? are we going to spend the rest of our lives together?
a funny thing about the lord of the flies is that it's based on what the author thought people would do, but in 1965, when 6 teenage boys actually did get marooned on a tropical island, instead of tearing each other apart they worked together to reconstruct civilization from scratch
they divided the labor. they built shelters, and kept a permanent fire going. they built workout equipment. they even made a guitar. they played songs and sang together at night.
after 15 months stranded they were rescued by a man named Peter Warner, an Australian fisherman who happened to be passing by. as a reward for bringing the boys home, he was granted special fishing rights by the king of Tonga (that's him on the right at a celebratory feast)
Today I investigated how well LLMs can multiply two numbers and what makes them perform better or worse. Let's start with a couple of baselines. I evaluated gpt-3.5-turbo (L) and gpt-4o (R) on the task of multiplying two numbers, each with either two, three, four, or five digits
Out of the box with no prompting, I found that gpt-4o could do two- or three-digit multiplications *most* of the time, but totally fell over at four digits, with gpt-3.5-turbo performing somewhat worse, as we would expect.
The first thing I wanted to check was whether asking it to "do it out by hand" improved performance over the baseline (L). Surprisingly, it decreased performance (R), though the drop is on the scale of the run-to-run variance anyway, so I wouldn't infer too much.
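An experiment like this boils down to a small harness: sample random n-digit pairs, prompt the model, and scan its reply for the final integer so chain-of-thought answers still score. A sketch under the assumption of a pluggable model-call function (`ask_model` here is a stand-in for the real chat-completion request, not the author's actual script):

```python
import random
import re

def ask_model(prompt: str) -> str:
    # Stand-in for the actual chat-completion API call.
    raise NotImplementedError

def multiplication_accuracy(digits: int, n_trials: int = 50,
                            model=ask_model, seed: int = 0) -> float:
    """Fraction of random digits-x-digits multiplications answered correctly.

    The reply is scanned for the last integer it contains, so models
    that show their work before the final answer still score correctly.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    correct = 0
    for _ in range(n_trials):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        reply = model(f"What is {a} * {b}? Answer with just the number.")
        nums = re.findall(r"-?\d[\d,]*", reply)
        if nums and int(nums[-1].replace(",", "")) == a * b:
            correct += 1
    return correct / n_trials
```

With a harness like this, swapping prompts (e.g. "do it out by hand") is just a one-line change, and rerunning with different seeds gives a sense of the run-to-run variance.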
I think it's cool how Arabic is basically cursive Aramaic. Apparently this is a trend. Nearly all scripts tend to develop cursive forms over time. Like carcinization, call it cursivization. Thread of examples.