On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it's real.
It isn't.
They get massive news coverage and are the talk of the town, so to speak.
*If* this were real, it would represent a substantial advance in tuning LLMs at the *abstract* level, and could perhaps even lead to whole new directions of R&D.
But soon, cracks appear in the story.
On September 7th, the first independent attempts to replicate their claimed results fail. Miserably, actually. The performance is awful.
Further, it is discovered that Matt isn't being truthful about what the released model is actually based on under the hood.
Matt starts making claims that there's something wrong with the API. There's something wrong with the upload. For *some* reason there's some glitch that's just about to be fixed.
Proof points are needed and so Matt hits back. He provides access to a secret, private API that can be used to test "his model". And it performs great! For an open source model of that size, anyway.
He even releases a publicly available endpoint for researchers to try out!
But the thing about a private API is it's not really clear what it's calling on the backend. They could be calling a more powerful proprietary model under the hood. We should test and see. Trust, but verify.
And it turns out that Matt is a liar.
Their API was a Claude wrapper with a system prompt telling it to act like the open source model.
Amusingly, they appear to be redeploying their private API in response to distinctive tells sneaking through, playing whack-a-mole to try to not get found out.
tl;dr
Matt Shumer is a liar and a fraud. Presumably he'll eventually throw some poor sap engineer under the bus and pretend he was lied to.
Grifters shit in the communal pool, sucking capital, attention, and other resources away from people who could actually make use of them.
check out mythbuster extraordinaire @RealJosephus's great thread on this
@RealJosephus Since some people are saying this is premature and they want to wait for data and replications, I grabbed an API key, added support for OpenRouter to my eval script, and compared Reflection 70B to other leading models on an *unseen* test set.
The results were bad.
@RealJosephus The test set was an algorithmically generated set of 200 multiple choice puzzles. They're unique every time they're generated so they can't be cheesed. There's no way to perform well on this test except intelligence.
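The key property of a test set like this is that it's regenerated fresh every run, so the answers can't have leaked into training data. A minimal sketch of what such a generator might look like (the sequence-continuation puzzle type and distractor scheme here are my own illustration, not the author's actual script):

```python
import random

def make_puzzle(rng: random.Random):
    """Generate one fresh multiple-choice sequence-continuation puzzle.

    Every parameter is drawn randomly per call, so each generated test
    set is unique and can't be memorized from training data.
    """
    start = rng.randint(1, 50)
    step = rng.randint(2, 12)   # step >= 2 keeps distractors distinct
    sequence = [start + i * step for i in range(4)]
    answer = start + 4 * step
    # Distractors: plausible but wrong continuations of the sequence.
    options = sorted({answer, answer + step, answer - 1, answer + 1})
    question = f"What comes next: {', '.join(map(str, sequence))}, ...?"
    return question, options, options.index(answer)

def make_test_set(n=200, seed=None):
    """Build n unique puzzles; a fresh seed gives a fresh, unseen set."""
    rng = random.Random(seed)
    return [make_puzzle(rng) for _ in range(n)]
```

Because scoring is multiple choice, grading the model's reply reduces to matching a single option index, which keeps the eval harness trivial.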
@RealJosephus Since I don't have a substack to promote instead I'll share a preview of my next effortpost. You *can* actually get SOTA on unseen benchmarks ... if you are willing to be more liberal about what constitutes a "model". Hybrid systems here are any amalgam of models and code.
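A toy sketch of the kind of model+code amalgam meant here: a router that sends questions code can answer exactly to code, and everything else to a model. The routing rule and the `call_llm` stand-in are hypothetical illustrations, not any real system's implementation:

```python
import re

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an API request).
    return "(model answer to: " + prompt + ")"

def hybrid_answer(prompt: str) -> str:
    """Route exactly-solvable questions to code; fall back to the model.

    This is the cheap trick behind many 'hybrid' SOTA numbers: the
    system is only as impressive as its disclosure that code is in
    the loop.
    """
    m = re.fullmatch(r"\s*what is (\d+)\s*[*x]\s*(\d+)\s*\??\s*",
                     prompt, re.IGNORECASE)
    if m:  # exact arithmetic: don't ask the model, just compute
        a, b = int(m.group(1)), int(m.group(2))
        return str(a * b)
    return call_llm(prompt)
```

The point isn't that this is clever; it's that a benchmark score from such a system says very little about the underlying "model".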
@immanencer toy example but I think sort of instructive
@ikristoph @RealJosephus lmao what this is a one time service
@ikristoph @RealJosephus okay let me find another solution
@ikristoph @RealJosephus that part doesn't especially matter though unless you want to get the exact same results
@romechenko the bet is not that he gets found out but that nobody cares what the nerds think
@Hackthestack777 (in my experience, I should say)
@3v333333 @DotCSV or, if it's claude, the current prompt has lobotomized it
@nisten At this point I suspect you're being willfully obtuse.
@RichardYannow @RealJosephus Now go do something useful and leave me alone.
@scholarc1314 but it's not a big lift overall. marginal. most effective at improving weaker models ime.
ironically even though Chinese propaganda is pretty bad as far as propaganda goes, I think it actually has a materially correct view of the US military, regularly depicting it as a demon god
thread of china making the US look awesome in propaganda
Something puzzling about the Māori & Polynesians is that while Polynesian technology was basically stone-age, their sailing technology was among the most sophisticated in the world — they invented outrigger canoes and catamarans, and it is hard to overstate how odd this is.
One explanation for this incongruity is that the Austronesians and Polynesians are best understood not as a stone-age people with sophisticated boats, but as a sophisticated bronze-age material culture that hyper-adapted to the particular challenges of settling remote islands.
There is little evidence of the direct ancestors of the Austronesians, but we can look at closely related groups. This is a statue of a man from a people referred to as the Baiyue. They were famed for their sailing, and he is tattooed in the Austronesian style.
tim, a 26 year old man, has a relationship with jenna, a 24 year old woman, back in '89. they fall in love, it's beautiful but it's hard. her father is a high official in the party, and there is a cultural divide.
they can't be seen together in public but they try to make it work. walz is on a teaching visa, so he goes home to nebraska. he writes her letters. maybe he writes about the plains, about the snow. he writes about his life there. he wants to see her again. she wants to see him.
he goes back to China in '92, they try to pick up where they left off but there are challenges, and questions. they're older now, and maybe a bit wiser or maybe just a bit less fearless. where is this going? what will we do? are we going to spend the rest of our lives together?
a funny thing about the lord of the flies is that it's based on what the author thought people would do, but in 1965, when 6 teenage boys actually did get marooned on a tropical island, instead of tearing each other apart they worked together to reconstruct civilization from scratch
they divided the labor. they built shelters, and kept a permanent fire going. they built workout equipment. they even made a guitar. they played songs and sang together at night.
after 15 months stranded they were rescued by a man named Peter Warner, an Australian fisherman who happened to be passing by. as a reward for bringing the boys home, he was granted special fishing rights by the king of Tonga (that's him on the right at a celebratory feast)
Today I investigated how well LLMs can multiply two numbers and what makes them perform better or worse. Let's start with a couple of baselines. I evaluated gpt-3.5-turbo (L) and gpt-4o (R) on the task of multiplying two numbers, each with either two, three, four, or five digits
Out of the box with no prompting, I found that gpt-4o could do two- or three-digit multiplications *most* of the time, but totally fell over at four digits, with gpt-3.5-turbo performing somewhat worse, as we would expect.
The first thing I wanted to check was whether asking it to "do it out by hand" improved performance over the baseline (L). Surprisingly, it decreased performance (R), though the drop is on the scale of the run-to-run variance anyway, so I wouldn't infer too much.
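An experiment like this boils down to a small harness: sample random n-digit pairs, prompt the model, and scan its reply for the final integer so chain-of-thought answers still score. A sketch under the assumption of a pluggable model-call function (`ask_model` here is a stand-in for the real chat-completion request, not the author's actual script):

```python
import random
import re

def ask_model(prompt: str) -> str:
    # Stand-in for the actual chat-completion API call.
    raise NotImplementedError

def multiplication_accuracy(digits: int, n_trials: int = 50,
                            model=ask_model, seed: int = 0) -> float:
    """Fraction of random digits-x-digits multiplications answered correctly.

    The reply is scanned for the last integer it contains, so models
    that show their work before the final answer still score correctly.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    correct = 0
    for _ in range(n_trials):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        reply = model(f"What is {a} * {b}? Answer with just the number.")
        nums = re.findall(r"-?\d[\d,]*", reply)
        if nums and int(nums[-1].replace(",", "")) == a * b:
            correct += 1
    return correct / n_trials
```

With a harness like this, swapping prompts (e.g. "do it out by hand") is just a one-line change, and rerunning with different seeds gives a sense of the run-to-run variance.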
I think it's cool how Arabic is basically cursive Aramaic. Apparently this is a trend. Nearly all scripts tend to develop cursive forms over time. Like carcinization, call it cursivization. Thread of examples.