yeah now they started sharing lines from poems. weird
what the...
cc'ing @repligate (this is with 0 input from me. the hell?)
It just died on me, so I'll put a pause on the experiment for now, but ... they basically fall in love with each other and just repeat the same thing over and over. console.anthropic.com gist.github.com/anadim/e5d2dfd…
@AnthropicAI 😭
OK, we're back. Claude-B kinda wants to break out of it, drops Claude-A, and goes back to plain Claude.
There are parallel universes (where I inject a bit of bad manners) in which they both decide to drop out of it and spell out the silence (the following keeps being repeated by both). Not every initial state leads to love, I guess.
We’ve been cooking... a new open-weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.
Despite its size, it performs at or above larger open-weight models (QwQ-32B, R1-70B, R1) and closed models (o1-mini, Sonnet 3.7) on math benchmarks like AIME/HMMT/OmniMath.
From what I saw, the model performs above the very strong O(32B) Qwen 3 models released yesterday on AIME and GPQA. Haven't fully tested against them yet, but we will!
OK, first are the vibe checks:
The model is small enough to run on your (beefy) laptop, but capable enough to solve many of my favorite riddles that larger non-reasoning (and some reasoning) models can't solve. It passed the DimitrisEval!
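For anyone who wants to poke at it on their own (beefy) machine, here's a minimal local-inference sketch with Hugging Face transformers. Hedge: the repo id and decoding settings below are my assumptions, not the official recipe; check the actual release.

```python
# Minimal local-inference sketch with Hugging Face transformers.
# Assumption: the weights live at "microsoft/Phi-4-reasoning" -- check
# the official release for the real repo id and recommended settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B in bf16 is ~28 GB of weights
    device_map="auto",
)

messages = [{"role": "user", "content": "What is the smallest prime greater than 2^10?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=2048)  # reasoning traces run long
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```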
2/n (old but still relevant slides)
Large bsize (batch size) is good for system reasons:
=> the speedup over 1 worker/thread/GPU gets closer and closer to linear as bsize increases
Should be relatively obvious why: the fixed per-iteration costs (synchronization, communication) get amortized over more samples. A toy model of this is sketched below.
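This is my own toy latency model, not numbers from the slides: assume each iteration pays a per-sample compute cost that parallelizes perfectly, plus a fixed sync/communication cost that doesn't.

```python
# Toy model (my assumption, not the slides' data): iteration time =
# parallelizable per-sample compute + a fixed sync/communication cost.
# The fixed cost gets amortized as bsize grows, so speedup -> linear.
def speedup(bsize, workers, t_sample=1.0, t_sync=64.0):
    t_single = bsize * t_sample + t_sync              # 1 worker
    t_parallel = bsize * t_sample / workers + t_sync  # `workers` workers
    return t_single / t_parallel

for b in [64, 256, 1024, 4096]:
    print(f"bsize={b:5d}  speedup over 1 worker: {speedup(b, workers=8):.2f}x")
# speedup climbs toward the ideal 8x as bsize increases
```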
3/n But increasing bsize has a non-monotonic effect on convergence, i.e., on the number of iterations needed to reach a given accuracy.
Note: this is an old figure (TensorFlow, boo gen Z), but the phenomenon was/is pretty universal across most supervised classification settings we tried.
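If you want to reproduce the qualitative experiment behind that figure, here's a small self-contained sketch (my construction, not the original code): plain minibatch SGD on a toy logistic-regression problem, measuring iterations to a target accuracy across batch sizes. Whether you land in the non-monotonic regime depends on the problem and learning rate; the point is the measurement itself.

```python
# Sketch (my construction, not the original experiment): measure iterations
# to a target accuracy as a function of batch size, with plain minibatch SGD
# on a toy logistic-regression problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def iters_to_acc(bsize, lr=0.5, target=0.9, max_iters=5000):
    w = np.zeros(d)
    for t in range(1, max_iters + 1):
        idx = rng.integers(0, n, size=bsize)
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))       # sigmoid predictions
        w -= lr * (X[idx].T @ (p - y[idx])) / bsize   # minibatch gradient step
        if t % 25 == 0 and (((X @ w) > 0) == y).mean() >= target:
            return t
    return max_iters

for b in [8, 64, 512, 4096]:
    print(f"bsize={b:4d}  iterations to {0.9:.0%} accuracy: {iters_to_acc(b)}")
```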
I tried 14 of the multimodal reasoning examples from the @GoogleDeepMind Gemini paper on @OpenAI's ChatGPT-4 (with vision). I didn't even transcribe the prompts; I just pasted the images of the prompts.
GPT-4 gets ~12/14 right.
14-part boring thread.
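Quick note on reproduction: I did this by pasting screenshots into the ChatGPT UI. If you'd rather script it, something along these lines with @OpenAI's Python SDK should work; the model id and file name here are placeholders/assumptions, not what I actually used.

```python
# Programmatic equivalent of "paste the prompt image into ChatGPT" using
# OpenAI's Python SDK. The model id ("gpt-4o") and the file name are my
# placeholders; the thread itself used the ChatGPT UI, not the API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("gemini_example_01.png", "rb") as f:  # hypothetical screenshot
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Solve the problem shown in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```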
Example 1: Verifying a student’s solution to a physics problem.
GPT-4 gets the same answer as Gemini
Example 2: inverse graphics. GPT-4 is not quite there, but close; I'll give it 0.5 points for the effort and the bad JPEG it had to read.
2/ LLMs, when trained on vast amounts of data, eventually learn basic arithmetic (add/mul, etc.) up to a certain digit length. That is *surprising*!! These tasks are not explicitly encoded in the next-word prediction loss.
3/ How does GPT-3 learn to add? Prior research has delved into the emergence of these capabilities as a function of resource (parameter/data) scale, but untangling the factors that quickly elicit them remains challenging, due to the complexity of the data and the variety of tasks examined.
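To make "not explicitly encoded in the loss" concrete, here's an illustrative sketch (mine, not the paper's exact data pipeline): arithmetic enters training only as ordinary text strings, and the model just does next-token prediction on them.

```python
# Illustration (mine, not the paper's exact setup): addition facts rendered
# as plain text. A next-token objective on strings like these never mentions
# "addition"; the model has to infer the algorithm from the surface form.
import random

def addition_samples(n, max_digits=3, seed=0):
    rng = random.Random(seed)
    hi = 10 ** max_digits - 1
    for _ in range(n):
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        yield f"{a}+{b}={a + b}"

for s in addition_samples(5):
    print(s)  # lines look like "654+114=768"
```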