swyx (@swyx)
Mar 14, 2023 · 17 tweets
GPT4 is live!!!

openai.com/research/gpt-4
GPT4 gets 100% accuracy on this HumanEval task.

previous iterations were all under 50%.

holy shit.
Entire industries are being ripped up and chewed out.

RIP to:

- law students
- undergrads
- grad students
- High school biology/Calculus/Economics/Physics/Statistics/Govt

- Sommeliers
- Leetcoders
at least 10 full percentage point improvements on SOTA across 3 of the top/hardest LLM benchmarks.

improvements effortlessly transferring across languages.
But the headliner feature (widely rumored, now confirmed) is multimodality -

GPT can now see.

Not just classifying entire images.
Not segments of images.
This is arbitrarily flexible **visual comprehension**.

(and naturally, conversation, as we saw with Visual ChatGPT)
Another benefit of multimodality:

Combining visual inputs with world knowledge means you can ask it questions no vision model would be able to understand

Spotting unusual contexts and explaining memes.

Tell me this isn't some form of general intelligence.
People are likely to overlook the fine print in the paper but this is as big in my mind - major advances in safety have also been made in GPT4

19-29 full point improvements in hallucination reduction. A lot of work put into Harmful content alignment (controversial choice, ofc)
Prompt engineers unite - the GPT4 API now takes up to 50 pages of text (32k token context)!!!!!!

(Join Travis' discord to join 10k other ChatGPT hackers discord.gg/v9gERj825w)

as LLMs grow and grow and grow in capabilities, it is getting more important to have good model evaluation/benchmarking frameworks.

OpenAI is also releasing their eval framework, fully MIT licensed: github.com/openai/evals

Used by Stripe and well documented. Runs MMLU in 189 LOC
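To make the idea concrete, here is a minimal sketch of what an exact-match eval harness does conceptually — run a model callable over (prompt, ideal) samples and score the matches. The function names and toy model below are hypothetical illustrations, not the actual openai/evals API.

```python
# Minimal sketch of an exact-match eval loop, in the spirit of an
# eval framework. Names here are illustrative, not openai/evals' API.

def run_match_eval(model, samples):
    """Score a model callable on (prompt, ideal answer) pairs."""
    correct = 0
    for prompt, ideal in samples:
        completion = model(prompt).strip()
        if completion == ideal:
            correct += 1
    return correct / len(samples)

# A toy "model" standing in for a real API call.
def toy_model(prompt):
    return {"2+2=": "4", "capital of France?": "Paris"}.get(prompt, "?")

samples = [("2+2=", "4"), ("capital of France?", "Paris"), ("3*3=", "9")]
accuracy = run_match_eval(toy_model, samples)
```

The real framework adds a registry, multiple match strategies, and model-graded evals on top of this basic loop.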
GPT4 developer livestream in 2 hours.

We're hosting an impromptu watch party on the Latent Space discord: discord.gg/zVH8rvw6?event…
.@OpenAI seems to be echoing Ajeya Cotra's view that predicting scaling capability is key to managing AI safety. We can predict 10,000x ahead by extrapolating from smaller models. In that respect GPT4 has been a near-complete success using the now-familiar Azure supercluster.
@OpenAI In the generative AI era, it's very very good to be an OpenAI partner.

GPT4's image capability is launch exclusive to one nonprofit.

Stripe tested OpenAI Eval.

@yusuf_i_mehdi confirms Bing Chat runs on GPT4.

Khan Academy launched today with GPT-4 powered personal tutoring.
Whoop, i just got let into the ChatGPT4 beta!
This is going to get completely lost in the noise but @AnthropicAI launched Claude/Claude+ (with @notionhq and @poe_platform launch partner) and Google launched their PaLM API today as well

Google's blogpost is a hot mess tho lol
@AnthropicAI @NotionHQ @poe_platform Bombshell of a One More Thing dropped by @gdb:

GPT4 does Math now.

TAX MATH.

and it checks out.

thefuq???

live discussion ongoing twitter.com/i/spaces/1eaKb…
GPT4's image description capability is orders of magnitude better than existing CLIP derivative approaches.

Just yesterday i was listening to @labenz discuss the longevity of BLIP with @LiJunnan0409 and @DongxuLi_, i wonder what they think about it now...

The @CadeMetz NYT article offers a few more ideas for GPT4 multimodal usecases - take a photo of your fridge contents, get meal ideas.

inferring an impressive amount from very little visual info, and then putting them together in a sensible combo

needs adversarial testing 👿

More from @swyx

Apr 14
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:

- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
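The recommendations above can be sketched as a tiny prompt builder: instructions at BOTH top and bottom, with context wrapped in XML rather than JSON. This helper is hypothetical, not OpenAI's code:

```python
# Hypothetical prompt builder following the guide's recommendations:
# instructions repeated at top AND bottom, context as XML, not JSON.

def build_prompt(instructions, documents, user_query):
    docs_xml = "\n".join(
        f'<doc id="{i}">{d}</doc>' for i, d in enumerate(documents)
    )
    return (
        f"{instructions}\n\n"        # instructions at the TOP
        f"<context>\n{docs_xml}\n</context>\n\n"
        f"{user_query}\n\n"
        f"{instructions}"            # ...and repeated at the BOTTOM
    )

prompt = build_prompt(
    "You are a persistent agent; keep going until the task is done.",
    ["Paris is the capital of France."],
    "What is the capital of France?",
)
```

Per the guide's numbers, the top-and-bottom repetition is the one that matters most — bottom-only placement reportedly hurts badly.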
we're publishing our GPT 4.1 chat with @michpokrass later today pls like n sub youtube.com/@latentspacepod
Mar 30
.swyxrules for working on long running/expensive code:
Unbundle run() into:

- init()
- continue(id)
- cleanup(id)

never assume you will call these in order
Always checkpoint and resume from id's.

Pass nothing else.

This forces you to keep things serializable and therefore loggable, reproducible, parallelizable.

by the way try not to name them 'id' if you can add extra detail like 'runId', 'taskId', 'subTask2Id'.
Jan 12
was fun to work on this @latentspacepod post w/ @benhylak

inspired by "pivot token" literature, one gpt->o1 mental model shift i've made is the role of self-evaluation and -correction as an ESSENTIAL part of planning/reasoning.

with o1, you move the LLM-as-judge *INTO THE PROMPT*, so you can let it handle the self eval and replanning. this is the incremental next "agentic" step, which openai consistently does well, to the frustration of more hyperbolic but snake oily alternatives.
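One way to picture the shift (toy functions standing in for API calls — none of this is OpenAI code, and the prompts are made up):

```python
# Illustrative contrast: where does the "judge" live?

def toy_model(prompt):
    # pretend the model improves when told to self-critique
    return "good answer" if "critique your own" in prompt else "draft"

# gpt-4 era: judge is OUTSIDE the prompt, in your orchestration loop.
def external_judge_loop(task, max_rounds=3):
    answer = toy_model(task)
    for _ in range(max_rounds):
        verdict = toy_model(f"Judge this answer: {answer}")
        if verdict != "retry":
            break
        answer = toy_model(f"{task}\nPrevious attempt: {answer}")
    return answer

# o1 era: self-eval moves INTO the prompt; one call,
# and the model iterates internally before answering.
def judge_in_prompt(task):
    return toy_model(
        f"{task}\nDraft a plan, critique your own draft, and revise "
        f"until it passes your own review before answering."
    )

result = judge_in_prompt("Solve the task.")
```

The orchestration code shrinks to a single call; the eval-and-replan loop becomes the model's problem.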
@latentspacepod @benhylak full context latent.space/p/o1-skill-iss…

and here

x.com/benhylak/statu…
@latentspacepod @benhylak shamelessly ripped off lmao
Dec 14, 2024
this neurips is really going to be remembered as the "end of pretraining" neurips

notes from doctor @polynoamial's talk on scaling test time compute today

(thank you @oh_that_hat for organizing)
all gains to date have been from scaling data and pretrain compute and yet LLMs can't solve simple problems like tictactoe

however inference costs have scaled much less.
goes back to libratus/pluribus work

poker model scaling from 2012-2015 - scaled 5x each year, but still lost dramatically (9 big bets per hundred) to poker pros in 80k hands

recalls familiar insight about humans taking longer to think for harder problems.

added 20s of search - distance from nash equilibrium reduced by a factor of 7 - roughly the equivalent of scaling up model size by 100,000x
Oct 1, 2024
Here’s my @OpenAIDevs day thread for those following along. everyone else gotchu with videos and stuff so i will just give personal notes and aha moments thru the day

first observation: @sama MIA

GPT5 still mentioned and on the table



after a nice screenshot of CoCounsel, time for @romainhuet's legendary live demos. o1 one-shots an ios app and does the frontend/backend to control a drone.

ai controlled drones, what could go wrong?


@romainhuet Realtime API announced!

starting with speech to speech support
all 6 adv voice mode voices supported

demo next
Sep 30, 2024
just realized NotebookLM is @GoogleDeepMind's ChatGPT moment

- "low key research preview"/"experimental"
- not monetized
- GPUs/TPUs immediately on fire
- SOTA proprietary new model buried in there with upgrades that weren't previously announced
- new AI UX that cleverly embeds LLM usage natively within the product features

in this case NBLM nailed multimodal RAG and I/O in a way that @ChatGPTapp never did (or for that matter, @GeminiApp). The multiple rounds of preprocessing described by @stevenbjohnson also raise the quality of the audio conversation dramatically at the cost of extreme latency (took an efficient model that was advertised as capable of generating 30s of audio in 0.5s, and slapped on like 200s of LLM latency haha)
@GoogleDeepMind like, i put my podcast into it and it made a podcast of my podcast and... it was good.

do u guys know we spend 1-2 hrs writing up the show notes and now its a button press in NBLM

@GoogleDeepMind - just hired @OfficialLoganK

if i had a penny for every time this has happened i'd have two pennies
