Here’s my @OpenAIDevs day thread for those following along. everyone else has you covered with videos and stuff, so i will just give personal notes and aha moments thru the day
after some nice screenshots of CoCounsel, time for @romainhuet’s legendary live demos. o1 one-shots an ios app and writes the frontend/backend to control a drone.
ai controlled drones, what could go wrong?
Realtime API announced!
starting with speech to speech support
all 6 adv voice mode voices supported
demo next
Realtime voice mode in playground now
playground shows event logs for u to react to
playground now has autoprompting that also generates fewshot examples and function calling schemas
voice mode has function calling and is weirdly obsessed with strawberries
he is integrating with the @twilio api and ordering strawberries for all of us! classic twilio demo
realtime api uses 4o as backbone and is public beta starting today
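for the curious, the wire protocol is just JSON events over a websocket. a minimal sketch below, assuming the beta endpoint, header, and event names from the launch docs (model name and event types may change):

```python
# minimal Realtime API sketch: connect, request a text response, watch events.
# endpoint/headers/event names assumed from the beta launch docs; verify before use.
import asyncio, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # extra_headers is the kwarg in websockets<13 (renamed later)
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # ask for a text-only response so the event stream is easy to print
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"],
                         "instructions": "say hi to devday"},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])  # the same event log the playground surfaces
            if event["type"] == "response.done":
                break

asyncio.run(main())
```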
pov: u are second fiddle to @altryne and @simonw live blogging and tweeting
openai prompt caching is not as big a discount as Gemini’s or Anthropic’s, but it works WITHOUT CODE CHANGES. lets see how long they cache… devil’s in the details.
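the "no code changes" bit means caching kicks in automatically on repeated prefixes; you just read the cached count back from usage. rough sketch, assuming the `cached_tokens` usage field from the announcement (`big_system_prompt.txt` is a placeholder):

```python
# prompt caching sketch: reuse a long identical prefix across calls and
# check how much of it was served from cache. field names assumed from
# the announcement; no request-side changes are needed.
from openai import OpenAI

client = OpenAI()
shared_prefix = open("big_system_prompt.txt").read()  # needs to exceed ~1024 tokens

for q in ["summarize section 1", "summarize section 2"]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": shared_prefix},
                  {"role": "user", "content": q}],
    )
    # from the second call on, this should be nonzero while the cache is warm
    print(resp.usage.prompt_tokens_details.cached_tokens)
```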
OAI Model Distillation suite!
a bunch of evals and finetuning startups just died
red wedding lives
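mechanically the suite looks simple: flag teacher completions for storage, then eval/fine-tune a smaller model on them from the dashboard. hedged sketch, assuming the `store`/`metadata` params from the launch:

```python
# distillation flow sketch: store gpt-4o ("teacher") completions so they can
# feed evals and fine-tuning of a smaller student model. params assumed
# from the launch announcement.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",                        # teacher model
    store=True,                            # persist this completion for distillation
    metadata={"task": "support-triage"},   # tag so stored runs can be filtered
    messages=[{"role": "user", "content": "classify this ticket: printer on fire"}],
)
# stored completions then show up in the dashboard, ready for evals and for
# fine-tuning a smaller student model, e.g. gpt-4o-mini
print(resp.choices[0].message.content)
```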
exclusive interview with inexplicably photogenic strawberry man coming on @latentspacepod
a brief history of @openai
an o1 session with @hwchung27 and @_jasonwei
lots of cameras recording so just search around for video
- what just became possible with o1?
- what will become possible with future versions of o1?
- what would you want to build if reasoning is 50% better?
- what would you NOT want to build?
@_jasonwei takes the stage: when to use o1-preview/main vs o1-mini.
mini: math, coding
big: finding inaccuracies in datasets, hard-sciences research, legal domain reasoning
q&a: CoT using RL - scaling inference compute. RL focused on backtracking, error correction.
the delta from 4-turbo at last devday to o1 is a lot
Next 2 years will accelerate very fast
AGI will be a smooth exponential, no hard and clear milestone. No one cared when the Turing test was crossed; historians will look back and disagree.
Q: is oai still committed to research?
yes more than ever
there was a time when all we did was scale up research
and other companies copying oai is fine
but trying to do net new things in the world is still very impt to sama
oai will continue to marry research and product tho
Q: oai only paying lip service to alignment?
sama:
- we have a diff take on alignment vs lesswrong
- we care a lot about building safe systems
- we want to make capable models that make it safer and safer over time
- o1 is obviously our most capable model but also our most aligned model
- we have to build models that are safe and robust to be generally accepted in the world
- scifi safety also impt.
Q: how do agents fit into longterm oai plans?
sama:
- chat is great but when u can think for equivalent of multiple days of human effort...
- people say things about agents now but they arent serious. this will be a VERY significant change to the way the world works
- we will ask agents to work on things for a month, multiple of them, and in 2030 we will take this for granted.
Q: hurdles for ai controlling computer?
sama: safety and alignment
Q: can safety have false positives and limit access to ai?
sama: yes it will happen. we could have launched o1 without those limits, but it would have come at a cost.
by the time of o3... itll work. if you try to get it to say something naughty it should follow your instructions.
we start on conservative side, then loosen up.
Q: what should startups that use ai as core feature do?
sama:
- ai doesnt excuse you from any of the normal laws of business.
Q: voice taps directly into human experience. ethics?
sama:
- i say please and thank you to chatgpt. you never know.
kevin:
- o1 will support fn calling, system prompts, etc before EOY
sama:
- the model will get so much better, so fast. o1 is at gpt-2 scale, and we know how to get it to gpt-4
- plan for the model to get rapidly faster
Q: what feature or capability of a competitor do you admire?
sama: notebooklm. very well done. not enough people are shipping new things.
kevin: anthropic did a really good job on projects. gpts are meant for persistent reuse, projects are more ephemeral - the mental model works
sama q to audience: who thinks theyre smarter than o1?
(some raised hands)
do you think you'll still think this by o2?
(nervous laughs)
- sama wants voice mode to sing. just being conservative.
- kevin had a full business conversation in korea w chatgpt. interesting tension btwn chatgpt and speak/duolingo.
another q:
- sama: long context of 10m tokens, then 10 trillion, will come within the decade
WHY IS NOBODY SERVING UP THE SOFTBALL ABOUT THE 7% EQUITY STAKE
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT-4.1, specifically for those building agents, with new recommendations (sketch after the list):
- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
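roughly what a system prompt following those tips could look like (an illustrative sketch, not the guide's literal template):

```python
# illustrative agent system prompt per the 4.1 guide's tips: persistence
# reminder, prompted planning, XML-ish structure instead of JSON, and the
# instructions repeated at the TOP and the BOTTOM of the prompt.
SYSTEM_PROMPT = """<instructions>
You are an agent. Keep going until the user's task is fully resolved
before ending your turn. If you are unsure, use your tools; do not guess.
Plan extensively before each tool call and reflect on each result.
</instructions>

<context>
{retrieved_context}
</context>

<instructions>
(repeat) Persist until the task is resolved; plan before every tool call.
</instructions>"""
```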
was fun to work on this @latentspacepod post w/ @benhylak
inspired by “pivot token” literature, one gpt->o1 mental model shift i’ve made is treating self-evaluation and self-correction as an ESSENTIAL part of planning/reasoning.
with o1, you move the LLM-as-judge *INTO THE PROMPT*, so you can let it handle the self-eval and replanning. this is the incremental next "agentic" step, which openai consistently does well, to the frustration of more hyperbolic but snake-oily alternatives.
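in code, the shift looks something like this (hypothetical helper, model names illustrative): the external generate -> judge -> revise loop collapses into a single call.

```python
# sketch of the gpt -> o1 shift: the LLM-as-judge loop moves inside the prompt.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def solve_gpt4_style(task: str) -> str:
    # old pattern: self-eval and replanning orchestrated externally by you
    draft = ask("gpt-4o", f"Solve: {task}")
    critique = ask("gpt-4o", f"Judge this answer and list errors:\n{draft}")
    return ask("gpt-4o", f"Revise the answer using the critique.\n"
                         f"Answer:\n{draft}\nCritique:\n{critique}")

def solve_o1_style(task: str) -> str:
    # new pattern: ask for the check inside the prompt; the model
    # backtracks and corrects internally during its reasoning
    return ask("o1-preview", f"Solve, checking and correcting your own "
                             f"work before answering: {task}")
```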
this neurips is really going to be remembered as the "end of pretraining" neurips
notes from doctor @polynoamial's talk on scaling test time compute today
(thank you @oh_that_hat for organizing)
all gains to date have been from scaling data and pretrain compute and yet LLMs cant solve simple problems like tictactoe
however inference costs have scaled much less.
goes back to libratus/pluribus work
poker model scaling from 2012-2015 - scaled 5x each year, but still lost dramatically (9 big bets per hundred) to poker pros in 80k hands
recalls familiar insight about humans taking longer to think for harder problems.
added 20s of search - distance from nash equilibrium reduced by a factor of 7 - roughly the equivalent of scaling up model size by 100,000x
just realized NotebookLM is @GoogleDeepMind's ChatGPT moment
- "low key research preview"/"experimental"
- not monetized
- GPUs/TPUs immediately on fire
- SOTA proprietary new model buried in there, with upgrades that weren't previously announced
- new AI UX that cleverly embeds LLM usage natively within the product features
in this case NBLM nailed multimodal RAG and I/O in a way that @ChatGPTapp never did (or for that matter, @GeminiApp). The multiple rounds of preprocessing described by @stevenbjohnson also raise the quality of the audio conversation dramatically at the cost of extreme latency (took an efficient model that was advertised as capable of generating 30s of audio in 0.5s, and slapped on like 200s of LLM latency haha)
like, i put my podcast into it and it made a podcast of my podcast and... it was good.
do u guys know we spend 1-2 hrs writing up the show notes and now its a button press in NBLM
Gemini really took pride in topping @lmsysorg for a hot second, and then @OpenAI said "oh no u dont" and put out 4 straight bangers, pounding everyone into the dust by 50 elo points
V high bar set for Gemini 2, Grok 2.5, and Claude 4 this fall.
Multiple fronts to compete on - reasoning, multiturn chat tuning, instruction following, and coding.
anyway we finally did a @latentspacepod paper club on STaR and friends, swim on by
i hastily sketched out a "paper stack" of what the "literature of reasoning" could look like, but this is amateur work - would love @teortaxesTex or @arattml to map out a full list of likely relevant papers for o1
holy shit @ideogram_ai thumbnails are untapped alpha