We have reached an agreement in principle for Sam Altman to return to OpenAI as CEO with a new initial board of Bret Taylor (Chair), Larry Summers, and Adam D'Angelo.
We are collaborating to figure out the details. Thank you so much for your patience through this.
Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…
SWE-Lancer tasks span the full engineering stack, from UI/UX to systems design, and include a range of task types, from $50 bug fixes to $32,000 feature implementations. SWE-Lancer includes both independent engineering tasks and management tasks, where models choose between technical implementation proposals.
SWE-Lancer tasks more realistically capture the complexity of modern software engineering. Our tasks are full-stack and complex; the average task took freelancers over 21 days to resolve.
These improvements in capabilities can also be leveraged to improve safety. Today we’re releasing a paper on deliberative alignment that shares how we harnessed these advances to make our o1 and o3 models even safer to use. openai.com/index/delibera…
What’s changed since the preview? A faster, more powerful reasoning model that’s better at coding, math & writing.
o1 now also supports image uploads, allowing it to apply reasoning to visuals for more detailed & useful responses.
OpenAI o1 is more concise in its thinking, resulting in faster response times than o1-preview.
Our testing shows that o1 outperforms o1-preview, reducing major errors on difficult real-world questions by 34%.
The updated OpenAI o1 system card builds on prior safety work, detailing robustness evals, red teaming insights, and safety improvements using Instruction Hierarchy. It maintains a "medium" risk rating based on testing with an expanded suite of evaluations, reflecting that it is safe to deploy. openai.com/index/openai-o…