Elliot Glazer @Oberwolfach◼️🌳🌳
Dec 25, 2024 12 tweets 3 min read
1/12 FrontierMath’s three-part rating of Background (1–5), Creativity (hours of insight), and Execution (solution time) lets us precisely gauge problem difficulty. These ratings help provide context on o3’s benchmark results.
2/12 Background spans from high school (1) to research-level (5). Creativity asks how long an expert needs to uncover the core idea. Execution accounts for meticulous solutions, including heavy coding or advanced computations.
Dec 21, 2024 9 tweets 2 min read
1/9 We’re announcing the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3’s performance is remarkable, but there’s still a ways to go before any single AI system nears the collective genius of the math community.
2/9 For context, FrontierMath currently spans three broad tiers:
• T1 (25%) Advanced, near top-tier undergrad/IMO
• T2 (50%) Needs serious grad-level background
• T3 (25%) Research problems demanding relevant research experience
All can take hours—or days—for experts to solve.