• Ilya plotted over a year with Mira to remove Sam
• Dario wanted Greg fired and himself in charge of all research
• Mira told Ilya that Sam pitted her against Daniela
• Ilya wrote a 52-page memo to get Sam fired and a separate doc on Greg
• Ilya didn't expect employees to feel strongly about Sam's firing
• Adam D'Angelo asked Ilya to prepare the memo
• Mira told Ilya that Greg was fired from Stripe
• Mira provided Ilya with screenshots of texts between Greg and Sam
I looked at ZeroBench. I didn't like any of the examples I looked at. I would not interpret a significant improvement on this eval as a significant improvement in models' visual reasoning.
(1/8)
The main issues are:
(1) The visual reasoning tested is too simple. Many questions are essentially counting different classes of objects and then summing or multiplying the counts. For example, counting the number of pens that have caps.
(2/8)
(2) The difficulty of ZeroBench is artificially inflated. Images have very tiny features, or models need to perform the same subtask 100 times, or chain together many simpler subtasks.
For example, q27 requires a model to count the number of tiles on this wall.
Why GPT-4 in particular, not another model? OpenAI has announced a new GPT every year so far, but not this year... yet
So some predictions are more specific to OpenAI than what I might say for Google, FAIR, DeepMind, etc.
16k-32k context window: There's been a 4x increase in context window every year, so following this trend, 2022 should see a 32k-token context window. (Psst - OpenAI silently bumped up the context window for text-davinci-002)
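The 4x-per-year extrapolation above can be sketched in a few lines. This is just the trend arithmetic, not anything OpenAI has stated; the base of a 2,048-token window in 2020 (GPT-3's documented limit) and the 4x annual growth factor are assumptions taken from the prediction itself.

```python
# Assumed base: GPT-3's 2,048-token context window, released in 2020.
# The 4x-per-year growth factor is the trend claimed in the thread, not
# an official roadmap.
BASE_TOKENS = 2048
BASE_YEAR = 2020

def projected_window(year: int, base: int = BASE_TOKENS,
                     start: int = BASE_YEAR, factor: int = 4) -> int:
    """Project the context window size for a given year under the trend."""
    return base * factor ** (year - start)

print(projected_window(2022))  # 2048 * 4**2 = 32768, i.e. a 32k window
```

Under these assumptions the 2022 projection lands exactly at 32k tokens, the top of the predicted 16k-32k range.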