Latest Twitter Threads by @_sumeetc on Thread Reader App

Apr 13 • 6 tweets • 2 min read

most people compare AI models

that’s the wrong abstraction

same model + different setup = completely different agent behavior

they’re effectively different engineers

we ran 70+ coding sessions of Claude Code and Codex

the gap wasn’t where we expected 🧵

our verifier agent scores every session on 5 dimensions of Agency:

initiative
collaboration
reasoning
compliance
efficiency

not benchmarks
actual execution
actual files
actual policy gates

and it explains the reasoning behind every score it gives
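
a rough sketch of what one scored session could look like as data. the thread doesn't show the verifier's real schema, so everything here is an assumption: the `DimensionScore` and `validate_session` names, the 0-10 scale, and the rule that a score without a rationale gets rejected. only the five dimension names come from the thread.

```python
from dataclasses import dataclass

# The five dimensions named in the thread.
DIMENSIONS = ("initiative", "collaboration", "reasoning", "compliance", "efficiency")


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical record: one score plus the verifier's explanation."""
    score: float    # assumed 0-10 scale; the real scale is not stated
    rationale: str  # why the verifier gave this score


def validate_session(scores: dict[str, DimensionScore]) -> None:
    """Raise ValueError unless every dimension has an in-range, explained score."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        entry = scores[name]
        if not 0 <= entry.score <= 10:
            raise ValueError(f"{name}: score {entry.score} outside assumed 0-10 range")
        if not entry.rationale.strip():
            raise ValueError(f"{name}: score has no rationale")
```

pairing each number with its rationale in one record keeps a score from being stored without the explanation that justifies it.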
