Bringing good stuff to the world.
CMU MLD PhD. Cooked with TPUs at Google Brain.
Leading Tree and Rock AI Lab (TRAIL) at NUS (Singapore)
Aug 25 • 10 tweets • 7 min read
Introducing MCPMark, a collaboration with @EvalSysOrg and @lobehub!
We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 currently leads with a Pass@1 of 46.96%, while the other models fall in the 10-30% range.
- Diverse test cases on Notion, GitHub, Filesystem, Playwright (browser), and Postgres.
9🧵s ahead
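For readers new to the metric: Pass@1 is commonly read as the chance that a single attempt at a task succeeds, averaged over all tasks. The sketch below shows that common definition only. The thread does not say how MCPMark aggregates runs, and the task names and outcomes are made up for illustration.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average, over tasks, of the fraction of independent runs that passed."""
    return mean(sum(runs) / len(runs) for runs in results.values())


# Hypothetical outcomes (NOT MCPMark data): four runs per task.
results = {
    "notion_task": [True, False, True, True],
    "github_task": [False, False, False, False],
    "postgres_task": [True, True, True, True],
    "filesystem_task": [False, True, False, False],
}

print(f"Pass@1 = {pass_at_1(results):.2%}")  # prints "Pass@1 = 50.00%"
```

With one run per task this reduces to the plain fraction of tasks solved.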
🧵1/9
MCP (Model Context Protocol) provides an interface for connecting AI and applications and has gained tremendous interest within the AI community.
But how well do models handle MCP use? Frontier LLMs are actually pretty good, so we created MCPMark, a challenging evaluation set to stress-test them. 😈
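To make "MCP use" concrete: MCP messages are JSON-RPC 2.0, and a model invokes a server-side tool through a `tools/call` request. Here is a minimal sketch of building such a request. The tool name `read_file` and its `path` argument are hypothetical placeholders, not taken from MCPMark or any specific server.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 request that invokes one tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool and arguments, for illustration only.
request = make_tool_call(1, "read_file", {"path": "notes/todo.txt"})
print(request)
```

A benchmark task typically chains many such calls, so the model has to pick the right tool, fill in valid arguments, and react to each result.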