Latest Twitter Threads by @michaelqshieh on Thread Reader App

Aug 25, 2025 • 10 tweets • 7 min read

Introducing MCPMark, a collaboration with @EvalSysOrg and @lobehub!

We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 takes the current lead and achieves a Pass@1 of 46.96% while the other models fall in the range of 10-30%.
- Diverse test cases on Notion, GitHub, Filesystem, Playwright (browser), and Postgres.

9🧵s ahead

🧵1/9

MCP (Model Context Protocol) provides an interface for connecting AI and applications and has gained tremendous interest within the AI community.

But how well do models handle MCP use? Frontier LLMs are actually pretty good, so we created MCPMark, a challenging evaluation set to stress-test the models. 😈

🔗Links:
GitHub: github.com/eval-sys/mcpma…
Website: mcpmark.ai
