Latest Twitter Threads by @Zai_org on Thread Reader App

Feb 3 • 4 tweets • 2 min read

Introducing GLM-OCR: SOTA performance, optimized for complex document understanding.

With only 0.9B parameters, GLM-OCR delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.

Weights: huggingface.co/zai-org/GLM-OCR
Try it: ocr.z.ai
API: docs.z.ai/guides/vlm/glm…

Optimized for real-world scenarios: It handles complex tables, code-heavy docs, official seals, and other challenging elements where traditional OCR fails.

Dec 8, 2025 • 6 tweets • 3 min read

GLM-4.6V Series is here🚀

- GLM-4.6V (106B): flagship vision-language model with 128K context
- GLM-4.6V-Flash (9B): ultra-fast, lightweight version for local and low-latency workloads

First-ever native Function Calling in the GLM vision model family

Weights: huggingface.co/collections/za…
Try GLM-4.6V now: chat.z.ai
API: docs.z.ai/guides/vlm/glm…
Tech Blog: z.ai/blog/glm-4.6v

API Pricing (per 1M tokens):
- GLM-4.6V: $0.6 input / $0.9 output
- GLM-4.6V-Flash: Free
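
As a rough illustration of what these rates mean per request, here is a minimal sketch that turns token counts into a dollar estimate. The function name `estimate_cost` and the token counts in the usage line are hypothetical, chosen only for illustration; the prices are copied from the list above and may not match current billing.

```python
# Prices in USD per 1M tokens, copied from the pricing list above.
# The dictionary layout and function name are illustrative assumptions,
# not part of any Z.ai SDK.
PRICES_PER_MILLION = {
    "GLM-4.6V": {"input": 0.6, "output": 0.9},
    "GLM-4.6V-Flash": {"input": 0.0, "output": 0.0},  # listed as free
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 200K input tokens and 50K output tokens.
    cost = estimate_cost("GLM-4.6V", 200_000, 50_000)
    print(f"${cost:.3f}")  # prints "$0.165"
```

At these rates, output tokens cost 1.5 times as much as input tokens, so long generated answers weigh more heavily on the bill than long prompts.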

GLM-4.6V can accept multimodal inputs of various types and automatically generate high-quality, structured image-text interleaved content.

Aug 11, 2025 • 5 tweets • 3 min read

Introducing GLM-4.5V: a breakthrough in open-source visual reasoning

GLM-4.5V delivers state-of-the-art performance among open-source models in its size class, dominating across 41 benchmarks.

Built on the GLM-4.5-Air base model, GLM-4.5V inherits proven techniques from GLM-4.1V-Thinking while achieving effective scaling through a powerful 106B-parameter MoE architecture.

Hugging Face: huggingface.co/zai-org/GLM-4.…
GitHub: github.com/zai-org/GLM-V
Z.ai API: docs.z.ai/guides/vlm/glm…
Try it now: chat.z.ai

Through efficient hybrid training, GLM-4.5V handles diverse types of visual content and supports visual reasoning across a broad range of scenarios, including:

- Image Reasoning (scene understanding, complex multi-image analysis, geography recognition)
- Video Understanding (long video storyboard analysis, event recognition)
- GUI Tasks (screen reading, icon recognition, desktop operation assistance)
- Complex Chart and Document Analysis (research report analysis, information extraction)
- Grounding Capability (precise localization of visual elements)
