Introducing GLM-OCR: SOTA performance, optimized for complex document understanding.
With only 0.9B parameters, GLM-OCR delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.
Optimized for real-world scenarios: it handles complex tables, code-heavy documents, official seals, and other challenging elements where traditional OCR fails.
The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
GLM-OCR processes PDF documents at 1.86 pages/second and standalone images at 0.67 images/second, significantly outperforming comparably sized models.
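The two-stage design described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `detect_layout` and `recognize` are hypothetical stubs standing in for PP-DocLayout-V3 and the GLM-OCR recognizer, and the parallelism is a plain thread pool over independent regions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Region:
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels

def detect_layout(page):
    """Stage 1 (stubbed): a layout model such as PP-DocLayout-V3 would
    split the page into typed regions here."""
    return [Region("text", (0, 0, 800, 120)),
            Region("table", (0, 140, 800, 400))]

def recognize(page, region):
    """Stage 2 (stubbed): the recognition model would transcribe the
    crop for one region; here we just echo the region type."""
    return f"<{region.kind} content>"

def ocr_page(page, max_workers=8):
    regions = detect_layout(page)  # stage 1: layout analysis
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # stage 2: regions are independent, so recognize them in parallel
        return list(pool.map(lambda r: recognize(page, r), regions))

print(ocr_page("page-1.png"))
```

Because recognition is per-region rather than per-page, throughput scales with the number of workers while `pool.map` keeps results in reading order.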
The GLM-4.6V series ships in two sizes:
- GLM-4.6V (106B): flagship vision-language model with a 128K context window
- GLM-4.6V-Flash (9B): ultra-fast, lightweight version for local and low-latency workloads
First-ever native Function Calling in the GLM vision model family
GLM-4.6V can accept multimodal inputs of various types and automatically generate high-quality, structured image-text interleaved content.
GLM-4.6V delivers an end-to-end multimodal search-and-analysis workflow, moving seamlessly from visual perception to online retrieval, reasoning, and a final answer.
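A function-calling request for this workflow might look like the payload below. This is a hedged sketch assuming an OpenAI-style chat-completions schema; the model identifier, the `web_search` tool, and the image URL are all illustrative assumptions, not official API details.

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # assumed tool name, not an official API
        "description": "Search the web for information about an image.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

payload = {
    "model": "glm-4.6v",  # assumed model identifier
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/landmark.jpg"}},
            {"type": "text",
             "text": "Where was this photo taken? Search the web if unsure."},
        ],
    }],
    "tools": tools,
}

print(json.dumps(payload, indent=2))
```

With native function calling, the model can respond with a structured `web_search` call, the client executes it, and the results are fed back for the final grounded answer.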
Introducing GLM-4.5V: a breakthrough in open-source visual reasoning
GLM-4.5V delivers state-of-the-art performance among open-source models in its size class, leading across 41 benchmarks.
Built on the GLM-4.5-Air base model, GLM-4.5V inherits proven techniques from GLM-4.1V-Thinking while achieving effective scaling through a powerful 106B-parameter MoE architecture.
Through efficient hybrid training, GLM-4.5V handles diverse types of visual content, achieving comprehensive visual reasoning across a wide range of scenarios, including:
- Image Reasoning (scene understanding, complex multi-image analysis, geography recognition)
- Video Understanding (long video storyboard analysis, event recognition)
- GUI Tasks (screen reading, icon recognition, desktop operation assistance)
- Complex Chart and Document Analysis (research report analysis, information extraction)
- Grounding Capability (precise localization of visual elements)
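For the grounding capability, vision-language models commonly emit box coordinates on a fixed normalized grid rather than raw pixels. As a sketch, assuming a 0–1000 normalized scale (an assumption here, not a documented GLM output format), converting a predicted box to pixel coordinates is a simple rescaling:

```python
def box_to_pixels(box, width, height, scale=1000):
    """Convert a normalized (x0, y0, x1, y1) box, assumed to lie on a
    0..scale grid, to pixel coordinates for the given image size."""
    x0, y0, x1, y1 = box
    return (round(x0 / scale * width), round(y0 / scale * height),
            round(x1 / scale * width), round(y1 / scale * height))

# e.g. a box covering the right half of a 1920x1080 image
print(box_to_pixels((500, 0, 1000, 1000), 1920, 1080))
# → (960, 0, 1920, 1080)
```

Normalized coordinates keep the model's output independent of input resolution; the client rescales once the true image size is known.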
Webpage Replication: Please generate a high-quality UI interface using CSS and HTML based on the webpage I provided. chat.z.ai/s/f4389582-bcd…