The new GPT-OSS models are Mixture of Experts (MoEs), with 20B and 120B parameters.
Since expert weights make up ~90% of the parameters, OpenAI decided to quantize them to 4 bits during post-training using the MXFP4 standard.
Quantizing them to MXFP4 lets the larger model fit on a single 80GB GPU and the smaller model run on systems with as little as 16GB of memory (even on Colab!).
Let's see how MXFP4 works and what makes it special 👇
1/ What is MXFP4?
• MXFP4 stands for Microscaling 4-bit Floating Point (a standard introduced by the Open Compute Project)
• Uses E2M1 format (2 exponent bits, 1 mantissa bit)
• One E8M0 scale per block of 32 elements
👉 Commonly referred to as 4-bit quantization with group_size = 32
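To make E2M1 concrete, here's a tiny plain-Python sketch (my own illustration) that enumerates every value a 4-bit E2M1 code can represent, using the IEEE-style layout with bias 1 and subnormals:

```python
# E2M1: 1 sign bit, 2 exponent bits (bias = 1), 1 mantissa bit.
def e2m1_value(code: int) -> float:
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                          # subnormal: no implicit leading 1
        return sign * (man / 2) * 2 ** 0  # gives 0.0 and 0.5
    return sign * (1 + man / 2) * 2 ** (exp - 1)

values = sorted({e2m1_value(c) for c in range(16)})
print(values)
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

So each 4-bit weight can only take one of these 15 values, and the per-block scale stretches that grid to fit the actual weights.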
2/ How does MXFP4 work?
Weights are chunked into blocks of 32 elements, each sharing an 8-bit scale factor (exponent).
The scale is stored in E8M0 format, i.e. a pure power of 2 → power-of-2 quantization, where the scale factor is rounded to the nearest 2ⁿ.
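Here's a rough NumPy sketch of the end-to-end idea (an illustration, not OpenAI's quantizer; the OCP spec defines the exact scale-selection and rounding rules): pick a power-of-2 scale per 32-element block so the largest element lands near E2M1's max of ±6, then snap each scaled element to the nearest E2M1 grid point.

```python
import numpy as np

# Non-negative E2M1 magnitudes; the full grid is symmetric around 0.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 32-element block: power-of-2 (E8M0) scale + E2M1 values."""
    assert block.size == 32
    amax = np.abs(block).max()
    # Ideal scale maps amax onto 6.0 (E2M1's largest magnitude); rounding the
    # exponent keeps the scale an exact power of 2, as E8M0 requires.
    exp = 0 if amax == 0 else int(np.round(np.log2(amax / 6.0)))
    scale = 2.0 ** exp
    scaled = block / scale
    # Snap each element to the nearest E2M1 grid point (sign handled separately;
    # anything beyond 6 clamps to 6).
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale  # dequantized block is q * scale

block = np.random.randn(32).astype(np.float32)
q, scale = quantize_block_mxfp4(block)
print("max abs error:", np.abs(block - q * scale).max())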
3/ Memory footprint is 🔥
• 32-element block = 32×4 bits + 8-bit scale → 4.25 bits/weight
• ~3.75× smaller than BF16
• ~1.9× smaller than FP8
👉 GPT-OSS 20B shrinks from ~40GB → 13GB
👉 GPT-OSS 120B shrinks from ~240GB → 67GB
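These numbers are easy to sanity-check. The sketch below also shows why the 20B checkpoint lands around ~13GB rather than the naive 20B × 4.25 bits ≈ 10.6GB: only the expert weights (~90% of parameters, per the intro) are quantized; the rest stays in BF16.

```python
# Quick sanity check (sizes in GB, 1 GB = 1e9 bytes).
bits_mxfp4 = (32 * 4 + 8) / 32        # 4.25 bits/weight
print(16 / bits_mxfp4)                # ~3.76x vs BF16 (16 bits)
print(8 / bits_mxfp4)                 # ~1.88x vs FP8 (8 bits)

params = 20e9
expert_share = 0.90                   # ~90% of weights live in the experts
size_gb = (params * expert_share * bits_mxfp4
           + params * (1 - expert_share) * 16) / 8 / 1e9
print(size_gb)                        # ~13.6 GB: experts in MXFP4, rest in BF16
```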
4/ MXFP4 vs NVFP4
Both are 4-bit floating-point formats using E2M1 elements, but they scale their blocks differently:
• MXFP4:
32-element blocks
Single 8-bit power-of-2 (E8M0) scale per block
• NVFP4:
16-element blocks
8-bit FP8 (E4M3) scale per block, plus a per-tensor FP32 scale
Both MXFP4 and NVFP4 are supported on the NVIDIA Blackwell architecture 🖥️⚙️
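The smaller blocks give NVFP4 finer-grained scaling at a slightly higher storage cost per weight:

```python
mxfp4_bits = (32 * 4 + 8) / 32   # 4.25 bits/weight
nvfp4_bits = (16 * 4 + 8) / 16   # 4.5 bits/weight (ignoring the per-tensor scale)
print(mxfp4_bits, nvfp4_bits)
```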
5/ Why is MXFP4 interesting?
• GPT-OSS 20B can run on a Google Colab T4!
• GPT-OSS 120B can run on a single 80GB H100!
• Faster inference on optimized GPUs, with dedicated kernels built in Triton
• Open standard → broad HW support
• Great step toward ultra-low precision
OpenAI releasing an open model using an open standard is a brilliant move.
Great work @OpenAI
MXFP4 is helping push the boundaries of open-source AI 🔓🤝
While working on the 1.58-bit LLM project at @huggingface, I played around with some kernels for Int2×Int8 matmuls. I wrote my first Triton kernel: instead of unpacking the weights before performing the matrix multiplication, I fused the two operations and unpacked the weights on the fly.
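That kernel isn't shown in this thread, so here's a minimal Triton sketch of the same idea for a matrix-vector product: the 2-bit weights stay packed in global memory, and each program unpacks the bytes in registers right before multiplying. The names and the {0,1,2} → {-1,0,+1} packing convention are my assumptions, not the actual project code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def int2_gemv_kernel(
    x_ptr,        # int8 activations, shape (K,)
    wq_ptr,       # packed int2 weights, shape (N, K // 4): 4 weights per byte
    out_ptr,      # int32 output, shape (N,)
    K,            # inner dimension (assumed a multiple of BLOCK_K for simplicity)
    BLOCK_K: tl.constexpr,
):
    # One program per output row: unpack that row's 2-bit weights on the fly
    # and accumulate the dot product with the int8 activations.
    row = tl.program_id(0)
    acc = tl.zeros((BLOCK_K,), dtype=tl.int32)
    for k0 in range(0, K, BLOCK_K):
        offs = k0 + tl.arange(0, BLOCK_K)
        x = tl.load(x_ptr + offs).to(tl.int32)
        # Weight k lives in byte k // 4 of its row, at bit offset 2 * (k % 4).
        # (A real kernel would load each byte once, not four times.)
        packed = tl.load(wq_ptr + row * (K // 4) + offs // 4).to(tl.int32)
        code = (packed >> (2 * (offs % 4))) & 0b11
        w = code - 1              # assumed packing: codes {0,1,2} mean {-1,0,+1}
        acc += x * w
    tl.store(out_ptr + row, tl.sum(acc))

def int2_gemv(x: torch.Tensor, wq: torch.Tensor) -> torch.Tensor:
    """y = W @ x with W stored as packed 2-bit codes (4 per int8 byte)."""
    N, K_packed = wq.shape
    out = torch.empty(N, dtype=torch.int32, device=x.device)
    int2_gemv_kernel[(N,)](x, wq, out, K_packed * 4, BLOCK_K=256)
    return out
```

Fusing the unpack into the matmul means the weights cross the memory bus in their packed 2-bit form, which matters because these kernels are memory-bound.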
1/ A GPU performs calculations by using thousands of small processing units called threads, which are grouped into blocks. Each thread can access fast local memory (registers), shared memory within its block, and a larger, slower global memory that all threads can use.
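In Triton terms, each "program" maps to one of those blocks of threads. The classic minimal kernel below makes the hierarchy visible: tl.load/tl.store move data between slow global memory and fast registers, and each program covers its own slice of the tensor.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                      # which block of threads we are
    offs = pid * BLOCK + tl.arange(0, BLOCK)    # each block covers BLOCK elements
    mask = offs < n                             # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)        # global memory -> registers
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)  # registers -> global memory

n = 10_000
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
assert torch.allclose(out, x + y)
```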