Mohamed · Aug 5
The new GPT-OSS models are Mixture of Experts (MoEs), with 20B and 120B parameters.

Since expert weights make up ~90% of the model, OpenAI quantized them to 4 bits during post-training using the MXFP4 standard.

Quantizing to MXFP4 lets the larger model fit on a single 80GB GPU and the smaller one run on systems with as little as 16GB of memory (even on Colab!).

Let's see how MXFP4 works and what makes it special 👇
1/ What is MXFP4?

• MXFP4 stands for Microscaling 4-bit Floating Point, introduced in the Open Compute Project (OCP) Microscaling standard
• Elements use the E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit)
• One E8M0 scale is shared per block of 32 elements

👉 Commonly referred to as 4-bit quantization with group_size = 32; the sketch below enumerates every value E2M1 can represent
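A minimal sketch, assuming the standard E2M1 decoding with exponent bias 1 (illustrative, not the OCP reference code): it enumerates all 16 codes, which collapse to 15 distinct values since ±0 coincide.

```python
# Decode all 16 E2M1 codes: 1 sign bit, 2 exponent bits, 1 mantissa bit,
# exponent bias 1 (assumption: standard IEEE-style subnormal handling).
def decode_e2m1(code: int) -> float:
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                  # subnormal: 0.m * 2^0 -> 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)  # normal: 1.m * 2^(exp-1)

print(sorted({decode_e2m1(c) for c in range(16)}))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

So every weight in a block must land on one of these 15 grid points, stretched by the block's shared scale.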
2/ How does MXFP4 work?

Weights are chunked into blocks of 32 elements, each block sharing an 8-bit scale factor.
The scale is stored in E8M0 format (8 exponent bits, no mantissa bits), so it is always a power of 2: power-of-2 quantization rounds each block's scale factor to the nearest 2ⁿ, as sketched below.
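A minimal NumPy sketch of the block quantize/dequantize round trip, assuming one common scale choice, 2^(floor(log2(max|w|)) - 2) (illustrative, not the exact OCP reference algorithm):

```python
import numpy as np

# All values representable in E2M1 (the grid from tweet 1)
pos = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-pos[:0:-1], pos])

def quantize_block(w):
    assert w.size == 32
    amax = np.abs(w).max()
    # E8M0 scale: a power of 2 chosen so the block max maps near E2M1's
    # max (6.0); E2M1's largest exponent is 2, hence the "- 2"
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    # snap each scaled weight to the nearest representable E2M1 value
    idx = np.abs(w[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
    return E2M1_GRID[idx], scale   # 4-bit codes (kept as values here) + shared scale

def dequantize_block(q, scale):
    return q * scale

w = np.random.randn(32).astype(np.float32)
q, s = quantize_block(w)
print("max abs error:", np.abs(w - dequantize_block(q, s)).max())
```

Only the shared scale is rounded to a power of 2; the elements themselves snap to the E2M1 grid.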
3/ Memory footprint is 🔥

• 32-element block = 32×4 bits + one 8-bit scale → 4.25 bits/weight
• ~3.75× smaller than BF16
• ~1.9× smaller than FP8
👉 GPT-OSS 20B shrinks from ~40GB → 13GB
👉 GPT-OSS 120B shrinks from ~240GB → 67GB

(Quick arithmetic check below.)
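Pure arithmetic; the only caveat is that FP8 deployments usually carry their own scales, which the flat 8-bit figure ignores:

```python
block_bits = 32 * 4 + 8              # 32 E2M1 codes + one E8M0 scale
bits_per_weight = block_bits / 32
print(bits_per_weight)               # 4.25
print(16 / bits_per_weight)          # ~3.76x vs BF16 (16 bits/weight)
print(8 / bits_per_weight)           # ~1.88x vs FP8 (8 bits/weight)
```

The checkpoints don't shrink by the full ~3.75× because only the expert weights (~90% of parameters) are quantized; the rest stays in higher precision.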
4/ MXFP4 vs NVFP4

Both are 4-bit floating-point formats using E2M1 elements, but:

• MXFP4:
32-element blocks
Single 8-bit (E8M0) power-of-2 scale

• NVFP4:
16-element blocks
Dual scale: FP8 (E4M3) per block + one FP32 tensor-level scale

🔁 NVFP4 = better accuracy (smaller blocks, finer-grained scales)
⚡️ MXFP4 = less compute overhead (see the comparison sketch below)
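The storage difference in two lines, plus the two dequant paths; the helper names are hypothetical, not a real library API:

```python
# Bits per weight for each layout (NVFP4's single FP32 tensor scale
# is negligible for large tensors and is ignored here)
mxfp4_bpw = (32 * 4 + 8) / 32   # 4.25
nvfp4_bpw = (16 * 4 + 8) / 16   # 4.5
print(mxfp4_bpw, nvfp4_bpw)

def dequant_mxfp4(q, s_block):            # s_block: power of 2 (E8M0)
    return q * s_block

def dequant_nvfp4(q, s_block, s_tensor):  # s_block: FP8 (E4M3), s_tensor: FP32
    return q * s_block * s_tensor
```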
5/ Hardware Support:

Both MXFP4 and NVFP4 are supported on the NVIDIA Blackwell architecture 🖥️⚙️
6/ Why is MXFP4 interesting?

• GPT-OSS 20B can run on a Google Colab T4!
• GPT-OSS 120B can run on a single 80GB H100!
• Faster inference on optimized GPUs, with dedicated kernels written in Triton
• Open standard → broad hardware support
• A great step toward ultra-low precision

A loading sketch follows below.
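If you want to try it, a hedged loading sketch with Hugging Face transformers, assuming a recent version with GPT-OSS support (as far as I know, the library falls back to dequantized higher-precision weights where MXFP4 kernels aren't available):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain MXFP4 in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```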
OpenAI releasing an open model using an open standard is a brilliant move.

Great work @OpenAI

MXFP4 is helping push the boundaries of open-source AI 🔓🤝

Models
🔗 GPT OSS 20B: huggingface.co/openai/gpt-oss…
🔗 GPT OSS 120B: huggingface.co/openai/gpt-oss…

MXFP4 Standard: opencompute.org/documents/ocp-…
