The new GPT-OSS models are Mixture of Experts (MoEs), with 20B and 120B parameters.
Since expert weights make up ~90% of the parameters, OpenAI decided to quantize them to 4 bits during post-training using the MXFP4 standard.
Quantizing them to MXFP4 lets the larger model fit on a single 80GB GPU and the smaller model run on systems with as little as 16GB of memory (even on Colab!).
Let's see how MXFP4 works and what makes it special 👇
1/ What is MXFP4?
• MXFP4 stands for Microscaling 4-bit Floating Point (a standard introduced by the Open Compute Project)
• Uses E2M1 format (2 exponent bits, 1 mantissa bit)
• One E8M0 scale per block of 32 elements
👉 Commonly referred to as 4-bit quantization with group_size = 32
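To make E2M1 concrete, here's a tiny plain-Python sketch (my own illustration) that enumerates every value a 4-bit E2M1 code can represent, using the IEEE-style layout with bias 1 and subnormals:

```python
# E2M1: 1 sign bit, 2 exponent bits (bias = 1), 1 mantissa bit.
def e2m1_value(code: int) -> float:
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                          # subnormal: no implicit leading 1
        return sign * (man / 2) * 2 ** 0  # gives 0.0 and 0.5
    return sign * (1 + man / 2) * 2 ** (exp - 1)

values = sorted({e2m1_value(c) for c in range(16)})
print(values)
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

So each 4-bit weight can only take one of these 15 values, and the per-block scale stretches that grid to fit the actual weights.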
2/ How does MXFP4 work?
Weights are chunked into blocks of 32 elements, each sharing an 8-bit scale factor (exponent).
The scale is stored in E8M0 format, i.e. a pure power of 2 → power-of-2 quantization, where the scale factor is rounded to the nearest 2ⁿ.
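Here's a rough NumPy sketch of the end-to-end idea (an illustration, not OpenAI's quantizer; the OCP spec defines the exact scale-selection and rounding rules): pick a power-of-2 scale per 32-element block so the largest element lands near E2M1's max of ±6, then snap each scaled element to the nearest E2M1 grid point.

```python
import numpy as np

# Non-negative E2M1 magnitudes; the full grid is symmetric around 0.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 32-element block: power-of-2 (E8M0) scale + E2M1 values."""
    assert block.size == 32
    amax = np.abs(block).max()
    # Ideal scale maps amax onto 6.0 (E2M1's largest magnitude); rounding the
    # exponent keeps the scale an exact power of 2, as E8M0 requires.
    exp = 0 if amax == 0 else int(np.round(np.log2(amax / 6.0)))
    scale = 2.0 ** exp
    scaled = block / scale
    # Snap each element to the nearest E2M1 grid point (sign handled separately;
    # anything beyond 6 clamps to 6).
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale  # dequantized block is q * scale

block = np.random.randn(32).astype(np.float32)
q, scale = quantize_block_mxfp4(block)
print("max abs error:", np.abs(block - q * scale).max())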
3/ Memory footprint is 🔥
• 32-element block = 32×4 bits + 8-bit scale → 4.25 bits/weight
• ~3.75× smaller than BF16
• ~1.9× smaller than FP8
👉 GPT-OSS 20B shrinks from ~40GB → 13GB
👉 GPT-OSS 120B shrinks from ~240GB → 67GB
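These numbers are easy to sanity-check. The sketch below also shows why the 20B checkpoint lands around ~13GB rather than the naive 20B × 4.25 bits ≈ 10.6GB: only the expert weights (~90% of parameters, per the intro) are quantized; the rest stays in BF16.

```python
# Quick sanity check (sizes in GB, 1 GB = 1e9 bytes).
bits_mxfp4 = (32 * 4 + 8) / 32        # 4.25 bits/weight
print(16 / bits_mxfp4)                # ~3.76x vs BF16 (16 bits)
print(8 / bits_mxfp4)                 # ~1.88x vs FP8 (8 bits)

params = 20e9
expert_share = 0.90                   # ~90% of weights live in the experts
size_gb = (params * expert_share * bits_mxfp4
           + params * (1 - expert_share) * 16) / 8 / 1e9
print(size_gb)                        # ~13.6 GB: experts in MXFP4, rest in BF16
```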
4/ MXFP4 vs NVFP4
Both are 4-bit floating-point formats using E2M1 elements, but they scale their blocks differently:
• MXFP4:
32-element blocks
Single 8-bit power-of-2 (E8M0) scale per block
• NVFP4:
16-element blocks
8-bit FP8 (E4M3) scale per block, plus a per-tensor FP32 scale
Both MXFP4 and NVFP4 are supported on the NVIDIA Blackwell architecture 🖥️⚙️
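The smaller blocks give NVFP4 finer-grained scaling at a slightly higher storage cost per weight:

```python
mxfp4_bits = (32 * 4 + 8) / 32   # 4.25 bits/weight
nvfp4_bits = (16 * 4 + 8) / 16   # 4.5 bits/weight (ignoring the per-tensor scale)
print(mxfp4_bits, nvfp4_bits)
```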
5/ Why is MXFP4 interesting?
• GPT-OSS 20B can run on a Google Colab T4!
• GPT-OSS 120B can run on a single 80GB H100!
• Faster inference on optimized GPUs, with dedicated kernels built in Triton
• Open standard → broad HW support
• Great step toward ultra-low precision
OpenAI releasing an open model using an open standard is a brilliant move.
Great work @OpenAI
MXFP4 is helping push the boundaries of open-source AI 🔓🤝
While working on the 1.58-bit LLM project at @huggingface, I played around with some kernels for Int2×Int8 matmuls. I wrote my first Triton kernel: instead of unpacking the weights before performing the matrix multiplication, I fused the two operations and unpacked the weights on the fly.
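That kernel isn't shown in this thread, so here's a minimal Triton sketch of the same idea for a matrix-vector product: the 2-bit weights stay packed in global memory, and each program unpacks the bytes in registers right before multiplying. The names and the {0,1,2} → {-1,0,+1} packing convention are my assumptions, not the actual project code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def int2_gemv_kernel(
    x_ptr,        # int8 activations, shape (K,)
    wq_ptr,       # packed int2 weights, shape (N, K // 4): 4 weights per byte
    out_ptr,      # int32 output, shape (N,)
    K,            # inner dimension (assumed a multiple of BLOCK_K for simplicity)
    BLOCK_K: tl.constexpr,
):
    # One program per output row: unpack that row's 2-bit weights on the fly
    # and accumulate the dot product with the int8 activations.
    row = tl.program_id(0)
    acc = tl.zeros((BLOCK_K,), dtype=tl.int32)
    for k0 in range(0, K, BLOCK_K):
        offs = k0 + tl.arange(0, BLOCK_K)
        x = tl.load(x_ptr + offs).to(tl.int32)
        # Weight k lives in byte k // 4 of its row, at bit offset 2 * (k % 4).
        # (A real kernel would load each byte once, not four times.)
        packed = tl.load(wq_ptr + row * (K // 4) + offs // 4).to(tl.int32)
        code = (packed >> (2 * (offs % 4))) & 0b11
        w = code - 1              # assumed packing: codes {0,1,2} mean {-1,0,+1}
        acc += x * w
    tl.store(out_ptr + row, tl.sum(acc))

def int2_gemv(x: torch.Tensor, wq: torch.Tensor) -> torch.Tensor:
    """y = W @ x with W stored as packed 2-bit codes (4 per int8 byte)."""
    N, K_packed = wq.shape
    out = torch.empty(N, dtype=torch.int32, device=x.device)
    int2_gemv_kernel[(N,)](x, wq, out, K_packed * 4, BLOCK_K=256)
    return out
```

Fusing the unpack into the matmul means the weights cross the memory bus in their packed 2-bit form, which matters because these kernels are memory-bound.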
1/ A GPU performs calculations by using thousands of small processing units called threads, which are grouped into blocks. Each thread can access fast local memory (registers), shared memory within its block, and a larger, slower global memory that all threads can use.
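In Triton terms, each "program" maps to one of those blocks of threads. The classic minimal kernel below makes the hierarchy visible: tl.load/tl.store move data between slow global memory and fast registers, and each program covers its own slice of the tensor.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                      # which block of threads we are
    offs = pid * BLOCK + tl.arange(0, BLOCK)    # each block covers BLOCK elements
    mask = offs < n                             # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)        # global memory -> registers
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)  # registers -> global memory

n = 10_000
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
assert torch.allclose(out, x + y)
```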