ML Research Engineer @huggingface 🤗 | Kernel Hacker | Making LLMs Faster & Smaller ⚡
Aug 5 • 8 tweets • 3 min read
The new GPT-OSS models are Mixture of Experts (MoEs), with 20B and 120B parameters.
Since expert weights make up ~90% of the model, OpenAI decided to quantize them to 4 bits during post-training using the MXFP4 standard.
Quantizing them to MXFP4 lets the larger model fit on a single 80GB GPU and the smaller model run on systems with as little as 16GB of memory (it even runs on Colab!).
Let's see how MXFP4 works and what makes it special 👇
1/ What is MXFP4?
• MXFP4 stands for Microscaling 4-bit Floating Point (a standard introduced by the Open Compute Project)
• Uses E2M1 format (2 exponent bits, 1 mantissa bit)
• E8M0 scaling per 32 elements
👉 Commonly referred to as 4-bit quantization with group_size = 32 (see the sketch below)
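To make this concrete, here's a rough NumPy sketch of how one 32-element group could be quantized to MXFP4. The scale recipe and rounding rule are illustrative, not necessarily OpenAI's exact implementation:

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (the sign bit covers negatives)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_group(x):
    """Quantize one group of 32 floats: a shared power-of-two (E8M0) scale
    plus one 4-bit E2M1 value per element."""
    assert x.size == 32
    amax = np.abs(x).max()
    # Shared E8M0 scale: a pure power of two. One common recipe uses
    # exponent = floor(log2(amax)) - 2, since the largest E2M1 exponent is 2;
    # values that still exceed 6 after scaling saturate to 6 below.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    scaled = x / scale
    # Round each scaled magnitude to the nearest representable E2M1 value
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return scale, q  # dequantization is simply scale * q

x = np.random.randn(32).astype(np.float32)
scale, q = quantize_mxfp4_group(x)
print("max abs error:", np.abs(x - scale * q).max())
```

Per group this stores 32 × 4-bit values plus one 8-bit scale, i.e. roughly 4.25 bits per weight instead of 16 for bf16.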
Sep 2, 2024 • 11 tweets • 5 min read
While working on the 1.58-bit LLM project @huggingface, I played around with some kernels for Int2xInt8. I wrote my first Triton kernel: instead of unpacking the weights before the matrix multiplication, I fused the two operations and unpacked the weights on the fly.
You can find the code here: gist.github.com/MekkCyber/78c1…
More details in the 🧵
1/ A GPU performs calculations by using thousands of small processing units called threads, which are grouped into blocks. Each thread can access fast local memory (registers), shared memory within its block, and a larger, slower global memory that all threads can use.
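Before the thread dives deeper, here's a rough NumPy sketch of the unpack-on-the-fly idea the kernel is built around: the 2-bit weight codes stay packed four-per-byte in memory, and each group of four weight rows is decoded right before it is multiplied. The packing layout and the code-to-value mapping below are illustrative, not the exact kernel's:

```python
import numpy as np

def pack_int2(codes):
    """Pack a (K, N) matrix of 2-bit codes (values 0..3) into a (K // 4, N)
    uint8 matrix, four consecutive rows per byte."""
    K, N = codes.shape
    packed = np.zeros((K // 4, N), dtype=np.uint8)
    for i in range(4):
        packed |= ((codes[i::4] & 0x3) << (2 * i)).astype(np.uint8)
    return packed

def matmul_unpack_on_the_fly(a, packed):
    """Compute a @ W without ever materializing the full unpacked W:
    each byte-row is decoded into its four weight rows right before use."""
    M, N = a.shape[0], packed.shape[1]
    out = np.zeros((M, N), dtype=np.int32)
    for byte_row in range(packed.shape[0]):
        for i in range(4):
            k = byte_row * 4 + i
            # Shift out the 2-bit code, then map codes 0, 1, 2 to the
            # ternary weights -1, 0, 1
            w_row = ((packed[byte_row].astype(np.int32) >> (2 * i)) & 0x3) - 1
            out += np.outer(a[:, k].astype(np.int32), w_row)
    return out

# Quick check against a reference matmul with fully unpacked weights
M, K, N = 8, 32, 16
a = np.random.randint(-8, 8, size=(M, K))
codes = np.random.randint(0, 3, size=(K, N))   # ternary codes 0, 1, 2
assert np.array_equal(matmul_unpack_on_the_fly(a, pack_int2(codes)), a @ (codes - 1))
```

On the GPU, the same idea maps onto the blocks described above: each block computes one output tile and its threads decode the packed bytes locally instead of reading a pre-unpacked weight matrix from slow global memory, which is roughly what the fused Triton kernel does tile by tile.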