Zhihu Frontier
Nov 8 • 3 min read
πŸš€ "Quantization is not a compromise β€” it's the next paradigm."
After K2-Thinking's release, many developers have been curious about its native INT4 quantization format.
Liu Shaowei (刘少伟), an infra engineer at @Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice matters, and why quantization today isn't just about sacrificing precision for speed.

💡 Key idea
In the context of LLMs, quantization is no longer a trade-off.
As parameter scaling and test-time scaling evolve, native low-bit quantization will become a standard paradigm for large-model training.

🤔 Why Low-bit Quantization Matters
In modern LLM inference, there are two distinct optimization goals:
• High throughput (cost-oriented): maximize GPU utilization via large batch sizes.
• Low latency (user-oriented): minimize per-query response time.
For Kimi-K2's MoE structure (with 1/48 sparsity), decoding is memory-bound: the smaller the weights, the faster each decoding step, because speed is limited by reading weights from memory rather than by arithmetic.
FP8 weights (≈1 TB) already hit the limit of what a single high-speed-interconnect GPU node can handle.
⚠️ Switching to W4A16 cuts latency sharply while maintaining quality, a natural fit for low-latency inference.
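A back-of-envelope sketch of why memory-bound decoding benefits from smaller weights. The parameter count and bandwidth figures below are illustrative assumptions, not Kimi's published numbers:

```python
# Rough lower bound on decode latency for a memory-bound MoE model:
# every decoded token must read all active weights from memory once.
ACTIVE_PARAMS = 32e9   # parameters touched per token (assumption)
BANDWIDTH = 3e12       # bytes/s of aggregate memory bandwidth (assumption)

def decode_ms_per_token(bytes_per_param: float) -> float:
    """Time to stream the active weights once, in milliseconds."""
    return ACTIVE_PARAMS * bytes_per_param / BANDWIDTH * 1e3

fp8_ms = decode_ms_per_token(1.0)    # FP8: 1 byte per weight
int4_ms = decode_ms_per_token(0.5)   # INT4: 0.5 bytes per weight

print(f"FP8 : {fp8_ms:.2f} ms/token")
print(f"INT4: {int4_ms:.2f} ms/token ({fp8_ms / int4_ms:.1f}x faster)")
```

Halving the bytes per weight halves the memory traffic per token, which is exactly why W4A16 helps the latency-oriented goal rather than the throughput-oriented one.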

πŸ” Why QAT over PTQ
Post-training quantization (PTQ) worked well for shorter generations but failed on longer reasoning chains:
• Error accumulated over long decoding and degraded precision.
• Dependence on calibration data caused "expert distortion" in sparse MoE layers.
‼️ Thus, K2-Thinking adopted quantization-aware training (QAT) for minimal loss and more stable long-context reasoning.

🧠 How it works
K2-Thinking uses weight-only QAT with fake quantization plus a straight-through estimator (STE).
The pipeline was fully integrated in just days, from QAT training to INT4 inference to RL rollout, enabling near-lossless results without extra tokens or retraining.
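A minimal NumPy sketch of what weight-only fake quantization with an STE looks like. The symmetric INT4 scheme, group size, and all details here are illustrative assumptions, not the actual K2 recipe:

```python
import numpy as np

def fake_quant_int4(w: np.ndarray, group_size: int = 32) -> np.ndarray:
    """Symmetric per-group INT4 fake quantization: quantize then dequantize,
    so the forward pass only sees INT4-representable values while the latent
    weights stay in full precision."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # symmetric range [-7, 7]
    scale = np.maximum(scale, 1e-8)                      # guard all-zero groups
    q = np.clip(np.round(g / scale), -7, 7)
    return (q * scale).reshape(w.shape)

def ste_backward(grad_wrt_quantized: np.ndarray) -> np.ndarray:
    """Straight-through estimator: round() has zero gradient almost everywhere,
    so the gradient w.r.t. the fake-quantized weights is passed through
    unchanged to the latent full-precision weights."""
    return grad_wrt_quantized

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
wq = fake_quant_int4(w)
print("max abs quantization error:", float(np.abs(w - wq).max()))
```

Because the forward pass already sees INT4-shaped weights during training, exporting real INT4 weights for inference (and for RL rollouts) introduces no additional mismatch.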

⚡ INT4's hidden advantage in RL
Few people mention this: native INT4 doesn't just speed up inference, it accelerates RL training itself.
RL rollouts often suffer from "long-tail" inefficiency, where a few very long generations hold up the whole batch; INT4's low-latency decoding shortens exactly those stages.
In practice, each RL iteration runs 10-20% faster end-to-end.
Quantized RL also brings stability: the smaller representational space reduces accumulated error, improving learning robustness.
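A toy model of why faster decoding pays off disproportionately in rollouts: the batch's wall time is set by its slowest (longest) generation. The distribution and latency numbers are made up for illustration, not measured Kimi figures:

```python
import random

random.seed(42)
# Long-tailed rollout lengths in tokens (Pareto-distributed, illustrative).
lengths = [int(random.paretovariate(2.0) * 1000) for _ in range(256)]

def batch_wall_time(ms_per_token: float) -> float:
    """Seconds until the last (longest) rollout in the batch finishes."""
    return max(lengths) * ms_per_token / 1000.0

baseline = batch_wall_time(10.0)        # e.g. ~10 ms/token at higher precision
int4 = batch_wall_time(10.0 / 2)        # assume INT4 halves per-token latency

print(f"rollout batch wall time: {baseline:.1f}s -> {int4:.1f}s")
```

Since the tail rollout dominates, any per-token speedup translates almost one-for-one into shorter RL iterations, which is the effect described above.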

🔩 Why INT4, not MXFP4
Kimi chose INT4 over "fancier" MXFP4/NVFP4 to better support non-Blackwell GPUs, leveraging mature existing kernels (e.g., Marlin).
At a quantization scale of 1×32 (one scale per 32 weights), INT4 matches the FP4 formats in expressiveness while being more hardware-adaptable.
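The storage cost of a 1×32 group is easy to work out. Assuming one FP16 scale per 32-weight group (a common convention; the exact scale format K2 uses is an assumption here):

```python
def effective_bits_per_weight(weight_bits: int, group_size: int,
                              scale_bits: int = 16) -> float:
    """Payload bits per weight plus the amortized per-group scale."""
    return weight_bits + scale_bits / group_size

int4_g32 = effective_bits_per_weight(4, 32)   # 4 + 16/32 = 4.5 bits/weight
fp8_bits = 8.0                                # FP8 baseline, no group scales

print(f"INT4, group 32: {int4_g32} bits/weight")
print(f"compression vs FP8: {fp8_bits / int4_g32:.2f}x")
```

The per-group scale overhead is only half a bit per weight, so the fine-grained 1×32 granularity costs little while letting each small group use its own dynamic range.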

🧭 Looking forward
W4A16 is just the beginning; W4A8 and even W4A4 are on the horizon.
As new chips roll out with FP4-native operators, Kimi's quantization path will continue evolving.

"In the LLM age, quantization stands alongside SOTA and Frontier.
It's not a patch β€” it's how we'll reach the frontier faster."

📖 Full article (in Chinese): zhihu.com/question/19695…
#KimiK2Thinking #INT4 #Quantization #LLM #Infra #RLHF
