Alok
Mechatronics Engineer AI belongs on your device. • Offline inference • No subscriptions. Teaching you to own your...
Jun 18 4 tweets 2 min read
Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context

Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM:

Prefill: 1000+ tok/s (42% increase)
Decode: 25+ tok/s (25% increase)
Context: 120k (150% increase)

for comparison, without TurboQuant: prefill was 700 tok/s and decode was 20 tok/s, with only 48k context (older test with MTP, link in the comments)
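The percentages above follow directly from the before/after numbers. A minimal sketch of the arithmetic, using only the figures quoted in this thread (the function name `pct_increase` is just an illustrative helper):

```python
def pct_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# (baseline without TurboQuant, result with TurboQuant), as quoted in the thread.
results = {
    "prefill tok/s": (700, 1000),
    "decode tok/s": (20, 25),
    "context tokens": (48_000, 120_000),
}

for name, (before, after) in results.items():
    gain = pct_increase(before, after)
    print(f"{name}: {before} -> {after} ({gain:.1f}% increase)")
```

The "+" in "1000+" and "25+" means the real gains sit slightly above these floor values.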

llama.cpp TurboQuant flags:

-m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080

tested with a 27k prompt, 120k context loaded.
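If you want to reproduce the prefill and decode numbers on your own rig, here is a rough sketch. It assumes the flags above are passed to the llama.cpp server binary, and that the fork keeps mainline's `/completion` endpoint with its `timings` object (`prompt_n`, `prompt_ms`, `predicted_n`, `predicted_ms`). I have not verified that against the TurboQuant fork, so treat the endpoint and field names as assumptions. The sample numbers are illustrative, not measured.

```python
import json
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens/sec."""
    return n_tokens / (elapsed_ms / 1000.0)

def summarize_timings(timings: dict) -> dict:
    """Reduce a server `timings` object to prefill and decode throughput."""
    return {
        "prefill_tok_s": tokens_per_second(timings["prompt_n"], timings["prompt_ms"]),
        "decode_tok_s": tokens_per_second(timings["predicted_n"], timings["predicted_ms"]),
    }

def fetch_timings(prompt: str, n_predict: int = 128,
                  base_url: str = "http://127.0.0.1:8080") -> dict:
    """POST a prompt to the (assumed) /completion endpoint and return its timings."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["timings"]

# Illustrative values only: a 27k-token prompt processed in 27 s,
# then 250 tokens generated in 10 s.
example = {"prompt_n": 27000, "prompt_ms": 27000.0,
           "predicted_n": 250, "predicted_ms": 10000.0}
print(summarize_timings(example))
```

With the server running, `summarize_timings(fetch_timings(long_prompt))` would give the live numbers. Use a fresh prompt each run so prompt caching does not shrink `prompt_n`.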

-ngl 99 here isn't a typo, full 12B dense, every layer on GPU, on an 8GB card. that's the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card.

TurboQuant's KV cache savings are what free up the room to do that at 120k context.
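To get a feel for why the KV cache type matters at 120k context, here is a back-of-the-envelope estimator. Everything model-specific in it is an assumption: the layer count, KV head count and head size are placeholders (not Gemma 4's real config), and the 3 bits per value for `turbo3` is guessed from the name only. The one hard number is `q8_0`, which stores 32 values in 34 bytes, i.e. 8.5 bits per value. The formula also treats every layer as full attention, so it overestimates for models that use sliding-window layers.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bits_k: float, bits_v: float) -> float:
    """Approximate KV cache size in GiB for a full-attention transformer."""
    # One K value and one V value per (layer, kv head, head dim, position).
    values_per_side = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = values_per_side * (bits_k + bits_v)
    return total_bits / 8 / 2**30

# PLACEHOLDER architecture, for illustration only (not Gemma 4's real config).
arch = dict(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=120_000)

F16_BITS = 16.0
Q8_0_BITS = 8.5            # 34 bytes per block of 32 values
ASSUMED_TURBO3_BITS = 3.0  # ASSUMPTION: nominal value guessed from the name

f16_size = kv_cache_gib(**arch, bits_k=F16_BITS, bits_v=F16_BITS)
mixed_size = kv_cache_gib(**arch, bits_k=Q8_0_BITS, bits_v=ASSUMED_TURBO3_BITS)

print(f"f16 K + f16 V:     {f16_size:.2f} GiB")
print(f"q8_0 K + turbo3 V: {mixed_size:.2f} GiB")
print(f"ratio:             {mixed_size / f16_size:.2f}")
```

Whatever the real architecture numbers are, the ratio only depends on the bits per value, which is why swapping the cache types alone can move the reachable context so much.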

side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+

rig: RTX 4060 8GB · i7H · 16GB RAM

same two flags as yesterday, different model size:

--cache-type-k q8_0 --cache-type-v turbo3

thanks to TheTom/llama-cpp-turboquant, the TurboQuant fork of llama.cpp by Tom Turney (@no_stp_on_snek), for making this work.

links to unsloth's model quant on Hugging Face and the llama.cpp fork on GitHub are in the comments

Do you prefer a dense model or a MoE for your 8GB card?

GitHub - TheTom/llama-cpp-turboquant: LLM inference in C/C++
github.com/TheTom/llama-c…
Jun 8 4 tweets 3 min read
Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!

local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec.

and now, just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went up 25-40% on the same 8 GB VRAM setup!

Before MTP: 20 tps -> After MTP: 28 tps!

llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models.

By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context jumped to 25–27 tokens/sec. That's a 25-35% increase over the 20 tok/s baseline on the same hardware.

Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in.

You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies).
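For anyone new to drafting, here is a toy sketch of the general draft-and-verify idea behind speculative decoding. It is not llama.cpp's implementation and not how the Gemma 4 MTP head works internally: the two "models" are made-up functions over integers, decoding is greedy, and the confidence cutoff that `--spec-draft-p-min` suggests is left out. The point it shows is that the output matches what the big model would produce on its own, just in fewer big-model steps.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token (greedy)

def speculative_step(target: Model, draft: Model,
                     ctx: List[int], n_max: int) -> List[int]:
    """Draft up to n_max tokens, keep the prefix the target agrees with."""
    proposal: List[int] = []
    for _ in range(n_max):
        proposal.append(draft(ctx + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(ctx + accepted))  # all drafts accepted: one bonus token
    return accepted

def generate(target: Model, draft: Model, prompt: List[int],
             n_new: int, n_max: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verify steps were needed."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out += speculative_step(target, draft, out, n_max)
        steps += 1
    return out[: len(prompt) + n_new], steps

# Toy models: the target counts upward modulo 10; the drafter is wrong after a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

tokens, steps = generate(toy_target, toy_draft, prompt=[0], n_new=12, n_max=6)
print(tokens, "in", steps, "verify steps")
```

In a real engine the verification of all drafted tokens happens in one batched forward pass of the big model, which is where the speedup comes from. Here `steps` stands in for that count.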

copy and try the exact flags:

-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v

n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow.
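A small sketch for lining up the variants to benchmark. It only recombines the flags quoted above. The binary name `llama-server` is an assumption (the thread does not say which llama.cpp binary was used), and the script only prints the commands so you can run and time them yourself.

```python
from itertools import product
from typing import Iterator, List, Sequence

SERVER_BIN = "llama-server"  # ASSUMPTION: binary name is not given in the thread

# Flags copied from the thread; only n-max and p-min are varied.
BASE_ARGS = [
    "-m", "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf",
    "--spec-type", "draft-mtp",
    "--spec-draft-model", "gemma-4-26b-A4B-it-assistant-Q4_0.gguf",
    "-c", "64000",
]

def sweep(n_max_values: Sequence[int],
          p_min_values: Sequence[float]) -> Iterator[List[str]]:
    """Yield one argument list per (n-max, p-min) combination."""
    for n_max, p_min in product(n_max_values, p_min_values):
        yield BASE_ARGS + [
            "--spec-draft-n-max", str(n_max),
            "--spec-draft-p-min", str(p_min),
        ]

for args in sweep([4, 6], [0.7]):
    print(SERVER_BIN, " ".join(args))
```

Keep the prompt and context size fixed across runs so the decode tok/s numbers stay comparable.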

if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself.

Check it out even if you have a smaller or a larger GPU, such as a single rtx 3090, 4090, 3060, 2060.

MTP works for all gemma 4 sizes, such as gemma 4 12b, gemma 4 31b etc., but remember to grab the matching MTP drafter (assistant) model for each size.

what are you benchmarking today?

Gemma 4 26b a4b mtp assistant drafter model

huggingface.co/RachidAR/gemma…