The Smaller Quant Won: Full GPU Offload Beat the Bigger Model on a 4GB Card

Q: What is `-ngl 99` actually doing?

`-ngl` is llama.cpp's number-of-GPU-layers flag. `99` is just "more than the model has," which means "offload every layer." It only helps if the fully-offloaded model plus its KV cache fits in VRAM — otherwise llama.cpp falls back or fails. With IQ3_XS it fits; with the bigger quant it wouldn't.

I run the reasoning brain for my whole operation on a Windows box with a 4GB AMD GPU. Not a rented H100. Not an API. A consumer card that most people would tell you can't hold a 7B model. For weeks it couldn't hold one well — and the fix turned out to be the opposite of what instinct says. I made the model smaller and it got 2.5x faster.

Here's the tuning log, with the real numbers, because this is the single highest-leverage thing I've learned running local LLMs in production.

The setup

The model is DeepSeek-R1-Distill-Qwen-7B, served by llama.cpp's llama-server.exe inside a Docker container, wrapped as a Windows service (NSSM) so it survives reboots. It listens on port 1234 with an OpenAI-compatible API, and my Mac Studio talks to it over the LAN as the "reasoning" endpoint. The Mac handles orchestration and vision (DeepSeek-VL2-small on MLX, port 9445); the Windows box handles text reasoning. Two machines, one job each, $0 marginal cost per call.

The GPU is the constraint. It has 4GB of VRAM. A 7B model at Q4_K_M is roughly 4.4GB of weights before you even add the KV cache — so it does not fit.

The trap: crank the layers, spill to CPU

llama.cpp lets you decide how many transformer layers to push onto the GPU with -ngl. The obvious move is to push as many as fit. So I did, in stages:

7 layers on GPU: 12.6 tok/s
16 layers on GPU: 15.9 tok/s
20 layers on GPU: 17.8 tok/s

Every layer I moved onto the card bought me a few tokens per second. But there are 29 layers total, and 20 was the ceiling — the rest had to stay in system RAM and run on the CPU. That split is the whole problem. When even a handful of layers live on the CPU, every single token has to cross the PCIe bus and wait on the slow path. The GPU spends its time idle, waiting for the CPU-resident layers to finish. Partial offload isn't "most of the speed" — it's bottlenecked to the slowest layers.

I was staring at 17.8 tok/s and treating it as the physics of a 4GB card. It wasn't.

The fix: shrink the model until the whole thing fits

Instead of asking "how many layers of the big quant can I fit," I asked "which quant is small enough that all 29 layers fit at once." The answer was an imatrix IQ3_XS build of the same model — 3.19GB on disk instead of 4.4GB.

New config: -c 8192 -fa on -ctk q8_0 -ctv q8_0 --cache-ram 2048 -ngl 99. That -ngl 99 means "offload everything." With IQ3_XS it actually succeeds: 29/29 layers on the GPU. Vulkan reports the model at 2962MB and the quantized KV cache at 238MB — about 3.2GB total, which slots into 4GB with room to breathe.

Result: 31.9 tok/s. From a 12.6 tok/s baseline, that's +153%. From my best partial-offload run of 17.8, it's +79%. The smaller model on a worse-sounding config crushed the bigger model on the "optimized" config, because it killed the CPU-layer bottleneck entirely.

But does the smaller quant still think?

This is the fair objection. IQ3_XS is aggressive — 3-bit-ish weights. If it fell apart on reasoning, the speed would be worthless. So I ran it against multi-step problems I already knew the answers to: a couple of arithmetic word problems (467, 432, 81), the classic train-relative-speed question, and the cats-and-mice puzzle. It went 5/5. The imatrix calibration is doing its job — it protects the weights that matter most for coherence.

One caveat I'll pass on: R1-style models emit their chain-of-thought into a separate reasoning_content field and put the clean answer in content. On a low-bit quant the model will occasionally over-reason and spiral on adversarial trick questions, so I keep max_tokens at 2000 or higher to give it room to finish thinking. Don't starve it and then blame the quant.

The principle

Fitting the entire model on the GPU beats fitting a bigger, higher-precision model most of the way. The instinct is "more bits = better, don't compromise the weights." But a model that spills to CPU pays a latency tax on every token, and that tax dwarfs the quality difference between Q4 and IQ3 for a distilled 7B. The right question on a small card isn't "what's the highest precision I can partially load" — it's "what's the largest quant that fully offloads."

I kept the Q4_K_M build on disk as a quality-max fallback in case IQ3 ever disappoints on a real task. It hasn't yet. And the practical win is that my reasoning endpoint now answers fast enough to sit inside a live workflow instead of being something I batch overnight — all on a card people said was too small.

FAQ

Why is a smaller quant faster if it's the same model?

Because speed on a memory-constrained GPU is dominated by whether the whole model fits in VRAM. The IQ3_XS build (3.19GB) fits all 29 layers on a 4GB card; the Q4_K_M build (~4.4GB) doesn't, so some layers run on the CPU and every token waits on the PCIe round-trip. Removing the CPU-resident layers removed the bottleneck.

What is `-ngl 99` actually doing?

-ngl is llama.cpp's number-of-GPU-layers flag. 99 is just "more than the model has," which means "offload every layer." It only helps if the fully-offloaded model plus its KV cache fits in VRAM — otherwise llama.cpp falls back or fails. With IQ3_XS it fits; with the bigger quant it wouldn't.

Does IQ3_XS hurt reasoning quality?

In my testing it held up — 5/5 on multi-step arithmetic and logic puzzles I knew the answers to, thanks to imatrix calibration that protects the most important weights. The one thing to watch is giving it enough max_tokens (I use 2000+) so a chain-of-thought model can finish reasoning instead of getting cut off.

Can I really run production AI on a 4GB GPU?

For a distilled 7B reasoning model, yes — once you stop fighting the VRAM limit and pick a quant that fully offloads. 31.9 tok/s is fast enough for interactive use. The lesson generalizes: match the quant to the card, don't force a card to swallow a quant it can't hold.

The Smaller Quant Won: Full GPU Offload Beat the Bigger Model on a 4GB Card

The setup

The trap: crank the layers, spill to CPU

The fix: shrink the model until the whole thing fits

But does the smaller quant still think?

The principle

FAQ

Why is a smaller quant faster if it's the same model?

What is `-ngl 99` actually doing?

Does IQ3_XS hurt reasoning quality?

Can I really run production AI on a 4GB GPU?

Related

How a Smaller Quant Made My Local LLM 2.5x Faster: Fitting DeepSeek-R1 Entirely on a 4GB GPU

Own Your Inference or Watch the Bills Climb

Build Your Own Semantic Search: A Local Embeddings Engine on Apple Silicon

The setup

The trap: crank the layers, spill to CPU

The fix: shrink the model until the whole thing fits

But does the smaller quant still think?

The principle

FAQ

Why is a smaller quant faster if it's the same model?

What is -ngl 99 actually doing?

Does IQ3_XS hurt reasoning quality?

Can I really run production AI on a 4GB GPU?

Related

How a Smaller Quant Made My Local LLM 2.5x Faster: Fitting DeepSeek-R1 Entirely on a 4GB GPU

Own Your Inference or Watch the Bills Climb

Build Your Own Semantic Search: A Local Embeddings Engine on Apple Silicon

What is `-ngl 99` actually doing?