For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models like llama3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder at full 16-bit floating point, and those performed really well too on my 6GB VRAM GPU (GTX 1060).

So now I am wondering: Should I pull strong quants of big models, or light quants / unquantised fp16 versions of smaller models?

What are your experiences with strong quants? I saw a video by that technovangelist guy on youtube and he said that sometimes even 2bit quants can be perfectly fine.

UPDATE: Woah, I just tried llama3.1 8B Q4 on ollama again, and what a WORLD of difference compared to llama3.2 3B at fp16!

The difference is super massive. The 3B and 1B llama3.2 models seem to be mostly good at summarizing text and maybe generating some JSON based on previous input. But the bigger 3.1 8B model can actually be used in a chat environment! It has a good response length (about 3 lines per message) and it doesn't stretch out its answer. It seems like a really good model and I will now use it for more complex tasks.


[–] [email protected] 5 points 3 months ago (1 children)

The technology for quantisation has improved a lot this past year making very small quants viable for some uses. I think the general consensus is that an 8bit quant will be nearly identical to a full model. Though a 6bit quant can feel so close that you may not even notice any loss of quality.

Going smaller than that is where the real trade off occurs. 2-3 bit quants of much larger models can absolutely surprise you, though they will probably be inconsistent.

So it comes down to the task you're trying to accomplish. If it's programming related, go 6bit and up for consistency, with the largest coding model you can fit. If it's creative writing or something similar, a much lower quant of a larger model is the way to go in my opinion.

[–] [email protected] 2 points 3 months ago (1 children)

Hmm, so what you're saying is that for creative generations one should use big parameter models with strong quants but when good structure is required, like with coding and JSON output, we want to use a large quant of a model which actually fits into our VRAM?

I'm currently testing JSON output, so I guess a small Qwen model it is! (they advertised good JSON generations)

Does the difference between fp8 and fp16 influence the structure strongly, or are fp8 models fine for structured content?

[–] [email protected] 1 points 3 months ago (1 children)

fp8 would probably be fine, though the method used to make the quant would greatly influence that.

I don't know exactly how Ollama works, but I'd think a better fit would be one of these quants:

https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF

A GGUF model would also allow some overflow into system ram if ollama has that capability like some other inference backends.

[–] [email protected] 2 points 3 months ago (1 children)

Ollama does indeed have the ability to share the memory between VRAM and RAM, but I always assumed it wouldn't make sense, since it would massively slow down the generation.

I think ollama already uses GGUF, since that is how you import the model from HF to ollama anyway, you gotta use the *.GGUF file.

As someone who has experience with shader development in GLSL, I know very well that communication between the GPU and CPU is super slow, and sending data from the GPU to the CPU is a pretty heavy task. So I just assumed it wouldn't make any sense. I will try a full 7B model (fp16) now using my 32GB of normal RAM to check out the speed. I'll edit this comment once I'm done and share results.

[–] [email protected] 1 points 3 months ago (1 children)

With modern methods sometimes running a larger model split between GPU/CPU can be fast enough. Here's an example https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl

[–] [email protected] 1 points 3 months ago (2 children)

oooh a windows only feature, now I see why I haven't heard of this yet. Well, too bad I guess. It's time to switch to AMD for me anyway...

[–] fhein 2 points 3 months ago

The article is written in a slightly confusing way, but you'll most likely want to turn off Nvidia's automatic VRAM swapping if you're on Windows, so it doesn't happen by accident. Partial offloading with llama.cpp is much faster AFAIK if you want to split the model between GPU and CPU, and it's easier to find out how many layers you can offload, because the model simply fails to load when you set the number too high.

Also, if you want to experiment with partial offloading, maybe a 12B at around Q4 would be more interesting than the same 7B model with higher precision? I haven't checked if anything new has come out in the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit context to 4k or something.
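
A rough way to pick a starting layer count before trial and error, as a minimal sketch: it assumes all layers are about the same size and that some VRAM stays reserved for context and buffers. The function name, the 1.5 GiB reserve, and the example numbers (a ~7 GiB model file with 40 layers) are illustrative assumptions, not measurements.

```python
def max_gpu_layers(model_gib: float, n_layers: int, vram_gib: float,
                   reserve_gib: float = 1.5) -> int:
    """Rough upper bound on how many layers fit in VRAM.

    Assumes every layer takes an equal share of the model file and that
    `reserve_gib` is kept free for the KV cache and scratch buffers
    (both are simplifications).
    """
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical example: ~7 GiB quantized file, 40 layers, 6 GiB card.
print(max_gpu_layers(model_gib=7.0, n_layers=40, vram_gib=6.0))
```

The result is only a first guess for llama.cpp's `-ngl` / `--n-gpu-layers` setting; lower it if the model fails to load.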

[–] [email protected] 1 points 3 months ago

Oh, that part is. But the splitting tech is built into llama.cpp

[–] [email protected] 2 points 3 months ago* (last edited 3 months ago)

A 2bit or 3bit quantization is quite some trade-off. At 2bit, it'll probably be worse than a smaller model with a lighter quantization at the same effective size.

There is a sweet spot somewhere between 4 to 8 bit(?). And more than 8bit seems to be a waste, it seems indistinguishable from full precision.

General advice seems to be: Take the largest model you can fit at somewhere around 4bit or 5bit.

The official way to compare such things is to calculate the perplexity for all of the options and choose the one with the smallest perplexity that fits.
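
For reference, perplexity is the exponential of the average negative log-likelihood per token, so that comparison can be sketched roughly as below. The `pick_best` helper and its tuple layout are hypothetical; the per-token log-probabilities would have to come from whatever inference backend you use, measured on the same text for every option.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_best(candidates, budget_gib):
    """Pick the lowest-perplexity option that fits the memory budget.

    candidates: iterable of (name, size_gib, perplexity) tuples
    (hypothetical layout). Returns None if nothing fits.
    """
    fitting = [c for c in candidates if c[1] <= budget_gib]
    return min(fitting, key=lambda c: c[2]) if fitting else None


# Sanity check: a model that assigns probability 0.25 to every token
# has a perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```

Perplexity depends on the tokenizer, so the numbers are most meaningful between quants of the same model, or at least between models that share a tokenizer.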

And by the way: I don't really use the tiny models like 3B parameters. They write text, but they don't seem to be able to store a lot of knowledge. And in turn they can't handle any complex questions and they generally make up a lot of things. I usually use 7B to 14B parameter models. That's a proper small model. And I stick to 4bit or 5bit quants for llama.cpp

Your graphics card should be able to run an 8B parameter LLM (4-bit quantized). I'd prefer that to a 3B one, it'll be way more intelligent.
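
A back-of-the-envelope check of that, as a sketch: the weights take roughly parameters × bits per weight / 8 bytes. The 4.5 bits per weight for a "4-bit" quant and the 1 GiB of headroom for context and buffers are assumptions for illustration; real files and runtimes vary.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float, vram_gib: float,
         overhead_gib: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM."""
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


# 6 GiB card, as in the original post.
for label, params, bpw in [("8B at ~4.5 bpw", 8, 4.5), ("8B at 8 bpw", 8, 8)]:
    size = weight_gib(params, bpw)
    print(f"{label}: {size:.2f} GiB, fits in 6 GiB: {fits(params, bpw, 6.0)}")
```

Under these assumptions the 8B model at about 4.5 bits per weight fits on a 6 GiB card, while the same model at 8 bits does not.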

[–] j4k3 2 points 3 months ago (1 children)

I prefer a middle ground. My favorite model is still the 8x7B Mixtral, specifically the flat/dolphin/maid uncensored model. Llama 3 can be better in some areas, but its alignment is garbage in many areas.

[–] [email protected] 2 points 3 months ago (2 children)

Yeaaa those models are just too large for most people... You gotta have 56GB of VRAM to run an 8bit quant, which most people don't have a quarter of.

Also, what specifically do you mean by alignment? Are you talking about finetuning or instruction alignment?

[–] fhein 1 points 3 months ago (1 children)

Mixtral in particular runs great with partial offloading, I used a Q4_K_M quant while only having 12GB VRAM.

To answer your original question, I think it depends on the model and use case. Complex logic such as programming seems to suffer the most from quantization, while RP/chat can take much heavier quantization while staying coherent. I think most people find that quantization around 4-5 bpw gives the best value, and you really get diminishing returns over 6 bpw, so I know few who think it's worth using 8 bpw.

Personally I always use the largest models I can. With Q2 quantization the 70B models I've used occasionally give bad results, but often they feel smarter than 35B Q4. Though it's ofc. difficult to compare models from completely different families, e.g. command-r vs llama, and there are not that many options in the 30B range. I'd take a 35B Q4 over a 12B Q8 any day though, and 12B Q4 over 7B Q8 etc. In the end I think you'll have to test yourself, and see which model and quant combination you think gives the best result at an inference speed you consider usable.

[–] [email protected] 1 points 3 months ago

Pulled a 7B Q4 model just now and woah, yeah, they really are a lot better. I guess the smaller models really are just for devices with less than 1 GB of RAM to spare... Like ma phone, which runs Llama3.2 3B just fine...

[–] brucethemoose 2 points 2 months ago* (last edited 2 months ago) (1 children)

I used to have a 6GB GPU, and around 7B is the sweet spot. This is still the case with newer models; you just have to pick the right one.

Try an IQ4 quantization of Qwen 2.5 7B coder.

Below 3bpw is where it starts to not be worth it, since we have so many open weights available these days. A lot of people are really stubborn and run 2-3bpw 70B quants, but they are objectively worse than a similarly trained 32B model in the same space, even with exotic, expensive quantization like VPTQ or AQLM: https://huggingface.co/VPTQ-community

[–] [email protected] 1 points 2 months ago (1 children)

Is this VPTQ similar to that 1.58-bit quant I've heard about? Where they quantized Llama 8B down to just 1.5 bits and it somehow was still rather comprehensible?

[–] brucethemoose 2 points 2 months ago

No, from what I've seen it still falls off below 4bpw (just less steeply than other quantization methods) and makes ~2.25-bit quants somewhat usable instead of totally impractical, much like AQLM.

You are thinking of BitNet, which (so far, though there haven't been many attempts) requires models to be trained from scratch that way to be effective.