In case anyone isn't familiar with llama.cpp and GGUF: basically, it lets you load part of the model into regular RAM if you can't fit all of it in VRAM, and then it splits the inference work between the CPU and the GPU. It is of course significantly slower than running the model entirely on the GPU, but depending on your use case it might be acceptable if you want to run larger models locally.
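
For anyone who wants to try this, a minimal sketch with the llama-cpp-python bindings looks roughly like the following (the model path and layer count are placeholders, and parameter names may differ between versions; the llama.cpp CLI equivalent is the -ngl / --n-gpu-layers flag):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-70b.Q4_K_S.gguf",  # placeholder path
    n_gpu_layers=40,  # offload this many transformer layers to VRAM; the rest run from RAM on the CPU
    n_ctx=4096,       # context window
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```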

However, since you can no longer use the "pick the largest quantization that fits in memory" logic, there are more choices to make when deciding which file to download. For example, I have 24GB of VRAM, so if I want to run a 70B model I could either use a Q4_K_S quant and perhaps fit 40/80 layers in VRAM, or a Q3_K_S quant and maybe fit 60 layers instead - but how does that affect speed and text quality? Then there are of course the IQ quants, which are supposedly higher quality than a similar-sized Q quant, but possibly a little slower.
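
As a back-of-the-envelope illustration of that trade-off (the bits-per-weight figures are rough approximations, and this ignores the KV cache and runtime overhead, so treat the output as a ballpark only):

```python
def layers_in_vram(params_b=70, n_layers=80, bits_per_weight=4.5,
                   vram_gb=24.0, headroom_gb=2.0):
    """Estimate how many of n_layers fit into vram_gb minus some headroom."""
    model_gb = params_b * bits_per_weight / 8     # total model size in GB
    gb_per_layer = model_gb / n_layers            # assume layers are roughly equal in size
    return int((vram_gb - headroom_gb) // gb_per_layer), model_gb

# Approximate bits per weight for the two quants mentioned above
for name, bpw in [("Q4_K_S", 4.5), ("Q3_K_S", 3.6)]:
    fit, total = layers_in_vram(bits_per_weight=bpw)
    print(f"{name}: ~{total:.0f} GB file, roughly {fit}/80 layers in 24 GB VRAM")
```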

In addition to the quantization choice, there are additional flags that affect memory usage. For example, I can opt not to offload the KQV cache, which would slow down inference - but perhaps it's a net gain if I can offload more model layers instead? And I can save some RAM/VRAM by using a quantized KV cache, probably with some quality loss, but I could use the savings to load a larger quant, and perhaps that would offset it.
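
For reference, these options look roughly like this through llama-cpp-python (parameter names are assumptions based on that binding and may differ by version; the llama.cpp CLI equivalents are --no-kv-offload and --cache-type-k / --cache-type-v):

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-70b.Q3_K_S.gguf",  # placeholder path
    n_gpu_layers=60,
    offload_kqv=False,                # keep the KV cache in RAM, freeing VRAM for more layers
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize the K cache to 8-bit
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # quantize the V cache to 8-bit
    flash_attn=True,                  # a quantized V cache generally requires flash attention
    n_ctx=8192,
)
```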

I was just wondering whether someone has already done experiments/benchmarks in this area - I didn't find any direct comparisons on search engines. I'm planning to run some benchmarks myself, but I'm not sure when I'll have time.
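
In case it's useful to anyone planning the same, a bare-bones timing loop for such a comparison could look like this (not a rigorous benchmark - paths and settings are placeholders, and a real test should also judge output quality, not just speed):

```python
import time
from llama_cpp import Llama

# Placeholder configurations to compare; add offload_kqv / cache-type variants as needed
configs = [
    {"model_path": "./models/llama-70b.Q4_K_S.gguf", "n_gpu_layers": 40},
    {"model_path": "./models/llama-70b.Q3_K_S.gguf", "n_gpu_layers": 60},
]

prompt = "Write a short story about a robot learning to paint."

for cfg in configs:
    llm = Llama(n_ctx=4096, verbose=False, **cfg)
    start = time.perf_counter()
    out = llm(prompt, max_tokens=256)
    elapsed = time.perf_counter() - start
    generated = out["usage"]["completion_tokens"]
    print(f"{cfg['model_path']}: {generated / elapsed:.2f} tokens/s")
    del llm  # free the model before loading the next one
```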

[–] Audalin 3 points 4 days ago

What I've ultimately converged to without any rigorous testing is:

  • using Q6 if it fits in VRAM+RAM (anything higher is a waste of memory and compute for barely any gain), otherwise either some small quant (rarely) or ignoring the model altogether - see the rough sizing sketch after this list;
  • not really using IQ quants - AFAIR they depend on a calibration dataset and I don't want the model's behaviour to be affected by some additional dataset;
  • other than the Q6 thing, in any trade-offs between speed and quality I choose quality - my usage volumes are low and I'd better wait for a good result;
  • I load as much as I can into VRAM, leaving 1-3GB for the system and context.
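
A rough sketch of that sizing heuristic (not the commenter's actual workflow; ~6.6 bits per weight for Q6_K is an approximation and real GGUF files vary):

```python
def fits_q6(params_b, vram_gb, ram_gb, headroom_gb=2.0, bits_per_weight=6.6):
    """Return (fits, estimated size in GB) for a Q6_K quant of a params_b-billion-parameter model."""
    size_gb = params_b * bits_per_weight / 8    # e.g. 70 * 6.6 / 8 is roughly 58 GB
    budget_gb = vram_gb + ram_gb - headroom_gb  # leave headroom for the system and context
    return size_gb <= budget_gb, size_gb

for params in (8, 32, 70):
    ok, size = fits_q6(params, vram_gb=24, ram_gb=32)
    print(f"{params}B @ Q6_K ~ {size:.0f} GB -> {'fits' if ok else 'use a smaller quant or skip'}")
```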