this post was submitted on 28 Feb 2024
LocalLLaMA
Community to discuss LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
They said theirs is "comparable with the 8-bit models". It's all tradeoffs. It isn't clear to me where you should allocate your compute/memory budget. I've noticed that full 16-bit 7B models often produce better results for me than some much larger quantized models. It will be interesting to find the sweet spot.
I can't find that mention of "8-bit models" anywhere in the paper. Skimming it again, I only see references and comparisons to FP16.
I know these discussions from llama.cpp and ggml quantization. There you can quantize a model further and further, and it gets worse the lower the precision gets. You can counter that by using a larger model that was more "intelligent" in the first place... From that you can work out the sweet spot: what gives you the best quality at a certain compute cost or size... A more degraded bigger model, or a less degraded smaller model...
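To put rough numbers on that "bigger but more degraded vs. smaller but less degraded" comparison, here is a minimal sketch of the size side of it. It only counts the weights (parameter count times bits per weight) and ignores the KV cache, activations, and the per-block scale factors that real quantization formats store, so actual files come out somewhat larger. The function name and the candidate list are made up for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Real quantized formats add overhead (scales, some tensors kept at higher
    precision), so treat this as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Illustrative candidates only: (label, parameter count, bits per weight)
candidates = [
    ("7B @ 16-bit", 7e9, 16),
    ("13B @ 8-bit", 13e9, 8),
    ("13B @ 4-bit", 13e9, 4),
]

for label, n_params, bits in candidates:
    print(f"{label}: {weight_memory_gib(n_params, bits):.1f} GiB")
```

By this estimate a 16-bit 7B model and an 8-bit 13B model land in roughly the same memory range, and a 4-bit 13B model needs about half of that. Which one answers better is the part you can't compute and have to measure.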
But we don't have different quantization levels here, just one. And it's also difficult to compare: with ggml you take the same model and quantize it to different levels... We don't have that here either. You can't take an existing model, quantize it with this approach and compare it to another... You have to train a new model from scratch. And then it's a different model.
I can't find a good analogy here... Maybe it's a bit like asking if the file size of a JPEG image is more important than the resolution... It's kind of the wrong question. You can compare different compression levels of the JPEG image, or compare the size of the JPEG to a BMP file... It's really not a good analogy, but a BMP file with 20 times the size looks exactly like a smaller JPEG file on the screen. And you can also have a 7B parameter LLM give better answers than a poor (or older) 13B model. It's neither just parameter count nor precision alone.
So if they say they can make do with less than a third of the RAM and compute time and simultaneously score a tiny bit higher in the benchmarks, I don't see a tradeoff here.
Generally speaking you can ask the question: What delivers the best results at a given compute cost? Or the other way around: What has the lowest cost to arrive at a certain point? But this is kind of a different technique: same parameter count, same results, but significantly lower computing cost on inference.
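Those two questions can be written down as two small selection functions. This is just a sketch: `Option`, both function names, and the numbers in `options` are placeholders invented for illustration, not measurements of any real model.

```python
from typing import NamedTuple, Optional, Sequence


class Option(NamedTuple):
    name: str
    cost: float     # any consistent unit, e.g. GiB of RAM or seconds per token
    quality: float  # any benchmark score where higher is better


def best_within_budget(options: Sequence[Option], budget: float) -> Optional[Option]:
    """What delivers the best results at a given cost?"""
    affordable = [o for o in options if o.cost <= budget]
    return max(affordable, key=lambda o: o.quality, default=None)


def cheapest_reaching(options: Sequence[Option], target: float) -> Optional[Option]:
    """What has the lowest cost to arrive at a certain quality?"""
    good_enough = [o for o in options if o.quality >= target]
    return min(good_enough, key=lambda o: o.cost, default=None)


# Placeholder values only, NOT real benchmark results.
options = [
    Option("A", cost=4.0, quality=0.60),
    Option("B", cost=7.0, quality=0.70),
    Option("C", cost=13.0, quality=0.72),
]

print(best_within_budget(options, budget=8.0))
print(cheapest_reaching(options, target=0.65))
```

A technique that lowers the cost of an option without lowering its quality doesn't move you along this tradeoff. It replaces the option with a strictly better one.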
(And reading all the speculation elsewhere: There might be a different tradeoff. The authors didn't talk about training and just made very small models. A more complex and expensive training process could be a tradeoff.)
Apparently I am an idiot and read the wrong paper. It's the previous paper that mentions being "comparable with the 8-bit models":
https://huggingface.co/papers/2310.11453