I doubt this person actually had a computer that could run the 405B model. You need over 200 GB of RAM just to hold it, let alone enough VRAM to run it with GPU acceleration.
Are there other options that use less RAM?
There's quantization, which basically compresses the model by storing each weight in a smaller data type. That cuts memory requirements in half or even more.
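To give a sense of what that means in practice, here's a rough numpy sketch of naive 4-bit quantization. Real schemes (GPTQ, AWQ, the GGUF k-quants) use per-group scales and cleverer rounding, but the memory math is the same idea:

```python
import numpy as np

# Toy symmetric, per-tensor 4-bit quantization of an fp16 weight matrix
weights = np.random.randn(4096, 4096).astype(np.float16)

scale = np.abs(weights).max() / 7                      # map values into [-7, 7]
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Pack two 4-bit values per byte to get the true 4-bit footprint
nibbles = (q & 0x0F).astype(np.uint8)
packed = (nibbles[:, ::2] << 4) | nibbles[:, 1::2]

print(f"fp16 : {weights.nbytes / 2**20:.0f} MiB")      # 32 MiB
print(f"4-bit: {packed.nbytes / 2**20:.0f} MiB")       # 8 MiB

# At inference time the weights get expanded back using the stored scale
dequantized = q.astype(np.float16) * scale
```

A 4096x4096 fp16 matrix drops from 32 MiB to 8 MiB; scale that up to hundreds of billions of weights and you get the savings people rely on, at some cost in accuracy.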
There's also airllm, which loads part of the model into RAM, runs those calculations, unloads that part, loads the next part, and so on. It's a nice option, but the performance of all that loading and unloading is never going to be great, especially on a huge model like Llama 405B.
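The underlying trick is simple even though airllm wraps it up nicely. Here's a conceptual PyTorch sketch of layer streaming (this is not airllm's actual API, and it assumes each transformer layer was saved to its own file ahead of time):

```python
import torch

NUM_LAYERS = 126  # Llama 3.1 405B has 126 transformer layers

def forward_streaming(hidden_states: torch.Tensor) -> torch.Tensor:
    """Run one forward pass while only ever holding one layer in memory."""
    for i in range(NUM_LAYERS):
        # Load just this layer's weights from disk (hypothetical filenames)
        layer = torch.load(f"layer_{i:03d}.pt", map_location="cpu")
        with torch.no_grad():
            hidden_states = layer(hidden_states)
        # Drop the layer before loading the next one, so peak RAM is ~one layer
        del layer
    return hidden_states
```

Peak memory is roughly one layer (a few GB for 405B) instead of the whole model, but every forward pass pays the cost of reading hundreds of GB off disk, which is why it's so slow.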
Then there are some neat projects, like exo and petals, that distribute a model across multiple computers. They're aimed more at a p2p-style, loose collection of machines than a proper cluster. I've run petals in a small cluster and it works reasonably well.
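The client side of petals is only a few lines. This is roughly what it looked like when I ran it, going from memory of the Petals README, so treat the class and model names as approximate:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Example model that was hosted on the public swarm; any swarm-hosted model works
model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Each forward pass gets routed through whichever peers host the needed layers
inputs = tokenizer("A question for the swarm:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```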
Yes, but 200 GB is probably already with 4-bit quantization; the weights in fp16 would be more like 800 GB. I don't know if it's even possible to quantize further, and if it is, you're probably better off going with a smaller model anyway.
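The back-of-the-envelope math for weight storage alone (no KV cache or activations):

```python
params = 405e9  # Llama 3.1 405B

for bits in (16, 8, 4, 2):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit: {gib:,.0f} GiB")
# 16-bit: 754 GiB, 8-bit: 377 GiB, 4-bit: 189 GiB, 2-bit: 94 GiB
```

So ~200 GB lines up with a 4-bit quant, and even a hypothetical 2-bit quant would still need ~95 GB for the weights alone.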