this post was submitted on 25 Jul 2024
642 points (99.1% liked)

[–] [email protected] 88 points 4 months ago (31 children)

I doubt this person actually had a computer that could run the 405B model. You need over 200 GB of RAM, let alone enough VRAM to run it with GPU acceleration.

[–] [email protected] 3 points 4 months ago (3 children)

Are there other, less RAM-hungry options?

[–] PumpkinEscobar 9 points 4 months ago (1 children)

There's quantization, which basically compresses the model by storing each weight in a smaller data type. That cuts memory requirements in half or even more.
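
As a rough illustration of what "smaller data type" means (plain NumPy, not any particular quantization library; real quantizers like GPTQ, AWQ, or GGUF k-quants are much smarter about preserving accuracy):

```python
import numpy as np

# Toy symmetric int8 quantization of one fp16 weight tensor.
weights = np.random.randn(4096, 4096).astype(np.float16)

scale = np.abs(weights).max() / 127.0                      # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float16) * scale                  # approximate reconstruction

print(f"fp16: {weights.nbytes / 2**20:.0f} MiB")            # ~32 MiB
print(f"int8: {q.nbytes / 2**20:.0f} MiB")                  # ~16 MiB (half)
print(f"max abs error: {np.abs(weights.astype(np.float32) - dequantized.astype(np.float32)).max():.4f}")
```

Store `q` plus `scale` instead of `weights` and you've halved the memory; 4-bit formats push it down to roughly a quarter.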

There's also airllm, which loads part of the model into RAM, runs those calculations, unloads that part, loads the next part, and so on. It's a nice option, but the performance of all that loading and unloading is never going to be great, especially on a huge model like Llama 405B.
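
For what it's worth, the core trick is just a load, compute, unload loop over the layers. This is not airllm's actual API, just a tiny PyTorch sketch of the idea with made-up toy layers:

```python
import torch
import torch.nn as nn

# Pretend each file holds one block of a model too big to fit in memory at once.
layer_files = []
for i in range(4):
    layer = nn.Linear(256, 256)
    path = f"layer_{i}.pt"
    torch.save(layer.state_dict(), path)
    layer_files.append(path)

x = torch.randn(1, 256)                      # hidden state flowing through the "model"
for path in layer_files:
    layer = nn.Linear(256, 256)              # allocate only one layer's worth of memory
    layer.load_state_dict(torch.load(path))
    with torch.no_grad():
        x = layer(x)
    del layer                                # drop it before loading the next layer

print(x.shape)
```

The disk I/O on every forward pass is why it gets painfully slow on something the size of 405B.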

Then there are some neat projects, like exo and petals, that distribute a model across multiple computers. They're more targeted at a p2p-style, loosely organized collection of machines. I've run petals on a small cluster and it works reasonably well.
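
If anyone wants to try petals, usage looks roughly like this (going from memory of its README, so treat the exact class and model names as approximate):

```python
# pip install petals
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"     # a model with a public petals swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)  # joins the swarm

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)   # most layers run on other peers' GPUs
print(tokenizer.decode(outputs[0]))
```

Your machine only holds a slice of the model; the rest of the forward pass happens on whoever else is in the swarm.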

[–] AdrianTheFrog 1 points 4 months ago

Yes, but 200 GB is probably already with 4-bit quantization; the weights in fp16 would be more like 800 GB. I don't know if it's even possible to quantize further, and if it is, you're probably better off going with a smaller model anyway.
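
The back-of-the-envelope math, for anyone curious (weights only; KV cache and activations come on top of this):

```python
params = 405e9                                # Llama 3.1 405B

for dtype, bytes_per_weight in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype}: ~{params * bytes_per_weight / 1e9:.0f} GB")
# fp16: ~810 GB, int8: ~405 GB, int4: ~203 GB
```

So yes, "just over 200 GB" already implies roughly 4-bit weights.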
