this post was submitted on 18 Sep 2024

LocalLLaMA




Mistral Small 22B just dropped today and I am blown away by how good it is. I was already impressed with Mistral NeMo 12B's abilities, so I didn't know how much better a 22B could be. It passes really tough obscure trivia that NeMo couldn't, and its reasoning abilities are even more refined.

With Mistral Small I have finally reached the plateau of what my hardware can handle for my personal use case. I need my AI to generate at roughly my base reading speed at minimum. The lowest I can tolerate is about 1.5 T/s; anything lower than that is unacceptable. I really doubted that a 22B could even run on my measly Nvidia GTX 1070 card with 8GB of VRAM and 16GB of DDR4 RAM. NeMo ran at about 5.5 T/s on this system, so how would Small do?

Mistral Small Q4_K_M runs at 2.5 T/s with 28 layers offloaded onto VRAM. As context increases, that number drops to 1.7 T/s. It is absolutely usable for real-time conversation. I would like the token speed to be faster, sure, and I have considered going with the lowest recommended Q4 quant to help balance the speed a little. However, I am very happy just to have it running and actually usable in real time. It's crazy to me that such a seemingly advanced model fits on my modest hardware.
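For anyone wondering how a layer count like that relates to an 8GB card, here is a rough back-of-the-envelope sketch. It assumes every layer takes an equal share of the GGUF file, which is only approximately true, and the numbers in the usage line (a file of about 13.3 GB, 56 layers, and about 6.7 GB of VRAM left for weights after context and buffers) are my assumptions, not measurements. The function name `layers_that_fit` is made up for illustration.

```python
import math


def layers_that_fit(model_size_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumes each layer takes an equal share of the model file, which
    ignores embeddings and per-layer size differences.
    """
    per_layer_gb = model_size_gb / n_layers
    # Never report more layers than the model actually has.
    return min(n_layers, math.floor(vram_budget_gb / per_layer_gb))


# Assumed values: ~13.3 GB file, 56 layers, ~6.7 GB of VRAM usable for weights.
print(layers_that_fit(13.3, 56, 6.7))
```

Treat the result as a starting point for the offload setting and adjust up or down depending on what the loader actually reports.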

I'm a little sad now though, since this is as far as I think I can go on the AI self-hosting frontier without investing in a beefier card. Do I need a bigger, smarter model than Mistral Small 22B? No. Hell, NeMo was serving me just fine. But now I want to know just how smart the biggest models get. I caught the AI Acquisition Syndrome!

[–] brucethemoose 2 points 1 month ago* (last edited 1 month ago) (1 children)

A Qwen 2.5 14B IQ3_M should completely fit in your VRAM, with longish context, with acceptable quality.

An IQ4_XS will just barely overflow but should still be fast at short context.
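A quick way to sanity-check estimates like these is to multiply the parameter count by the quant's bits per weight. The sketch below is only an approximation: the bits-per-weight figures, the 14.8 billion parameter count, and the 1 GiB reserved for KV cache and buffers are all assumptions, and `fits_in_vram` is a hypothetical helper, not part of any tool.

```python
GIB = 1024 ** 3

# Approximate bits per weight for each quant type (assumed values;
# real files vary because some tensors are kept at higher precision).
APPROX_BPW = {"IQ3_M": 3.66, "IQ4_XS": 4.25}


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def fits_in_vram(n_params: float, quant: str, vram_gib: float,
                 reserve_gib: float = 1.0) -> bool:
    """True if the weights plus a reserve for KV cache and buffers fit.

    reserve_gib is an assumed allowance and grows with context length.
    """
    needed = weight_bytes(n_params, APPROX_BPW[quant]) + reserve_gib * GIB
    return needed <= vram_gib * GIB


# Assumed: ~14.8e9 parameters for a "14B" model, 8 GiB card.
for quant in APPROX_BPW:
    size_gib = weight_bytes(14.8e9, APPROX_BPW[quant]) / GIB
    print(quant, round(size_gib, 2), "GiB, fits:", fits_in_vram(14.8e9, quant, 8))
```

Under those assumptions the smaller quant fits with room to spare and the larger one lands slightly over, which is why freeing every bit of VRAM matters.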

And while I have not tried it yet, the 14B is allegedly smart.

Also, what I do on my PC is hook up my monitor to the iGPU so the GPU's VRAM is completely empty, lol.

[–] Smokeydope 2 points 1 month ago* (last edited 1 month ago) (1 children)

Hey @brucethemoose, hope you don't mind if I ping you one more time. Today I loaded up Qwen 14B and 32B. Yes, 32B (Q3_K_S). I didn't do much testing with the 14B, but it spoke well and fast. To be honest, I was more excited to play with the 32B once I found out it would run. It just barely makes the mark of tolerable speed at just under 2 T/s (really more like 1.7 with some context loaded in). I really do mean barely; the people who think 5 T/s is slow would eat their hearts out. That reasoning and coherence, though? Off the charts. I like the way it speaks more than Mistral Small too. So wow, just wow, is all I can say. I can't believe all the good models that came out in such a short time and the leaps made in the past two months. Thank you again for recommending Qwen. I don't think I would have tried the 32B without your input.

[–] brucethemoose 1 points 1 month ago* (last edited 1 month ago)

Good! Try the IQ quantizations (M, XS, and XXS) as well, especially if you try a 14B, as they "squeeze" the model into less space better than the Q3_K quantizations.

Yeah, I'm liking the 32B as well. If you are looking for speed just for utilitarian Q/A, you might want to keep a DeepSeek Coder V2 Lite GGUF on hand, as it's uber fast when partially offloaded.