ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.

It is a highly optimized model loader for GPTQ models. It's an alternative to options like AutoGPTQ or GPTQ-for-LLaMA, and provides faster text generation speeds.

With this update, anyone running GGML exclusively might find some interesting results by switching over to a GPTQ-quantized model and testing the changes. I haven't had a chance to yet myself, but I will post some of my own benchmarks and results if I find the time.

I for one am excited to see the efficiency battles begin. Getting compute down is going to be one of the most important hurdles to overcome.

Blaed 3 points 2 years ago (last edited 2 years ago)

A comment from a Reddit user (Fuzzlewhumper) regarding these changes:

What would take me 2-3 minutes of wait time for a GGML 30B model takes 6-8 seconds pause followed by super fast text from the model - 6-8 tokens a second at least. Faster than I normally type. Yup, had it describe the characters, big old paragraph, 7.41 tokens on my 2015 machine with 32gb memory, I7-6700, and a couple cheap 3060 RTX cards. SCORE.

I would be curious to see if the efficiency change is that drastic. I will do my best to include my findings in the larger model benchmark post I am piecing together.
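For anyone who wants to compare their own setup in the same tokens-per-second terms, here is a rough sketch of how a single run could be timed. Note that `generate` is a hypothetical stand-in for whatever function your backend exposes to produce tokens from a prompt; it is not the ExLlama API, and the `clock` parameter is only there so the timing can be swapped out.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[str], Sequence],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one call to `generate` and return generated tokens per second.

    `generate` is a hypothetical callable (an assumption for this sketch)
    that takes a prompt and returns the sequence of generated tokens.
    """
    start = clock()
    tokens = generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(tokens) / elapsed
```

This measures the whole call, so any pause before the first token (like the 6-8 seconds mentioned above) is included in the result. Averaging over several runs and prompts would give a fairer number than a single measurement.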