this post was submitted on 17 Jun 2023
8 points (90.0% liked)

Free Open-Source Artificial Intelligence

2900 readers
1 users here now

Welcome to Free Open-Source Artificial Intelligence!

We are a community dedicated to forwarding the availability and access to:

Free Open Source Artificial Intelligence (F.O.S.A.I.)

More AI Communities

LLM Leaderboards

Developer Resources

GitHub Projects

FOSAI Time Capsule

founded 1 year ago
MODERATORS
 

https://github.com/oobabooga/text-generation-webui/commit/9f40032d32165773337e6a6c60de39d3f3beb77d

ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.

It is highly optimized model loader for GPTQ models. It's an alternative to options like AutoGPTQ or GPTQ-for-LLaMA, and provides faster text generation speeds.

With this update, anyone running GGML exclusively might find some interesting results switching over to a quantized model and testing the changes. I haven't had a chance yet myself, but I will post some of my own benchmarks and results if I find the time for it.

I for one am excited to see the efficiency battles begin. Getting compute down is going to be the most important hurdles to overcome.

you are viewing a single comment's thread
view the rest of the comments
[–] Blaed 2 points 1 year ago* (last edited 1 year ago)

I'm with you there. I love how Mosaic just fed the entire Great Gatsby to StoryWriter. This is the sort of context length I need in my life. Would make my projects so much easier. I don't think we're too far from having it on consumer hardware.

You should check out my latest post - which ironically addresses parts of your first comment, but you still need a lot of VRAM... 6000+ tokens context is now possible with ExLlama.

It's crazy to see how fast these developments are happening!