this post was submitted on 17 Jun 2023
12 points (92.9% liked)

LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.


Hey, I'm working on some local LLM applications and my goal is to run the smallest model possible without crippling performance. I'm already using 4 bit GPTQ but I want something smaller. These models have been trained on such a massive amount of data but my specific use case only touches a very very small fraction of that, so I would imagine it's possible to cut away large chunks of the model that I don't care about. I'm wondering if there has been any work on runtime pruning of LLMs (not just static pruning based on model weights) based on "real world" data. Something like: you run the model a bunch of times with your actual data and monitor the neuron activations to inform some kind of pruning process. Does anyone here know about something like that?
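
To make it concrete, here's roughly the kind of thing I have in mind (just a sketch with PyTorch + Hugging Face transformers; the model name, the calibration text, and the 10% cutoff are placeholders, not something I've validated):

```python
# Rough sketch (not an existing library feature): run the model on your own data,
# record how strongly each neuron fires, and flag the quietest ones as pruning
# candidates. Model name, calibration_texts, and the 10% cutoff are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"  # placeholder: any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

calibration_texts = ["a prompt from my actual workload"]  # placeholder data

activation_sums = {}  # per-layer sum of |activation| for each output feature
token_counts = {}     # per-layer number of tokens seen

def make_hook(name):
    def hook(module, inputs, output):
        # output: (batch, seq, features) -> accumulate |activation| per feature
        flat = output.detach().abs().reshape(-1, output.shape[-1])
        activation_sums[name] = activation_sums.get(name, 0) + flat.sum(dim=0)
        token_counts[name] = token_counts.get(name, 0) + flat.shape[0]
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

with torch.no_grad():
    for text in calibration_texts:
        model(**tokenizer(text, return_tensors="pt"))

for h in handles:
    h.remove()

# Neurons whose average activation falls in the bottom 10% are pruning candidates.
for name, total in activation_sums.items():
    mean_act = total / token_counts[name]
    cutoff = torch.quantile(mean_act, 0.10)
    n_prune = int((mean_act < cutoff).sum())
    print(f"{name}: {n_prune}/{mean_act.numel()} low-activation neurons")
```

The open question is what to do with the candidates afterwards (structured removal of rows/columns vs. just masking), which is why I'm asking if there's existing work on this.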

top 2 comments
[–] Zeth0s 2 points 1 year ago

The closest thing I know of is distillation; you can google it to find a few resources (e.g. https://huggingface.co/papers/2306.08543). I'm not sure if it's what you're looking for.
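
The rough idea (just a sketch of the generic recipe, not the exact method from that paper): you train a small "student" model to match the big "teacher" model's output distribution on your own data, so the student only has to be good at your use case. In PyTorch the core of it is a loss like this (the temperature value is a placeholder):

```python
# Sketch of a plain distillation loss (generic recipe, not the linked paper's
# method): the student is trained to match the teacher's softened token
# distribution on your own data.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable to a normal cross-entropy loss
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# usage: run the teacher under torch.no_grad(), train only the student
# loss = distillation_loss(student(input_ids).logits, teacher_logits)
```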

[–] [email protected] 2 points 1 year ago

I don't know about that, but you could try GGML (llama.cpp). It supports quantization down to 2 bits, so that might be small enough.
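
For example, with llama-cpp-python on top of a GGML file you've already quantized (the model path is just a placeholder):

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder for whatever quantized GGML file you have.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q2_K.bin")
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```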