From a recent PR by oobabooga:
This is what I get with 24 GB of VRAM (I haven't tested extensively; it may be possible to go higher):
| Model | Params | Maximum context |
|---|---|---|
| llama-13b | max_seq_len = 8192, compress_pos_emb = 4 | 6079 tokens |
| llama-30b | max_seq_len = 3584, compress_pos_emb = 2 | 3100 tokens |
I also removed the chat_prompt_size parameter, since truncation_length can be reused for its purpose.
Now possible in text-generation-webui after this PR: https://github.com/oobabooga/text-generation-webui/pull/2875
I didn't do anything other than exposing the compress_pos_emb parameter implemented by turboderp here, which in turn is based on kaiokendev's recent discovery: https://kaiokendev.github.io/til#extending-context-to-8k
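For readers wondering what the parameter does, here is a minimal sketch of the idea as I understand it: the token position is divided by the compression factor before the rotary embedding angles are computed, so a longer sequence is squeezed into the position range the model was trained on. The function name `rotary_angles`, the `head_dim` of 128, and the base of 10000 are assumptions chosen for illustration; this is not the actual ExLlama code.

```python
import math

def rotary_angles(position, head_dim, compress_pos_emb=1.0, base=10000.0):
    """Return the rotation angles a rotary embedding would use for one position.

    Assumption: compression just divides the position index by
    compress_pos_emb before the angles are computed.
    """
    scaled = position / compress_pos_emb
    return [scaled * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# With a factor of 4, position 8188 maps onto the angles that
# position 2047 gets without compression.
plain = rotary_angles(2047, head_dim=128)
compressed = rotary_angles(8188, head_dim=128, compress_pos_emb=4)
print(all(math.isclose(a, b) for a, b in zip(plain, compressed)))
```

In other words, with a factor of 4 the model never sees a position beyond its original 2048 range, even though the sequence itself is up to 8192 tokens long.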
How to use it
- Open the Model tab, set the loader as `ExLlama` or `ExLlama_HF`.
- Set `max_seq_len` to a number greater than 2048. The length that you will be able to reach will depend on the model size and your GPU memory.
- Set `compress_pos_emb` to `max_seq_len / 2048`. For instance, use 2 for `max_seq_len = 4096`, or 4 for `max_seq_len = 8192`.
- Select the model that you want to load.
- Set `truncation_length` accordingly in the Parameters tab. You can set a higher default for this parameter by copying `settings-template.yaml` to `settings.yaml` in your `text-generation-webui` folder, and editing the values in `settings.yaml`.

Those two new parameters can also be used from the command-line. For instance: `python server.py --max_seq_len 4096 --compress_pos_emb 2`.
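The `max_seq_len / 2048` rule can be sketched as a small helper that also builds the matching command line. The helper names are hypothetical, and rounding up to a whole number is an assumption based on the llama-30b example, where `max_seq_len = 3584` is paired with `compress_pos_emb = 2`.

```python
import math

NATIVE_CONTEXT = 2048  # the value max_seq_len is divided by in the steps above

def suggest_compress_pos_emb(max_seq_len):
    """Return max_seq_len / 2048, rounded up (the rounding is an assumption)."""
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    return math.ceil(max_seq_len / NATIVE_CONTEXT)

def server_command(max_seq_len):
    """Build the server.py invocation using the two flags from the post."""
    factor = suggest_compress_pos_emb(max_seq_len)
    return f"python server.py --max_seq_len {max_seq_len} --compress_pos_emb {factor}"

print(server_command(4096))
print(server_command(8192))
```

Whether a given `max_seq_len` actually fits still depends on the model size and GPU memory, so treat the result as a starting point rather than a guarantee.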