LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 1 year ago
1
 
 

Trying something new, going to pin this thread as a place for beginners to ask what may or may not be stupid questions, to encourage both the asking and answering.

Depending on activity level, I'll either make a new one once in a while or I'll just leave this one up forever to be a place to learn and ask.

When asking a question, try to make it clear what your current knowledge level is and where you may have gaps; that should help people provide more useful, concise answers!

2
67
submitted 11 months ago by Blaed to c/[email protected]
 
 

cross-posted from: https://lemmy.world/post/2219010

Hello everyone!

We have officially hit 1,000 subscribers! How exciting!! Thank you for being a member of [email protected]. Whether you're a casual passerby, a hobby technologist, or an up-and-coming AI developer - I sincerely appreciate your interest and support in a future that is free and open for all.

It can be hard to keep up with the rapid developments in AI, so I have decided to pin this at the top of our community to be a frequently updated LLM-specific resource hub and model index for all of your adventures in FOSAI.

The ultimate goal of this guide is to become a gateway resource for anyone looking to get into free open-source AI (particularly text-based large language models). I will be doing a similar guide for image-based diffusion models soon!

In the meantime, I hope you find what you're looking for! Let me know in the comments if there is something I missed so that I can add it to the guide for everyone else to see.


Getting Started With Free Open-Source AI

Have no idea where to begin with AI / LLMs? Try starting with our Lemmy Crash Course for Free Open-Source AI.

When you're ready to explore more resources, see our FOSAI Nexus - a hub for all of the major FOSS & FOSAI on the cutting/bleeding edges of technology.

If you're looking to jump right in, I recommend downloading oobabooga's text-generation-webui and installing one of the LLMs from TheBloke below.

Try both GGML and GPTQ variants to see which model type you prefer. See the hardware tables to get a better idea of which parameter size you might be able to run (3B, 7B, 13B, 30B, 70B).

8-bit System Requirements

| Model | VRAM Used | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|---|
| LLaMA-7B | 9.2 GB | 10 GB | 3060 12GB, 3080 10GB | 24 GB |
| LLaMA-13B | 16.3 GB | 20 GB | 3090, 3090 Ti, 4090 | 32 GB |
| LLaMA-30B | 36 GB | 40 GB | A6000 48GB, A100 40GB | 64 GB |
| LLaMA-65B | 74 GB | 80 GB | A100 80GB | 128 GB |

4-bit System Requirements

| Model | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|
| LLaMA-7B | 6 GB | GTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060 | 6 GB |
| LLaMA-13B | 10 GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 12 GB |
| LLaMA-30B | 20 GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | 32 GB |
| LLaMA-65B | 40 GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 | 64 GB |

*System RAM (not VRAM) is used to initially load a model. You can use swap space if you do not have enough RAM to load your LLM.

When in doubt, try starting with 3B or 7B models and work your way up to 13B+.
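As a rough cross-check on the hardware tables, weight memory scales with parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: the function names and the `overhead_fraction` knob are hypothetical, and real usage (context length, KV cache, framework buffers) can be noticeably higher, which is why measured figures tend to exceed the raw weight size.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Weights plus a flat overhead margin.

    overhead_fraction is an assumed placeholder, not a measured value;
    tune it against what your own setup actually reports.
    """
    return weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


if __name__ == "__main__":
    for size in (7, 13, 30, 65):
        for bits in (8, 4):
            print(f"LLaMA-{size}B at {bits}-bit: "
                  f"weights ~{weight_memory_gb(size, bits):.1f} GB, "
                  f"with margin ~{estimate_vram_gb(size, bits):.1f} GB")
```

For example, a 7B model at 8-bit needs about 7 GB for weights alone, and a 13B model at 4-bit about 6.5 GB, before any runtime overhead.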

FOSAI Resources

Fediverse / FOSAI

LLM Leaderboards

LLM Search Tools


Large Language Model Hub

Download Models

oobabooga

text-generation-webui - a community-favorite Gradio web UI by oobabooga, designed for running almost any free open-source large language model downloaded from HuggingFace, including (but not limited to) LLaMA, llama.cpp, GPT-J, Pythia, OPT, and many others. Its goal is to become the AUTOMATIC1111/stable-diffusion-webui of text generation. It is highly compatible with many formats.

Exllama

A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.

gpt4all

Open-source assistant-style large language models that run locally on your CPU. GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer-grade processors.

TavernAI

The original project that SillyTavern was forked from. This chat interface offers very similar functionality but has less cross-client compatibility with other chat and API interfaces than SillyTavern.

SillyTavern

Developer-friendly, Multi-API (KoboldAI/CPP, Horde, NovelAI, Ooba, OpenAI+proxies, Poe, WindowAI(Claude!)), Horde SD, System TTS, WorldInfo (lorebooks), customizable UI, auto-translate, and more prompt options than you'd ever want or need. Optional Extras server for more SD/TTS options + ChromaDB/Summarize. Based on a fork of TavernAI 1.2.8

Koboldcpp

A self-contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint. What does it mean? You get llama.cpp with a fancy UI, persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything Kobold and Kobold Lite have to offer, in a tiny package around 20 MB in size, excluding model weights.

KoboldAI-Client

This is a browser-based front-end for AI-assisted writing with multiple local & remote AI models. It offers the standard array of tools, including Memory, Author's Note, World Info, Save & Load, adjustable AI settings, formatting options, and the ability to import existing AI Dungeon adventures. You can also turn on Adventure mode and play the game like AI Dungeon Unleashed.

h2oGPT

h2oGPT is a large language model (LLM) fine-tuning framework and chatbot UI with document question-answer capabilities. Documents help to ground LLMs against hallucinations by providing them with context relevant to the instruction. h2oGPT is a fully permissive Apache V2 open-source project for 100% private and secure use of LLMs and document embeddings for document question-answer.


Models

The Bloke

The Bloke is a developer who frequently releases quantized (GPTQ) and optimized (GGML) open-source, user-friendly versions of AI Large Language Models (LLMs).

These conversions of popular models can be configured and installed on personal (or professional) hardware, bringing bleeding-edge AI to the comfort of your home.

Support TheBloke here.


70B


30B


13B


7B


More Models


GL, HF!

Are you an LLM Developer? Looking for a shoutout or project showcase? Send me a message and I'd be more than happy to share your work and support links with the community.

If you haven't already, consider subscribing to the free open-source AI community at [email protected] where I will do my best to make sure you have access to free open-source artificial intelligence on the bleeding edge.

Thank you for reading!

3
 
 

Hello y'all, I was using this guide to try and set up llama again on my machine. I was sure that I was following the instructions to the letter, but when I get to the part where I need to run setup_cuda.py install I get this error:

File "C:\Users\Mike\miniconda3\Lib\site-packages\torch\utils\cpp_extension.py", line 2419, in _join_cuda_home
    raise OSError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
(base) PS C:\Users\Mike\text-generation-webui\repositories\GPTQ-for-LLaMa>

I'm not a huge coder yet, so I tried to use setx to set CUDA_HOME to a few different places, but each time, running echo %CUDA_HOME doesn't come up with the address, so I assume it failed, and I still can't run setup_cuda.py.

Anyone have any idea what I'm doing wrong?
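One thing worth checking in the situation above: `setx` stores the variable for future processes and does not change the shell it was run in, and the `(base) PS` prompt suggests PowerShell, where `%CUDA_HOME%` is not expanded (`$env:CUDA_HOME` is the PowerShell form). Below is a minimal Python sketch for checking what a freshly opened shell actually sees. The helper names are hypothetical, and both the `CUDA_PATH` fallback and the `nvcc`-under-`bin` check are assumptions about a typical CUDA toolkit layout, not something confirmed by the post.

```python
import os
from pathlib import Path


def find_cuda_root(env=None):
    """Return (variable_name, path) for the first CUDA variable that is set."""
    env = os.environ if env is None else env
    # CUDA_PATH as a fallback is an assumption (the Windows installer usually sets it).
    for name in ("CUDA_HOME", "CUDA_PATH"):
        value = env.get(name)
        if value:
            return name, Path(value)
    return None, None


def has_nvcc(root):
    """True if an nvcc binary exists under <root>/bin (assumed toolkit layout)."""
    bin_dir = Path(root) / "bin"
    return any((bin_dir / exe).is_file() for exe in ("nvcc", "nvcc.exe"))


def report(env=None):
    """Summarize in one line what the current process can see."""
    name, root = find_cuda_root(env)
    if root is None:
        return "Neither CUDA_HOME nor CUDA_PATH is set in this process."
    status = "found" if has_nvcc(root) else "NOT found"
    return f"{name}={root} (nvcc {status} under bin)"
```

Running `print(report())` from a new terminal (opened after the `setx` call) shows whether the variable reached the process and whether it points at a directory that looks like a CUDA install root.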

4
 
 

You type "Once upon a time!!!!!!!!!!" and those exclamation marks are rendered to show the LLM-generated text, using a tiny 30MB model.

via https://simonwillison.net/2024/Jun/23/llama-ttf/

5
6
7
 
 

Hello! I am looking for some expertise from you. I have a hobby project where Phi-3-vision fits perfectly. However, the PyTorch version is a little too big for my 8GB video card. I tried looking for a quantized model, but all I found is 4-bit. Unfortunately, this model works too poorly for me. So, for the first time, I came across the task of quantizing a model myself. I found some guides for Phi-3V quantization for ONNX. However, the only options are fp32(?), fp16, int4. Then, I found a nice tool for AutoGPTQ but couldn't make it work for the job yet. Does anybody know why there is no int8/int6 quantization for Phi-3-vision? Also, has anybody used AutoGPTQ for quantization of vision models?

8
 
 

"Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?"

The problem has a light quiz style and is arguably no challenge for most adult humans, and probably not for some children either.

The scientists posed varying versions of this simple problem to various state-of-the-art LLMs that claim strong reasoning capabilities (GPT-3.5/4/4o, Claude 3 Opus, Gemini, Llama 2/3, Mistral and Mixtral, including the very recent Dbrx and Command R+).

They observed a strong collapse of reasoning and inability to answer the simple question as formulated above across most of the tested models, despite claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4 that occasionally manage to provide correct responses.

This breakdown can be considered to be dramatic not only because it happens on such a seemingly simple problem, but also because models tend to express strong overconfidence in reporting their wrong solutions as correct, while often providing confabulations to additionally explain the provided final answer, mimicking reasoning-like tone but containing nonsensical arguments as backup for the equally nonsensical, wrong final answers.
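For reference, the intended answer to the quoted question is M + 1: each of Alice's brothers has her M sisters plus Alice herself as sisters. A small sketch that checks this by building the family explicitly (the function names are made up for illustration, and it assumes N is at least 1 and that all siblings share the same parents):

```python
def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    """Closed-form answer: Alice's M sisters plus Alice herself."""
    if n_brothers < 1:
        raise ValueError("Alice needs at least one brother for the question to make sense")
    return m_sisters + 1


def count_by_enumeration(n_brothers: int, m_sisters: int) -> int:
    """Build the sibling list and count the sisters of one brother."""
    siblings = [("Alice", "F")]
    siblings += [(f"sister_{i}", "F") for i in range(m_sisters)]
    siblings += [(f"brother_{i}", "M") for i in range(n_brothers)]
    brother = ("brother_0", "M")
    # A brother's sisters are all female siblings other than himself.
    return sum(1 for person in siblings if person != brother and person[1] == "F")
```

Note that the number of brothers N does not affect the answer at all, which is part of what makes the question a useful distractor test.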

9
 
 

Remember 2-3 years ago when OpenAI had a website called transformer that would complete a sentence to write a bunch of text? Most of it was incoherent, but I think it is important for historic and humor purposes.

10
 
 


So here's the way I see it: with Data Center profits being the way they are, I don't think Nvidia's going to do us any favors with GPU pricing next generation. And apparently, the new rule is Nvidia cards exist to bring AMD prices up.

So here's my plan, starting with my current system:

OS: Linux Mint 21.2 x86_64  
CPU: AMD Ryzen 7 5700G with Radeon Graphics (16) @ 4.673GHz  
GPU: NVIDIA GeForce RTX 3060 Lite Hash Rate  
GPU: AMD ATI 0b:00.0 Cezanne  
GPU: NVIDIA GeForce GTX 1080 Ti  
Memory: 4646MiB / 31374MiB

I think I'm better off just buying another 3060 or maybe 4060ti/16. To be nitpicky, I can get 3 3060s for the price of 2 4060tis and get more VRAM plus wider memory bus. The 4060ti is probably better in the long run, it's just so damn expensive for what you're actually getting. The 3060 really is the working man's compute card. It needs to be on an all-time-greats list.

My limitations are that I don't have room for full-length cards (a 1080ti, at 267mm, just barely fits), also I don't want the cursed power connector. Also, I don't really want to buy used because I've lost all faith in humanity and trust in my fellow man, but I realize that's more of a "me" problem.

Plus, I'm sure that used P40s and P100s are a great value as far as VRAM goes, but how long are they going to last? I've been using GPGPU since the early days of LuxRender OpenCL and Daz Studio Iray, so I know that sinking feeling when older CUDA versions get dropped from support and my GPU becomes a paperweight. Maxwell is already deprecated, so Pascal's days are definitely numbered.

On the CPU side, I'm upgrading to whatever they announce for Ryzen 9000 and a ton of RAM. Hopefully they have some models without NPUs; I don't think I'll need them. As far as what I'm running, it's Ollama and Oobabooga, mostly models 32GB and lower. My goal is to run Mixtral 8x22b, but I'll probably have to run it at a lower quant, maybe one of the 40 or 50GB versions.

My budget: Less than Threadripper level.

Thanks for listening to my insane ramblings. Any thoughts?
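The 3060 versus 4060 Ti comparison in the post above comes down to simple arithmetic. In the sketch below, the per-card figures (12 GB on a 192-bit bus for the RTX 3060, 16 GB on a 128-bit bus for the RTX 4060 Ti 16GB) are spec-sheet values entered as assumptions and worth double-checking. The dataclass and function names are hypothetical, and prices are left out because the post does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int         # assumed spec-sheet value
    bus_width_bits: int  # assumed spec-sheet value, per card


def total_vram_gb(card: Card, count: int) -> int:
    """Combined VRAM across several identical cards."""
    return card.vram_gb * count


RTX_3060 = Card("RTX 3060 12GB", vram_gb=12, bus_width_bits=192)
RTX_4060_TI = Card("RTX 4060 Ti 16GB", vram_gb=16, bus_width_bits=128)

if __name__ == "__main__":
    for card, count in ((RTX_3060, 3), (RTX_4060_TI, 2)):
        print(f"{count} x {card.name}: {total_vram_gb(card, count)} GB total, "
              f"{card.bus_width_bits}-bit bus per card")
```

Under these assumptions, three 3060s give 36 GB against 32 GB for two 4060 Tis. Bus width is a per-card property and does not add up across cards.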

11
 
 

It actually isn't half bad depending on the model. It will not be able to help you with side streets but you can ask for the best route from Texas to Alabama or similar. The results may surprise you.

12
 
 

Current situation: I've got a desktop with 16 GB of DDR4 RAM, a 1st gen Ryzen CPU from 2017, and an AMD RX 6800 XT GPU with 16 GB VRAM. I can run 7-13b models extremely quickly using ollama with ROCm (19+ tokens/sec). I can run Beyonder 4x7b Q6 at around 3 tokens/second.

I want to get to a point where I can run Mixtral 8x7b at Q4 quant at an acceptable token speed (5+/sec). I can run Mixtral Q3 quant at about 2 to 3 tokens per second. Q4 takes an hour to load, and assuming I don't run out of memory, it also runs at about 2 tokens per second.

What's the easiest/cheapest way to get my system to be able to run the higher quants of Mixtral effectively? I know that I need more RAM; another 16 GB should help. Should I upgrade the CPU?

As an aside, I also have an older Nvidia GTX 970 lying around that I might be able to stick in the machine. Not sure if ollama can split across different brand GPUs yet, but I know this capability is in llama.cpp now.

Thanks for any pointers!
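A rough way to reason about the question above is to check whether the model file fits in VRAM plus system RAM once some headroom is reserved. The sketch below is only an estimate under stated assumptions: the ~26 GB figure for a Mixtral 8x7b Q4 file (about 46.7B parameters at roughly 4.5 bits per weight), the reserve sizes, and the function name are all hypothetical placeholders rather than measured values.

```python
def plan_offload(model_gb: float, vram_gb: float, ram_gb: float,
                 vram_reserve_gb: float = 2.0, ram_reserve_gb: float = 4.0) -> dict:
    """Split a model between GPU and system RAM and report what does not fit.

    The reserves stand in for the OS, desktop, and context buffers;
    they are assumed values, not measurements.
    """
    usable_vram = max(vram_gb - vram_reserve_gb, 0.0)
    usable_ram = max(ram_gb - ram_reserve_gb, 0.0)
    on_gpu = min(model_gb, usable_vram)
    remaining = model_gb - on_gpu
    in_ram = min(remaining, usable_ram)
    return {"gpu_gb": on_gpu, "ram_gb": in_ram, "shortfall_gb": remaining - in_ram}


if __name__ == "__main__":
    assumed_q4_size_gb = 26.3  # assumption, see the note above
    for ram in (16, 32):
        print(f"{ram} GB RAM:", plan_offload(assumed_q4_size_gb, vram_gb=16, ram_gb=ram))
```

With these placeholder numbers, 16 GB VRAM plus 16 GB RAM leaves a small shortfall, which would be consistent with heavy swapping and very long load times, while 32 GB of RAM leaves room to spare. Whether token speed then reaches the 5+/sec target depends on memory bandwidth and CPU, which this sketch does not model.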

13
 
 

Recently OpenAI released GPT-4o

Video I found explaining it: https://youtu.be/gy6qZqHz0EI

It's a little creepy sometimes, but the voice inflection is kind of wild. What a time to be alive.

14
15
 
 

I am planning my first AI lab setup, and was wondering how many tokens different AI workflows/agent networks eat up on an average day. For instance, talking to an AI all day, having devlin running 24/7, or whatever local agent workflow is running.

Of course, model inference speed and type of workflow influence most of these networks, so perhaps it's easier to define the number of tokens per project/result?

So I was curious about what typical AI workflows the lemmies here run, and how many tokens that roughly implies on average, or on a project-level scale? At the moment I don't even dare to guess.

Thanks..
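One way to put an upper bound on the question above is to convert generation speed into tokens per day. The sketch below is plain arithmetic: the function names are hypothetical, and the example rate of 20 tokens per second is an illustrative assumption, not a benchmark of any particular setup.

```python
SECONDS_PER_HOUR = 3600


def tokens_generated(tokens_per_second: float, active_hours: float) -> float:
    """Tokens produced if the model generates continuously for active_hours."""
    return tokens_per_second * SECONDS_PER_HOUR * active_hours


def hours_needed(total_tokens: float, tokens_per_second: float) -> float:
    """Generation time required for a project with a known token budget."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / (tokens_per_second * SECONDS_PER_HOUR)


if __name__ == "__main__":
    # Assumed rate for illustration only.
    print(f"{tokens_generated(20, 24):,.0f} tokens in 24 h at 20 tok/s")
    print(f"{hours_needed(1_000_000, 20):.1f} h for a 1M-token project at 20 tok/s")
```

At an assumed 20 tokens per second, an agent generating around the clock tops out at 1,728,000 tokens per day; real workflows spend much of their time idle or processing prompts, so actual output would be lower.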

16
17
 
 

Hartford is credited as creator of Dolphin-Mistral, Dolphin-Mixtral and lots of other stuff.

He's done a huge amount of work on uncensored models.

18
19
20
28
submitted 2 months ago* (last edited 2 months ago) by [email protected] to c/[email protected]
 
 

From Simon Willison: "Mistral tweet a link to a 281GB magnet BitTorrent of Mixtral 8x22B—their latest openly licensed model release, significantly larger than their previous best open model Mixtral 8x7B. I’ve not seen anyone get this running yet but it’s likely to perform extremely well, given how good the original Mixtral was."

21
22
23
 
 

I've been using tie-fighter which hasn't been too bad with lorebooks in tavern.

24
25
 
 

Afaik most LLMs run purely on the GPU, don't they?

So if I have an Nvidia Titan X with 12GB of RAM, could I plug this into my laptop and offload the load?

I am using Fedora, so getting the NVIDIA drivers would be... fun and already probably a dealbreaker (wouldn't want to run proprietary drivers on my daily system).

I know that using ExpressPort adapters people were able to use GPUs externally, and this is possible with Thunderbolt too, isn't it?

The question is, how well does this work?

Or would using a small SoC to host a webserver for the interface and do all the computing on the GPU make more sense?

I am curious about the difficulties here, ARM SoC and proprietary drivers? Laptop over USB-C (maybe not Thunderbolt?) and a GPU just for the AI tasks...
