this post was submitted on 05 Nov 2024

Self-hosting LLMs (lemmy.zip)

submitted 1 month ago by [email protected] to c/selfhosted

I'd like to self-host a large language model (LLM).

I don't mind if I need a GPU and all that, at least it will be running on my own hardware, and probably even cheaper than the $20 everyone is charging per month.

What LLMs are you self hosting? And what are you using to do it?

[–] [email protected] 23 points 1 month ago* (last edited 1 month ago) (3 children)

I run Mistral-Nemo (12B) and Mistral-Small (22B) on my GPU and they are pretty good. As others have said, GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good and 70B+ models are excellent (solely based on my own experience). Go for q4_K models, as they will run many times faster than higher-precision quantizations with little quality degradation. They typically come in S (Small), M (Medium) and L (Large) variants; take the largest that fits in your GPU memory. If you go below q4, you may see more severe and noticeable quality degradation.
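As a rough way to apply the "largest that fits" rule above: the weight footprint of a quantized model is approximately parameter count times bits per weight, divided by 8. The sketch below uses approximate bits-per-weight figures and a headroom value that are assumptions for illustration (they are not from this thread, and real GGUF files vary by model). It also ignores everything except the weights, so treat the result as a first guess.

```python
# Approximate bits per weight for a few GGUF quantization types.
# These figures are assumptions for illustration; actual files differ per model.
APPROX_BITS_PER_WEIGHT = {
    "q4_K_S": 4.6,
    "q4_K_M": 4.85,
    "q8_0": 8.5,
    "f16": 16.0,
}


def weight_size_gb(params_billion, quant):
    """Estimate the size of the model weights in GB (1 GB = 1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9


def largest_fitting_quant(params_billion, vram_gb, headroom_gb=1.5):
    """Return the highest-precision quant whose weights fit in VRAM, or None.

    headroom_gb is a hypothetical allowance for context and runtime buffers.
    """
    budget = vram_gb - headroom_gb
    fitting = [q for q in APPROX_BITS_PER_WEIGHT
               if weight_size_gb(params_billion, q) <= budget]
    # Prefer the candidate with the most bits per weight.
    return max(fitting, key=APPROX_BITS_PER_WEIGHT.get, default=None)
```

Under these assumed numbers, `largest_fitting_quant(12, 12)` picks `q4_K_M` for a 12B model on a 12GB card, while `largest_fitting_quant(22, 12)` returns `None`, meaning a 22B model would not fit entirely in that card's VRAM.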

If you only need to serve one user at a time, Ollama + Open WebUI works great. If you need to serve multiple users at the same time, check out vLLM.
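For the single-user case, Ollama also exposes a local HTTP API that you can call without any web UI. A minimal sketch using only the standard library, assuming Ollama is running on its default port 11434 and that the model has already been pulled; the function names are made up for this example.

```python
import json
import urllib.request

# Default local Ollama address; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=300.0):
    """Send the prompt and return the generated text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("mistral-nemo", "Why is the sky blue?")` should then return the reply as a string, provided the server is up and that model tag is installed.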

Edit: I'm simplifying it very much, but hopefully it is simple and actionable as a starting point. I've also seen great results from Gemma2-27B.

Edit2: added links

Edit3: a decent GPU in terms of bang for the buck, IMO, is the RTX 3060 with 12GB. It may be available on the used market for a decent price and offers a good amount of VRAM and GPU performance for the cost. I would like to recommend AMD GPUs, as they offer much more GPU memory for their price, but not all of them are well supported by ROCm and I'm not sure how compatible they are with these tools, so perhaps others can chime in.

Edit4: you can also use Open WebUI with VS Code through the continue.dev extension, so that you have a Copilot-style LLM in your editor.

[–] [email protected] 2 points 1 month ago

I run ollama:rocm with the deepseek-coder model on a Radeon 6700XT. I only had to set the GPU via environment variables because it is not officially supported by ROCm, but it works.

[–] [email protected] 1 points 1 month ago (1 children)

> If you only need to serve one user at a time, Ollama + Open WebUI works great. If you need to serve multiple users at the same time, check out vLLM.

Why can't it serve multiple users? Open Web UI seems to support multiple users.

[–] [email protected] 3 points 1 month ago (1 children)

I didn't say it can't, but I'm not sure how well it is optimized for that. In my initial testing it queued queries and submitted them to the model one after another. I have not seen it batch-compute the queries, but maybe it's a setup thing on my side. vLLM, on the other hand, is designed specifically for the concurrent multi-user use case and has multiple optimizations for it.
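One way to check whether your own setup queues or batches, as discussed above, is to send several prompts at once and compare the per-request latencies with the total wall time. A minimal sketch, where `send` is a hypothetical callable you supply (for example, a function that POSTs a prompt to your server and waits for the full reply):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send, prompt):
    """Call send(prompt) and return how long it took in seconds."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def measure_concurrency(send, prompts):
    """Fire all prompts at once; return (wall_time, per_request_latencies)."""
    start = time.perf_counter()
    # One worker per prompt so every request is in flight at the same time.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        latencies = list(pool.map(lambda p: timed_call(send, p), prompts))
    return time.perf_counter() - start, latencies
```

If requests are processed one after another, the latencies tend to climb like a staircase and the wall time approaches the sum of the individual times. If they are batched, the latencies stay close together.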

[–] [email protected] 1 points 1 month ago

I see. Makes sense.

[–] [email protected] 10 points 1 month ago

LLMs use a ton of VRAM; the more you have, the better.

If you just need an API, then TabbyAPI is pretty great.

If you need a full UI, then oobabooga's Text Generation WebUI is a good place to start.

[–] InverseParallax 7 points 1 month ago* (last edited 1 month ago)

Ollama, llama3.2, deepcode and a bunch of others.

I'm using a GPU, but man, these tools are picky: they mostly want Nvidia GPUs.

Do NOT be afraid to run on the CPU. It's slow, but for one user it's actually mostly fine.

[–] Deckweiss 5 points 1 month ago

GPT4All is a nice and easy start.

[–] [email protected] 5 points 1 month ago* (last edited 1 month ago) (2 children)

You can run this right from Windows: https://jan.ai/

You'll need a lot of RAM, and processing is decently fast, even on a basic laptop.

edit: holy hell. Grammar.

[–] [email protected] 3 points 1 month ago

Tip: you can copy and paste the Hugging Face link directly into the search box, and it will download the model automatically! Also, it’s pretty smart. It will load into your VRAM first, then your RAM. If you can fit everything into VRAM, you get the fastest speed. But even if you are using RAM, it’s not terribly bad; it’s still faster than you can read.
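The VRAM-first, then-RAM behaviour described above can be approximated with a little arithmetic. This is only a sketch: it assumes all layers are the same size and ignores context memory, the numbers in the usage note are hypothetical, and real loaders make this decision themselves.

```python
def split_layers(model_gb, n_layers, vram_budget_gb):
    """Return (gpu_layers, cpu_layers), assuming equally sized layers."""
    per_layer_gb = model_gb / n_layers
    # Fill VRAM first; whatever does not fit stays in system RAM.
    gpu_layers = int(max(0, vram_budget_gb) // per_layer_gb)
    gpu_layers = min(n_layers, gpu_layers)
    return gpu_layers, n_layers - gpu_layers
```

For example, `split_layers(20, 64, 5)` puts 16 of 64 layers on the GPU and leaves 48 in RAM, while `split_layers(4, 32, 6)` fits all 32 layers in VRAM.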

[–] [email protected] 1 points 1 month ago

This is pretty cool!

[–] [email protected] 5 points 1 month ago (1 children)

I'm using Ollama to try a couple of models right now for an idea. I've tried running Llama 3.2 and Qwen 2.5 3b, both of which fit in my 3050 6G's VRAM. For fun, I've also tried Qwen 2.5 32b, which fits in my RAM (I've got 128G), but it could only generate a couple of tokens per second, making it very much a non-interactive experience. I will need to explore the response-time piece a bit further to see if there are ways I can still lean on larger models with longer delays.
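To put "a couple of tokens per second" into perspective, the wait for a full reply is roughly the reply length divided by the generation speed. The reply length and the GPU speed below are hypothetical values chosen for illustration, and the estimate ignores the time spent processing the prompt.

```python
def reply_seconds(reply_tokens, tokens_per_second):
    """Time to generate a reply of the given length at a steady rate."""
    return reply_tokens / tokens_per_second


# Hypothetical reply length and generation speeds, for illustration only.
REPLY_TOKENS = 300
for label, speed in [("32b model in RAM", 2.0), ("3b model in VRAM", 50.0)]:
    print(f"{label}: about {reply_seconds(REPLY_TOKENS, speed):.0f} s")
```

At 2 tokens per second, a 300-token answer takes about 150 seconds, which is fine for a background job but not for chatting.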

[–] [email protected] 1 points 1 month ago

Please try the 4-bit quantisations of the models. They run a bunch faster while eating less RAM.

Generally you want to use 7B or 8B models on the CPU, since anything larger will be hellishly sluggish.

[–] [email protected] 4 points 1 month ago (1 children)

If you don’t need to host but can run locally, GPT4All is nice: it has several plug-and-play models to download, each with a description of what it is for, and it doesn’t require a GPU.

[–] [email protected] 2 points 1 month ago

I second that. Even my lower-midrange laptop from 3 years ago (8GB RAM, Integrated AMD GPU) can run a few of the smaller LLMs, and it's true that you don't even need a GPU as they can run in RAM. And depending on how much RAM you have and what GPU, you might find models performing better in RAM instead of on the GPU. Just keep in mind that when a model says, for example, 8GB Memory required, if you have 8GB RAM, you can't run it cuz you also have your operating system and other applications running. If you have 8GB video memory on your GPU though, you should be golden (I think).

[–] [email protected] 4 points 1 month ago

My (docker based) configuration:

Linux > Docker Container > Nvidia Runtime > Open WebUI > Ollama > Llama 3.1

Docker: https://docs.docker.com/engine/install/

Nvidia Runtime for docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Open WebUI: https://docs.openwebui.com/

Ollama: https://hub.docker.com/r/ollama/ollama

[–] TomAwezome 4 points 1 month ago

TinyLLM on a separate computer with 64GB RAM and a 12-thread AMD Ryzen 5 5500GT, using the rocket-3b.Q5_K_M.gguf model, runs very quickly. Most of the RAM is used up by other programs I run on it; the LLM doesn't take the lion's share. I used to self-host on just my laptop (a 5+ year old Thinkpad with upgraded RAM) and it ran OK with a few models, but after a few months I saved up to build a rig just for that kind of stuff to improve performance. It's all CPU, no GPU, even though a GPU would be faster, since I was curious whether CPU-only would be usable, which it is. I also use the Llama-2 7b model or the 13b version; the 7b model ran slowly on my laptop but runs at a decent speed on the larger rig. The fewer billions of parameters, the more goofy they get. Rocket-3b is great for quickly getting an idea of things, not great for copy-pasters. Llama 7b or 13b is a little better at handing you almost-exactly-correct answers. I think those models are meant for programming, but sometimes I ask them general life questions or vent to them and they receive it well and offer OK advice. I hope this info is helpful :)

[–] [email protected] 2 points 1 month ago

I am not self-hosting an LLM, but I run Google's Gemma 2B on my laptop with Alpaca. On my hardware it's pretty slow, but it kind of gets the work done. My hardware is getting old and I need to upgrade soon.

[–] [email protected] 2 points 1 month ago

I run mistral-nemo locally on my 1070 Ti.

[–] [email protected] 1 points 1 month ago

I have a home server with an Nvidia Tesla P4. It doesn't have the most power or the most VRAM (8GB), but it can be had for ~$100 USD (it is a headless GPU, so no video outputs).

I'm using Ollama with dolphin-mistral and, recently, deepseek-coder.

[–] [email protected] 1 points 1 month ago

GPT4All and Jan.AI are good places to start.