AI

Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen.

LLM ASICs on USB sticks? (lemmy.ml)

submitted 5 months ago by [email protected] to c/[email protected]

Source: nostr

https://snort.social/nevent1qqsg9c49el0uvn262eq8j3ukqx5jvxzrgcvajcxp23dgru3acfsjqdgzyprqcf0xst760qet2tglytfay2e3wmvh9asdehpjztkceyh0s5r9cqcyqqqqqqgt7uh3n

Paper: https://arxiv.org/abs/2406.02528

[–] [email protected] 3 points 5 months ago (4 children)

That would actually be insane. Right now, I still need my GPU and about 8-10 gigs of VRAM to run a 7B model tho, so idk how that's supposed to work on a phone. Still, being able to run a model that's as good as a 70B model but with the speed and memory usage of a 7B model would be huge.

[–] [email protected] 4 points 5 months ago (1 children)

I only need ~4 GB of RAM/VRAM for a 7B model, my GPU only has 6GB VRAM anyway. 7B models are smaller than you think, or you have a very inefficient setup.

[–] [email protected] 4 points 5 months ago (1 children)

That's weird, maybe I actually am doing something wrong. Is it because I'm using GGUF models maybe?

[–] [email protected] 1 points 5 months ago (1 children)

Llama 2 GGUF with 2-bit quantisation only needs ~5 GB VRAM. 8-bit needs >9 GB. Anything in between is possible. There are even 1.5-bit and 1-bit options (not GGUF AFAIK). Generally, fewer bits means worse results though.
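
For a rough sense of why the bit width matters so much, here is a back-of-the-envelope sketch in Python. It only counts the weights (parameters times bits per weight) and ignores the KV cache, activations, and the per-block scale overhead that real quant formats add, so actual VRAM use will come out higher than these numbers. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored with the same number of bits.
    Real quantised files mix bit widths and store extra scale factors.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Compare a 7B model at several bit widths (16 = unquantised fp16).
    for bits in (2, 4, 6, 8, 16):
        print(f"7B @ {bits:>2}-bit: {weight_memory_gb(7, bits):.2f} GB")
```

For a 7B model this gives 1.75 GB at 2-bit, 3.50 GB at 4-bit, 5.25 GB at 6-bit, 7.00 GB at 8-bit and 14.00 GB at 16-bit, weights only.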

[–] [email protected] 1 points 5 months ago

Yeah, I usually take the 6-bit quants, didn't know the difference is that big. That's probably why tho. Unfortunately, almost all Llama 3 models are either 8B or 70B, so there isn't really anything in between. But I find Llama 3 models to be noticeably better than Llama 2 models, otherwise I would have tried bigger models with lower quants.

[–] [email protected] 2 points 1 month ago* (last edited 1 month ago) (1 children)

I'm even more excited for running 8B models at the speed of 1B! Laughably fast ok-quality generations in JSON format would be crazy useful.

Also yeah, that 7B on mobile was not the best example. Again, probably 1B to 3B is the sweet spot for mobile (I'm running Qwen2.5 0.5B on my phone and it works really well for simple JSON)

EDIT: And imagine the context lengths we would be able to run on our GPUs at home! What a time to be alive.

[–] [email protected] 2 points 1 month ago (1 children)

Being able to run 7B quality models on your phone would be wild. It would also make it possible to run those models on my server (which is just a mini pc), so I could connect it to my Home Assistant voice assistant, which would be really cool.

[–] [email protected] 1 points 4 weeks ago (1 children)

Something similar to this already kinda exists on HF with the 1.58-bit quantisation, which seems to get very similar performance to the original Llama 3 8B model. That's essentially a two-bit quantisation with reasonable performance!
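
In case the odd number is confusing: assuming "1.58 bit" refers to ternary weights (each weight is -1, 0 or +1), it is just log2(3), the information content of a three-way choice. The sketch below shows that, plus one possible way to get close to it in practice by packing five ternary values into a single byte (3^5 = 243 fits in 256), which works out to 1.6 bits per weight. `pack5` and `unpack5` are hypothetical helpers for illustration, not how any particular library stores its weights.

```python
import math

# Information content of one weight restricted to {-1, 0, +1}.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # roughly 1.585


def pack5(trits):
    """Pack five values from {-1, 0, 1} into one integer in 0..242."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        # Shift each trit to {0, 1, 2} and treat it as a base-3 digit.
        value = value * 3 + (t + 1)
    return value


def unpack5(value):
    """Inverse of pack5: recover the five ternary values in order."""
    trits = []
    for _ in range(5):
        trits.append(value % 3 - 1)
        value //= 3
    return trits


if __name__ == "__main__":
    print(f"log2(3) = {BITS_PER_TERNARY_WEIGHT:.3f} bits per weight")
    weights = [1, -1, 0, 0, 1]
    packed = pack5(weights)
    assert packed < 256            # fits in one byte
    assert unpack5(packed) == weights
    print("8 bits / 5 weights =", 8 / 5, "bits per weight")
```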

[–] [email protected] 2 points 4 weeks ago

That's really interesting, gonna try out how well it runs

[–] [email protected] 2 points 4 weeks ago

Slowly, is how

[–] [email protected] 1 points 5 months ago (1 children)

I have never worked on machine learning, what does the B stand for? Billion? Bytes?

[–] [email protected] 2 points 5 months ago (1 children)

I think it's how many billion parameters the model has

[–] [email protected] 1 points 5 months ago

Thanks!