brucethemoose

joined 10 months ago
[–] brucethemoose -1 points 1 hour ago* (last edited 1 hour ago) (3 children)

Oh, now I remember. His press conferences during the first time were like rallies, just blurting random shower thoughts out.

…If the Dems had just advertised his first term conferences verbatim, no extra context, no picking and choosing, just shown them, they would have easily won.

[–] brucethemoose 1 points 1 hour ago* (last edited 1 hour ago)

I mean, if you have a huge GPU, sure. Or at least 12GB of free VRAM, or a big Mac.

Local LLMs for coding are kind of a niche, because most people don't have a 3090 or 7900 lying around, and you really need 12GB+ of free VRAM before the models start being "smart" and worth using over free LLM APIs, much less cheap paid ones.

But if you do have the hardware and the time to set a server up, the Deepseek R1 models or the FuseAI merges are great for "slow" answers, where the model thinks things out before replying. Qwen 2.5 32B Coder is great for quick answers on 24GB of VRAM. Arcee 14B is great for 12GB of VRAM.

Sometimes running a small model on a "fast," less VRAM-efficient backend is better for things like Cursor code completion.

[–] brucethemoose -4 points 3 hours ago* (last edited 3 hours ago) (1 children)

I’m not going to mock and belittle people who were losing their loved ones as a result of Democrat policies choosing not to partake in the election.

I am.

Many seemed to think that Trump and Biden/Harris had roughly "equivalent" Gaza policy, as evidenced by their surprise at Trump's actions. That is dangerous misinformation, and it deserves to be called out as a mistake, no matter how tragic the consequences are.

I view our election system as a rigged game, and not participating in swing states as a protest is all but choosing a side, in my eyes. Especially when the consequences are so personally important. The opportunity cost is Democrat votes, and Trump's gain. In that spirit, I actually registered as a Republican in my home state, only so I could vote against Trumpist politicians wherever I can, as there's basically no hope for Democrats and my vote would be "thrown away," relatively speaking. And I can still vote against Republicans in the general election and some other offices that have a shot.

[–] brucethemoose -2 points 4 hours ago* (last edited 4 hours ago) (4 children)

Friend, I vote in primaries, I talk up third parties and dissenting wings of the Democrats (or even Republicans), and I point out how rigged the primary system is every chance I get. I am not a Biden apologist.

But November 2024 was too freaking late. When your country is getting Hitler or half-Hitler, you vote for half-Hitler, instead of whining about them and trying to get other half-Hitler voters to stay home when it's not going to change a thing.

Does it entrench the system? Sure.

[–] brucethemoose -3 points 5 hours ago* (last edited 5 hours ago) (8 children)

More whatabouts.

For all their sins, Harris (or Biden) would not have tried to specifically deport Palestine protestors, nor so specifically and openly supported ending Palestine.

And this is just the beginning. That language from the WH sounds like a good setup for classifying Palestine protestors as terrorists.

[–] brucethemoose -2 points 5 hours ago (11 children)

https://en.wikipedia.org/wiki/Gaza_war_protest_vote_movements#Withdrawal_of_Joe_Biden

On the other hand, Abandon Harris endorsed Green Party candidate Jill Stein, who said she would end all military support to Israel if elected, and the group said that it was "confronting two destructive forces: one currently overseeing a genocide and another equally committed to continuing it"

Following the loss of Harris, many in the movement felt vindication. Significant portions of the electorate in Dearborn, Michigan, an Arab American majority city, did not vote for Harris.[77] Muslims who voted for Trump, and were thus pivotal in helping him win the three key states of the Rust Belt (Michigan, Pennsylvania, and Wisconsin being Harris's clearer path for a narrow win in the Electoral College), were subsequently upset that Trump nominated pro-Israel cabinet picks...

4
submitted 5 hours ago* (last edited 5 hours ago) by brucethemoose to c/cachyos
 

Hey, I have nothing to do with CachyOS or this Lemmy community, but just wanna say I love this distro.

It's everything annoying about Arch Linux (to me) fixed, more convenient, and objectively fast as heck. I distro-hopped for a long time, but have zero inclination to switch after finding CachyOS. I hardly need to tweak anything; it's all optimal out of the box! And how many other distros offer their own AVX2/AVX512 packages by default?

...I haven't even reinstalled CachyOS on my main PC in almost two years. I can't say that for Ubuntu, Fedora, PopOS, or (heaven forbid) Manjaro, all of which are ostensibly more stable yet always seemed to break, or fall behind on fixes I needed. Kinoite was too finicky with the whole immutable thing. Garuda Linux was OK, but more bloated, and not nearly as "optimally preconfigured" as CachyOS.

[–] brucethemoose 3 points 6 hours ago* (last edited 6 hours ago)

Oof... Thanks. I appreciate the history lesson, as they did not teach that little detail in my schools.

[–] brucethemoose 57 points 7 hours ago* (last edited 7 hours ago) (11 children)

This really is reminiscent of early Nazi Germany, with an obsession over trans people (like Jews), and the idea that they're the root of so much evil, and the constant implication that things would be better if they just go away...

[–] brucethemoose 7 points 9 hours ago

That’s the whole point. Deflect real controversy with stupid sound bites.

[–] brucethemoose 27 points 10 hours ago* (last edited 10 hours ago) (2 children)

My friend, the Chinese have been releasing amazing models all last year; it just didn't make headlines.

Tencent's Hunyuan Video is incredible. Alibaba's Qwen is still a go-to local model. I've used InternLM pretty regularly… Heck, Yi 32B was awesome in 2023, as the first decent long-context local model.

…The Janus models are actually kind of meh, unless you're captioning images, and FLUX/Hunyuan Video are still king in the diffusion world.

[–] brucethemoose 1 points 18 hours ago

Basically the world is waiting for the Nvidia monopoly to break and training costs to come down, then we will see...

[–] brucethemoose 2 points 20 hours ago* (last edited 20 hours ago) (2 children)

Depends what you mean by "AI"

Generative models as you know them are pretty much all transformers, and there are already many hacks to let them ingest images, video, sound/music, and even other formats. I believe there are some dedicated 3D models out there, as well as some experiments with "byte-level" LLMs that can theoretically take any data format.

But there are fundamental limitations, like the long context you'd need for 3D model ingestion being inefficient. The entities that can afford to train the best models are "conservative" and tend to shy away from testing exotic implementations, presumably because they might fail.

Some seemingly "solvable" problems, like the repetition issues you encounter with programming, have not had potential solutions adopted either, and the fix in use (literally randomizing the output) makes them fundamentally unreliable. LLMs are great assistants, but you can never fully trust them as-is.

What I'm getting at is that everything you said is theoretically possible, but the entities with the purse strings are relatively conservative and tend to pursue profitable pure text performance instead. So I bet they will remain as "interns" and "assistants" until there's a more fundamental architecture shift, maybe something that learns and error corrects during usage instead of being so static.


And as stupid as this sounds, another problem is packaging. There are some incredible models that take media or even 3D as input, for instance... but they are all janky, half functional python repos researchers threw up before moving on. There isn't much integration and user-friendliness in AI land.

40
submitted 2 weeks ago* (last edited 2 weeks ago) by brucethemoose to c/politics
 

Here's the Meta formula:

  • Put a Trump friend on your board (Ultimate Fighting Championship CEO Dana White).
  • Promote a prominent Republican as your chief global affairs officer (Joel Kaplan, succeeding liberal-friendly Nick Clegg, president of global affairs).
  • Align your philosophy with Trump's on a big-ticket public issue (free speech over fact-checking).
  • Announce your philosophical change on Fox News, hoping Trump is watching. In this case, he was. "Meta, Facebook, I think they've come a long way," Trump said at a Mar-a-Lago news conference, adding of Kaplan's appearance on the "Fox and Friends" curvy couch: "The man was very impressive."
  • Take a big public stand on a favorite issue for Trump and MAGA (rolling back DEI programs).
  • Amplify that stand in an interview with Fox News Digital. (Kaplan again!)
  • Go on Joe Rogan's podcast and blast President Biden for censorship.
16
submitted 1 month ago* (last edited 1 month ago) by brucethemoose to c/enoughmuskspam
 

Taboola's data, shared exclusively with Axios, shows Musk has outpaced his closest peers — Jeff Bezos and Mark Zuckerberg — for years, but the gap widened dramatically in 2024.

The spam is already exponential. :(

372
submitted 1 month ago* (last edited 1 month ago) by brucethemoose to c/politics
 

Reality check: Trump pledged to end the program in 2016.

Called it. When push comes to shove, Trump is always going to side with the ultra-rich.

 

Trump, who has remained silent thus far on the schism, faces a quickly deepening conflict between his richest and most powerful advisors on one hand, and the people who swept him to office on the other.

All this is stupid. But I know one thing:

Trump is a billionaire.

And I predict his followers are going to learn who he’ll side with when push comes to shove.

Also, Bannon’s take is interesting:

Bannon tells Axios he helped kick off the debate with a now-viral Gettr post earlier this month calling out a lack of support for the Black and Hispanic communities in Big Tech.

 

I think the title explains it all… Even right wing influencers can have their faces eaten. And Twitter views are literally their livelihood.

Trump's conspiracy-minded ally Laura Loomer, New York Young Republican Club president Gavin Wax and InfoWars host Owen Shroyer all said their verification badges disappeared after they criticized Musk's support for H1B visas, railed against Indian culture and attacked Ramaswamy, Musk's DOGE co-chair.

 

I have no idea if anyone on Lemmy is into Avatar lore/fanfiction, but in the spirit of posting content here instead of Reddit... here goes.


A new Avatar series featuring 'twin' Avatars has been leaked, in case you missed it:

https://knightedgemedia.com/2024/12/avatar-seven-havens-twin-earth-avatar-series-will-initially-be-26-episodes-long/

https://lemmy.world/post/23427458

In a nutshell, it's allegedly set in a cataclysmic world overrun by spirit vines, and two twins are the 'Avatars,' with diametric personalities. Not much is known beyond that, but I've been brainstorming post-LoK ideas forever.

And now I kinda feel like writing them out. Here's my thought dump for a story:


In the (largely undepicted) three years of canon Korra Alone, Korra (traveling anonymously) makes a stop on Kyoshi Island hoping to reconnect with her spirit. Instead, she meets a humble blacksmith with a bird spirit on his shoulder, and connects with him. They both wrestle with the demons haunting them, and they discover secrets on the island from Kyoshi's era.

Korra dies in 190 AG (at 37), already weakened from her metal poisoning, saving the world from a cataclysm that leaves much of it overgrown.

Initially, the story jumps between this period in Korra Alone (174 AG) and 206 AG, where Asami Sato struggles to steer Future Industries in a world dominated by megacorps in the safe 'havens' dotted across the globe. While Kyoshi Island has barely changed at all, the 'future' thread has a more cyberpunk feel. Chi-based cybernetics are commonplace, but the more augmented someone is, the more their bending is compromised, and bender-vs-nonbender tensions flare up once more. The world outside the safe havens is a dangerous wasteland. Tech derived from studying spirits has led to the proliferation of holograms, BCIs, and even primitive assistants and virtual environments, and advances in power storage/generation already seen in LoK mean everything is largely electric. Yet the world is still "analog," with tube radios and TVs, no digital electronics, and 'dumb' virtual assistants that are error-prone and incapable of math, giving it a retro feel. There are no guns, of course, but personal weapons like arc casters, flamethrowers, and cryo blasters all mimic bending.

The Sei'naka clan has risen to power in the Fire Nation, taking advantage of the aftermath of the 100 Years War, the Red Lotus Insurrection, Future Industries' relative benevolence, and even the recent calamity. Now a ruthless corporation bigger than Future Industries, they dominate business and politics wherever they expand.

The White Lotus's search for the Avatar has failed. Asami rather infamously misidentified the Avatar... until one day, she finds them.


Once this background is established, the story jumps back to our inseparable twin Avatars, Pavi and Nisha, born deep in the Foggy Swamp. Thanks to their predecessor, they live a harmonious, largely isolated life as members of the Foggy Swamp Tribe on the back of a water Lion Turtle. To her utter shock, they manage to manifest Korra at nine, with Past Life Korra appearing as a nine-year-old. Dumbfounded, and not even sure who the 'real' Avatar is, the girls assume she is just another spirit in the swamp. So Korra decides to go along with it, letting them have the childhood she never had under the White Lotus while she figures out just what's going on with the twin Avatars.

Ultimately, the real world comes crashing into the new Avatars' isolated life, and they react poorly to Korra telling them the truth at 16. Through more disasters and tragedies, they end up on the streets of Republic City, separated for a time, before finding friends. The rest of the story revolves around corporate and personal greed (very much like the real world), conflict (and synergy) between the environment/spirits and technology, rivalries, family, friends across lifetimes, the nature of consciousness, reincarnation and the soul, and a conspiracy going all the way back to Kuruk threading through everything.


Some character profiles I'm working on:

  • Priya: Independent, kind, and resourceful. Priya is a reluctant hero who avoids altercations or fighting, but still believes in helping people with her creativity and wits. She's a talented musician and loves making up songs on her taanbur (a guitar-like instrument). Priya lost most of her leg to a falling tree in the Foggy Swamp, the same accident that killed her parents, but bends roots and muddy water as if they were her own limb. It's eventually revealed that she carries Raava. Once she finds out, Priya in particular is reluctant to accept her role as the Avatar, until a tragedy forces her hand.

  • Nikki is snarkier. She loves her powers and the attention they afford her, but her biggest fear is being forgotten or not accepted by others. Nikki is awkward but puts on a face of supreme confidence to hide the fact that she feels like a fish out of water. Despite her cocky attitude, Nikki is also an innovator, and her wild side is useful at times. Like her sister, she's highly attuned to the swamp, able to connect to and even manifest the collective memories of lost loved ones by touching spirit vines. Both are apparently waterbenders with a proclivity for mud. It's eventually revealed that she carries Vaatu inside her. Nikki is missing part of her arm, but bends a replacement, much like Ming-Hua.

  • Korra: Largely as she is in LoK. Hot-blooded, quick to fight, passionate, empathetic, and not very spiritual. In a reversal of roles, Priya and Nikki keep her manifested constantly, and Korra becomes their best friend, learning about their life in the Foggy Swamp. Later in the story, Korra is almost a Johnny Silverhand to the new Avatars: manifested at will, a voice constantly in their heads offering commentary, occasionally butting heads with them in a complex but close and encouraging relationship.

  • Asami Sato: Largely as she was in LoK: driven, collected, strong, smart, loyal. Now she's fifty, with a cybernetic leg from an accident. Asami is still altruistic, and has retained control of Future Industries through the years, but struggles with pushback from a corporate world driven by expansion and greed, and ultimately has to grapple with some of what her own company has done under her nose.

  • Mako: Largely as he was, brooding, cool, a noir-like detective. Recently retired as police chief, and has been secretly piecing together the conspiracy running through the plot.

  • Ren: The blacksmith Korra meets on Kyoshi Island. Softspoken, painfully shy, airheaded and ADD, stocky and green-eyed, Ren nonetheless has a dry wit. He's self-deprecating to a fault, but has a soft heart. To Korra's utter shock, Ren is a metalbender and a lavabender, using the combination to effortlessly sculpt armor and weapons, and to tinker with delicate electronics. He's terrified of lightning, with a massive scar covering his back that flares up in storms or when he's anxious. Almost as broken as Korra is at the start, Ren reveals that his father's ancestors were lavabending miners and blacksmiths in the Hundred Years War. His past is initially shrouded in mystery, but it's slowly revealed that his mother was the scientist who originally conceived of spirit vine technology, and that Varrick only replicated some of her work. Ren's mom had an 'Oppenheimer moment' and defected from Kuvira's proto Earth Empire. Ren ended up the only survivor, deeply scarred by a spirit vine "detonation" similar to the LoK finale that fused his soul to his body, and he's hiding from warlords hunting him for what he knows. Through the story, he grows particularly close to Asami and Korra, and grapples with some of the technology he pioneered.

  • Kaida: CTO of Future Industries, Kaida is the biological daughter of Korra and Ren, who both died when she was 11. Utterly tenacious, hot-blooded, fearless, and a fierce fighter like her mom, Kaida barges into the story by literally melting the metal floor in front of reporters harassing her 'mom,' Asami. Fiercely intelligent and impulsive, but with some of her dad's airheadedness, introversion, and love of tinkering with technology, Kaida is almost constantly clad in meteor-metal alloy plate armor she wears as a second skin. She favors a jian, like the one Korra learned to use on Kyoshi Island. Kaida is a talented engineer, but struggles with the tremendous legacy she's been thrust into.

  • Yuri Sei'naka: One of many vying for supremacy in the Sei'naka family, Yuri resembles Azula: a charismatic leader with a ruthless streak, an obsession with perfection, and a fantastically talented firebender, with the same sharp yellow eyes and features. Like her twin brother, Yoru, Yuri chose the 'hard' path of bending over the advanced cybernetics the wealthy have access to. Nevertheless, she has a good moral compass, and is unconditionally loyal to her brother. The siblings have an intense rivalry with Kaida, just as their company rivals Future Industries.

  • Yoru Sei'naka: A firebending and lightningbending prodigy and a cunning strategist, Yoru is mute, having lost his ability to speak in a sparring accident as a kid. Yoru and Yuri are practically inseparable, with Yuri serving as his voice. Tasked with tracking down the unknown Avatar by the matriarch of the clan, and always beholden to his intense sense of honor, Yoru suffers through a tragic 'Zuko' arc over the course of the story.

spoiler

  • Father Glowworm: The ancient spirit survived the death of Yun, and is an ever-present invisible hand through the story, albeit with a newfound distaste for humans. The swamp, taboo spirit vine technology, and just how he tunnels between worlds will all tie into crises Priya and Nikki must navigate.

  • I'm still working on other antagonists, but there will be a warlord who tries to capture Ren on Kyoshi Island, a ruthless corporate matriarch of the Sei'naka dynasty (Natsu?), a charismatic rebel somewhere between Amon and Zaheer, and more. As companions for the Avatars, I'm also thinking about a blind airbending thief who rejected his rich family, a loud, warm Sun Warrior whose people have resettled in Republic City, and an introverted, netrunner-like hacker.


  • Other thoughts:

    • I don't like some 'leaked' aspects of the upcoming show, like the twin Avatars being nine and the White Lotus being so involved and 'problematic.' I'd much rather have the twin Avatars be lost, ignorant of their own nature in the Foggy Swamp because they appear to be waterbenders with a proclivity for mud.

    • On that note, stealing the idea from here, maybe Priya can only bend air and water, while Nikki can only bend earth and fire, reflecting the split of their spirits and personalities.

    • Remnants of the Northern and Southern Water Tribes have drifted to political extremes.

    • The 'wasteland' is populated by spirits, and human opportunists looking to brave it.

    • The Avatars' monkey cat companion is a spirit they befriended in the forest.

    • Spirit Vine technology is taboo and effectively 'lost' after the calamity.

    • The Avatars' Tribe lives atop a Lion Turtle the swamp hid for millennia.

    • The 'nature' of the Foggy Swamp is expanded. For instance, in one chapter, Priya and Nikki manifest and talk to representations of their parents, built from the collective memory of everyone who ever knew them, all connected through vines. It raises existential questions in Korra's head, and parallels with some of the spirit-based technology the rest of the world has developed.


    ...So, those are my scattered thoughts so far.

    Does that sound like a sane, plausible base for a post-LoK story? Do you think any of it would fit into canon? I particularly like the idea of a 'metal lavabending' canon companion, and maybe some more futuristic elements in the havens that do exist.

    37
    submitted 1 month ago* (last edited 1 month ago) by brucethemoose to c/avatar
     

    Most details are in the article ^

    Reddit source of images: https://www.reddit.com/r/TheLastAirbender/comments/1hi2tte/more_confirmation_on_the_leaks_this_was_using_the/

    I find this interesting! Post apocalyptic is a good way to "reset" the world, and the idea of twin Avatars has been batted around the fandom for some time.

    56
    submitted 1 month ago* (last edited 1 month ago) by brucethemoose to c/technology
     

    Maybe even 32GB if they use newer ICs.

    More explanation (and my source of the tip): https://www.pcgamer.com/hardware/graphics-cards/shipping-document-suggests-that-a-24-gb-version-of-intels-arc-b580-graphics-card-could-be-heading-to-market-though-not-for-gaming/

    Would be awesome if true, and if it's affordable. Screw Nvidia (and, inexplicably, AMD) for their VRAM gouging.

    327
    submitted 3 months ago* (last edited 3 months ago) by brucethemoose to c/selfhosted
     

    I see a lot of talk of Ollama here, which I personally don't like because:

    • The quantizations they use tend to be suboptimal

    • It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

    • It abstracts away things that you should really know for hosting LLMs.

    • I don't like some things about the devs. I won't rant, but I especially don't like the hint they're cooking up something commercial.

    So, here's a quick guide to get away from Ollama.

    • First step is to pick your OS. Windows is fine, but if you're setting up something new, Linux is best. I favor CachyOS in particular, for its great Python performance. If you use Windows, be sure to enable hardware-accelerated GPU scheduling and disable shared memory (sysmem) fallback.

    • Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you.

    • Install Python 3.11.x, 3.12.x, or at least whatever your distro supports, and git. If on linux, also install your distro's "build tools" package.

    Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations, forgive the Reddit link but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

    As far as I am concerned, 3 matter to "home" hosters on consumer GPUs:

    • Exllama (and by extension TabbyAPI): a very fast, very memory-efficient "GPU only" runtime. Supports AMD via ROCm and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI

    • Aphrodite Engine: while not strictly as VRAM-efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun, plus more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine

    • This fork of kobold.cpp, which supports more fine-grained KV cache quantization (we will get to that). It supports CPU offloading and, I think, Apple Metal: https://github.com/Nexesenex/croco.cpp

    Now, there are also reasons I don't like llama.cpp, but one of the big ones is that sometimes its model implementations have... quality-degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough VRAM to avoid offloading to CPU, and can figure out how to set it up.

    Setup can go wrong; if anyone gets stuck, I can help with that.

    • Next, figure out how much VRAM you have.

    • Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K," that means it can support 8K tokens as input, or a bit under 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, fit nearly a word per token, while others are more in the ballpark of half a word per token or less.
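To make that budgeting concrete, here's a rough back-of-the-envelope estimator. The ~4 characters per token figure is a generic rule of thumb for English text on modern BPE tokenizers, not a property of any particular model:

```python
# Rough token budgeting. The ~4 chars/token figure is a generic rule of
# thumb for English text on modern BPE tokenizers, not an exact value.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)

def fits_in_context(text: str, context_len: int = 8192,
                    reserve_for_reply: int = 1024) -> bool:
    """True if the prompt leaves room for the reply within the window."""
    return estimate_tokens(text) + reserve_for_reply <= context_len

doc = "word " * 6000  # 30,000 characters -> ~7,500 estimated tokens
print(estimate_tokens(doc), fits_in_context(doc))
```

For a real answer, run the model's actual tokenizer over your text; this is just for picking a ballpark context size before downloading anything.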

    • Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER

    • Exllama has a feature called "KV cache quantization" that can dramatically shrink the VRAM the "context" of an LLM takes up. Unlike llama.cpp's, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.
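For intuition on why cache quantization matters so much, here's the back-of-the-envelope math. The layer/head dimensions below are illustrative placeholders, not any specific model's real config:

```python
# KV cache VRAM: 2 tensors (K and V) per layer, each [kv_heads, head_dim]
# per token. Dimensions here are illustrative, not a real model's config.
def kv_cache_gb(context, layers=40, kv_heads=8, head_dim=128, bits=16):
    total_bytes = 2 * layers * kv_heads * head_dim * context * bits / 8
    return total_bytes / 1024**3

for bits in (16, 8, 4):  # FP16, Q8, Q4
    print(f"{bits}-bit cache: {kv_cache_gb(80_000, bits=bits):.1f} GB at 80K context")
```

With these made-up dimensions, dropping from FP16 to Q4 cuts an 80K-context cache from ~12 GB to ~3 GB, which is how a huge context can squeeze in alongside the weights.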

    • With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2

    • There are all sorts of finetunes... and a lot of straight-up garbage. But I will post some general recommendations based on total vram:

    • 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.

    • 6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2

    • 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this VRAM pool (albeit with very short context/tight quantization on the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2

    • 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 32B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.

    • 20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, or anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more-or-less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 32B, which still has a very respectable 32K native context, and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

    • 32GB: same as 24GB, just with a higher-bpw quantization. But this is also the threshold where lower-bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.

    • 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)

    Again, browse huggingface and pick an exl2 quantization that will cleanly fill your vram pool + the amount of context you want to specify in TabbyAPI. Many quantizers such as bartowski will list how much space they take up, but you can also just look at the available filesize.
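If a quant doesn't list its size, you can ballpark it yourself: the weight file is roughly parameters × bits-per-weight / 8 bytes. A minimal sketch (the 1 GB overhead figure for activations/CUDA context is my own rough assumption, and real quant files carry some extra overhead):

```python
# Ballpark: exl2 weight file size ≈ params * bpw / 8 bytes. The 1 GB
# overhead for activations/CUDA context is a rough assumed figure.
def weights_gb(params_billions: float, bpw: float) -> float:
    return params_billions * 1e9 * bpw / 8 / 1024**3

def fits(vram_gb, params_billions, bpw, cache_gb, overhead_gb=1.0):
    return weights_gb(params_billions, bpw) + cache_gb + overhead_gb <= vram_gb

# A 32B model at 4.25 bpw is ~15.8 GB of weights, leaving room on a
# 24 GB card for a few GB of quantized context cache:
print(f"{weights_gb(32, 4.25):.1f} GB", fits(24, 32, 4.25, cache_gb=4))
```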

    • Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader

    • Put it in your TabbyAPI models folder, and follow the documentation on the wiki.

    • There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of VRAM; lower will save a little VRAM), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the TabbyAPI server, at the cost of more VRAM).
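As an illustration, a minimal config fragment with those options might look like this. The key names follow the settings above, but treat it as a sketch and check the TabbyAPI wiki for the authoritative schema:

```yaml
# Illustrative TabbyAPI model settings; verify names against the wiki.
model:
  model_name: my-exl2-model   # placeholder: folder inside your models directory
  max_seq_len: 32768          # context length you want
  cache_mode: Q4              # Q4 for long context, Q6/Q8 for short
  chunk_size: 2048            # higher = faster long-prompt processing, more VRAM
  tensor_parallel: false      # true only with 2 identical GPUs
  max_batch_size: 1           # raise for multiple simultaneous users
```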

    • Now... pick your frontend. The TabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui

    • And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using "old" sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and rep penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts "wrong language" answers out of the response.
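For instance, that advice translated into an OpenAI-style request body; min_p and repetition_penalty are common TabbyAPI/llama.cpp extensions, but exact parameter names vary by backend, so treat this as a sketch:

```python
import json

# The sampling advice above as an OpenAI-style request body. min_p and
# repetition_penalty are common TabbyAPI/llama.cpp extensions; exact
# parameter names vary by backend.
factual = {
    "temperature": 0.1,          # near-greedy for factual/coding replies
    "min_p": 0.05,               # prunes unlikely ("wrong language") tokens
    "repetition_penalty": 1.01,  # keep very low; high values dumb models down
}
creative = {**factual, "temperature": 0.8}  # loosen up for creative writing

print(json.dumps({"prompt": "...", "max_tokens": 512, **factual}, indent=2))
```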

    • Now, once this is all setup and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize its inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.

    Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.

     

    cross-posted from: https://lemmy.world/post/19925986

    https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e

    Qwen 2.5 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B just came out, with some variants in some sizes just for math or coding, and base models too.

    All Apache licensed, all 128K context, and the 128K seems legit (unlike Mistral).

    And it's pretty sick, with a tokenizer that's more efficient than Mistral's or Cohere's, and benchmark scores even better than Llama 3.1 or Mistral at similar sizes, especially on newer metrics like MMLU-Pro and GPQA.

    I am running the 32B locally, and it seems super smart!

    As long as the benchmarks aren't straight up lies/trained, this is massive, and just made a whole bunch of models obsolete.

    Get usable quants here:

    GGUF: https://huggingface.co/bartowski?search_models=qwen2.5

    EXL2: https://huggingface.co/models?sort=modified&search=exl2+qwen2.5
