Petals - Run large language models at home, BitTorrent‑style (self.technology)

submitted 2 years ago* (last edited 2 years ago) by Blaed to c/[email protected]

cross-posted from: https://lemmy.world/post/1535820

I'd like to share with you Petals: decentralized inference and finetuning of large language models

https://petals.ml/

https://research.yandex.com/blog/petals-decentralized-inference-and-finetuning-of-large-language-models

What is Petals?

Run large language models at home, BitTorrent‑style

Run large language models like LLaMA-65B, BLOOM-176B, or BLOOMZ-176B collaboratively — you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning. Single-batch inference runs at 5-6 steps/sec for LLaMA-65B and ≈ 1 step/sec for BLOOM — up to 10x faster than offloading, enough for chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec. Beyond classic language model APIs — you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch.
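To put those speeds in perspective, here is a rough back-of-the-envelope sketch. It assumes one inference step produces one token, and the 100-token reply length is a made-up illustrative figure, not something from the Petals announcement.

```python
def seconds_for_reply(num_tokens: int, steps_per_sec: float) -> float:
    """Estimate the wall-clock time to generate a reply.

    Assumption: one inference step yields one token, so the time is
    simply tokens divided by steps per second.
    """
    if steps_per_sec <= 0:
        raise ValueError("steps_per_sec must be positive")
    return num_tokens / steps_per_sec


if __name__ == "__main__":
    reply_tokens = 100  # hypothetical reply length
    # Speeds quoted in the post: 5-6 steps/sec (LLaMA-65B), about 1 step/sec (BLOOM).
    for label, speed in [("LLaMA-65B, low end", 5.0),
                         ("LLaMA-65B, high end", 6.0),
                         ("BLOOM-176B", 1.0)]:
        print(f"{label}: {seconds_for_reply(reply_tokens, speed):.1f} s")
```

Under these assumptions a reply streams in over tens of seconds, which is why the quoted speeds are described as enough for interactive use.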

Colab Link

GitHub Docs

Overview of the Approach

On a surface level, Petals works as a decentralized pipeline designed for fast inference of neural networks. It splits any given model into several blocks (or layers) that are hosted on different servers. These servers can be spread out across continents, and anybody can connect their own GPU! In turn, users can connect to this network as a client and apply the model to their data. When a client sends a request to the network, it is routed through a chain of servers that is built to minimize the total forward pass time. Upon joining the system, each server selects the best set of blocks to host based on the current bottlenecks within the pipeline. Below, you can see an illustration of Petals for several servers and clients running different inputs for the model.
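The two ideas in that paragraph (routing a request through the fastest chain of servers, and a new server picking blocks where the pipeline is weakest) can be illustrated with a toy model. This is only a sketch under my own assumptions, not the actual Petals algorithm: the `Server` fields, the cost model (one network hop plus per-block compute time), and the "weakest window" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    start: int            # first hosted block (inclusive)
    end: int              # last hosted block (exclusive)
    sec_per_block: float  # hypothetical compute time per block
    hop_latency: float    # hypothetical network cost to reach this server


def fastest_chain(servers, num_blocks):
    """Find the cheapest chain of servers covering blocks 0..num_blocks.

    Dynamic programming over block boundaries: best[b] is the minimum
    time needed to finish the first b blocks.
    """
    inf = float("inf")
    best = [inf] * (num_blocks + 1)
    choice = [None] * (num_blocks + 1)  # (previous boundary, server name)
    best[0] = 0.0
    for b in range(num_blocks):
        if best[b] == inf:
            continue  # boundary b is unreachable
        for s in servers:
            if s.start <= b < s.end:
                # Server s may run any prefix of its remaining blocks.
                for nxt in range(b + 1, min(s.end, num_blocks) + 1):
                    cost = best[b] + s.hop_latency + (nxt - b) * s.sec_per_block
                    if cost < best[nxt]:
                        best[nxt] = cost
                        choice[nxt] = (b, s.name)
    if best[num_blocks] == inf:
        raise ValueError("no chain of servers covers every block")
    chain, b = [], num_blocks
    while b > 0:
        prev, name = choice[b]
        chain.append((name, prev, b))
        b = prev
    chain.reverse()
    return best[num_blocks], chain


def choose_blocks(block_throughput, capacity):
    """Pick the contiguous window of `capacity` blocks that is the current bottleneck.

    Windows are ranked by their weakest block first, then by total throughput.
    """
    if not 0 < capacity <= len(block_throughput):
        raise ValueError("capacity must be between 1 and the number of blocks")

    def weakness(start):
        window = block_throughput[start:start + capacity]
        return (min(window), sum(window))

    start = min(range(len(block_throughput) - capacity + 1), key=weakness)
    return start, start + capacity


if __name__ == "__main__":
    servers = [
        Server("A", 0, 4, 0.010, 0.050),
        Server("B", 4, 8, 0.010, 0.050),
        Server("C", 0, 8, 0.030, 0.020),
    ]
    print(fastest_chain(servers, 8))
    print(choose_blocks([5.0, 5.0, 1.0, 2.0, 6.0, 6.0], 2))
```

In this toy setup two fast servers chained together beat a single slower server that hosts everything, even after paying for the extra network hop, and a joining server with room for two blocks lands on the slowest stretch of the pipeline.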

Benchmarks

We compare the performance of Petals with offloading, as it is the most popular method for using 100B+ models on local hardware. We test both single-batch inference as an interactive setting and parallel forward pass throughput for a batch processing scenario. Our experiments are run on BLOOM-176B and cover various network conditions, from a few high-speed nodes to real-world Internet links. As you can see from the table below, Petals is predictably slower than offloading in terms of throughput but 3–25x faster in terms of latency when compared in a realistic setup. This means that inference (and sometimes even finetuning) is much faster with Petals, despite the fact that we are using a distributed model instead of a local one.

Conclusion

Our work on Petals continues the line of research towards making the latest advances in deep learning more accessible for everybody. With this work, we demonstrate that it is feasible not only to train large models with volunteer computing, but to run their inference in such a setup as well. The development of Petals is an ongoing effort: it is fully open-source (hosted at https://github.com/bigscience-workshop/petals), and we would be happy to receive any feedback or contributions regarding this project!

You can read the full article here

top 9 comments

[–] muzzle 15 points 2 years ago

I honestly can't tell if any of this really has a future, but it should be super interesting.

[–] baascus 9 points 2 years ago

We will watch your career with great interest

[–] Zeth0s 6 points 2 years ago (1 children)

Has anyone tried it?

[–] [email protected] 5 points 2 years ago (1 children)

I just did via http://chat.petals.ml/ - was interesting enough to transcribe and post the results, although it did crash once I got deeper into the analysis due to rate limiting. It definitely has potential.

[–] Zeth0s 4 points 2 years ago

Thanks. The main problem I see with p2p is that it needs to gain a bit of traction and an active community behind it. Let's see if it gets the traction it needs.

[–] [email protected] 2 points 2 years ago (2 children)

Isn't this what the koboldai horde already does?

[–] [email protected] 2 points 2 years ago

Well, that's a new project for me to look into, thanks. I have only seen Petals as a distributed inference engine, so seeing more in this space would be promising.

[–] Blaed 1 points 2 years ago* (last edited 2 years ago)

The KoboldAI Horde was the first thing that came to my mind when I heard about this too. After some research, it appears Petals and the AI-Horde are similar in concept, but different in strategy and execution.

The Kobold AI-Horde utilizes a 'kudos-based economy' to prioritize render/processing queues.

Petals seems to utilize a different routing/queue mechanism that prioritizes optimization over participation.

So you're not wrong. The AI-Horde accomplishes crowd compute through a similar high-level approach. The biggest difference (at a glance) seems to be how I/O is handled and prioritized on the two platforms. That's a bit of an oversimplification, but it communicates the idea.

I really like the concept of crowd compute, but I'm not sure it'll get as popular as it needs to be to rival emerging (corporate) exaflops of compute. I hope Petals & AI-Horde benefit from the mutual competition. It would be really cool to see a future where George Hotz & tinycorp actually commoditize the petaflop for consumers. Maybe then crowd compute can begin to rival some of these big tech entities that otherwise dwarf available silicon.

[–] [email protected] 0 points 2 years ago

Looks pretty neat!