this post was submitted on 12 Jun 2023
140 points (91.2% liked)

submitted 2 years ago* (last edited 2 years ago) by Averrin to c/selfhosted
 

Correct me if I'm wrong. I read the ActivityPub standard and dug a little into the Lemmy source to understand how federation works, and I'm a bit disappointed. Every server just has a cache and the ability to fetch something from another known server. So if you start your own instance, the network as a whole gains nothing until that instance has a significant audience (private instances and servers with no users contribute nothing, for example). Are there any "balancers" that put these empty instances to use? Should we promote (or create in the first place) a way to passively help Lemmy cope with such fast growth?

[–] TinfoilBeanieTech 83 points 2 years ago (4 children)

You are right. On the one hand, it's a kind of bad, naive distributed architecture (distributed systems are my day job); it could have been done much better. On the other hand, the more important point is that it demonstrates an alternative to centralized platforms. We'll learn a lot about usage patterns here, get new ideas, and either improve Lemmy or build something better from the ground up. Big thanks to Reddit for driving users this way to test scalability and give us much better knowledge of real usage.

[–] [email protected] 34 points 2 years ago (2 children)

It's not a distributed architecture as you'd normally think of it. It's a decentralised federation, which is an important distinction from your typical distributed-architecture app.

[–] [email protected] 5 points 2 years ago (2 children)

Can you explain what the difference is?

[–] [email protected] 8 points 2 years ago* (last edited 2 years ago)

A distributed architecture generally refers to a single application or service designed to be resilient to individual data center failures. For example, Reddit, a centralized application controlled by Reddit itself, operates data centers around the world to process user transactions. In the event of an outage in a specific location, such as California, Reddit would still be able to function because its infrastructure for handling user requests and serving data would automatically switch to other functioning data centers elsewhere, like Nevada, Arizona, or Washington. This is an example of a distributed architecture.

On the other hand, a decentralized federation does not consist of a single application. Instead, it involves a software platform like Lemmy, which is hosted on multiple individual hosts. When a user signs up with one host, they can interact with users from other hosts, but each host manages its own infrastructure. For instance, someone could host a Lemmy instance on an old laptop they found in their closet and name it ballsuckers.com, while another person could host a Lemmy instance in the cloud with a properly designed distributed architecture and name it bingbong.com. Each host is responsible for managing its own instance. Users from both instances can interact with each other, but if, for example, the hard drive of ballsuckers.com were to fail, the entire ballsuckers.com instance would go down. However, this would not affect bingbong.com because its infrastructure is separate and managed independently.
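To make the failure isolation concrete, here is a minimal sketch. The `Instance` class and `reachable` helper are hypothetical names for illustration, not anything from Lemmy's code. Each host keeps its own state, so taking one offline leaves the other untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One independently run host (hypothetical model, not Lemmy's code)."""
    domain: str
    online: bool = True
    users: set[str] = field(default_factory=set)

    def sign_up(self, name: str) -> str:
        # Accounts live only on the host where they were created.
        if not self.online:
            raise ConnectionError(f"{self.domain} is down")
        self.users.add(name)
        return f"{name}@{self.domain}"


def reachable(instances: list[Instance]) -> list[str]:
    """Domains that can still serve their own users."""
    return [i.domain for i in instances if i.online]


laptop = Instance("ballsuckers.com")   # the old laptop from the closet
cloud = Instance("bingbong.com")       # the properly hosted one
alice = cloud.sign_up("alice")

laptop.online = False                  # the laptop's hard drive fails
still_up = reachable([laptop, cloud])
print(still_up)
```

Only the failed host disappears from the list. Nothing here replicates one host's accounts onto the other, which is the difference from a single distributed application.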

I hope this helps!

[–] TinfoilBeanieTech 1 points 2 years ago

I understand the difference, and that it makes it harder to scale.

[–] [email protected] 20 points 2 years ago (2 children)

it could have been done much better.

Care to expand on this point?

[–] [email protected] 5 points 2 years ago (2 children)

Disclaimer: I've only looked a bit at the protocols and high-level descriptions of how it works, and this is just my understanding of it. But it seems to track.

Let's take [email protected] as an example. Right now lemmy.world is the source of truth for it. If you subscribe to it from a different host, say myawersomeinstance.com, that host first contacts lemmy.world, copies over the posts, and then subscribes to new posts. I'm actually not 100% sure whether lemmy.world contacts myawersomeinstance.com when there's a new post or myawersomeinstance.com polls lemmy.world. Either way, the point is that lemmy.world is the authority. myawersomeinstance.com also has the [email protected] data, but only as a copy; lemmy.world is the only authority. So if you post something, your server sends it to lemmy.world and waits for a reply. Then lemmy.world contacts every instance that has at least one user following the community to tell it about the new post. That new post now exists in a few hundred databases.
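A rough model of that flow, as I understand it. This sketch assumes push delivery (the home instance contacts followers), which I'm not certain about, and the class and method names are made up for illustration rather than taken from Lemmy.

```python
from collections import defaultdict


class Instance:
    """Hypothetical model of one federated server."""

    def __init__(self, domain):
        self.domain = domain
        self.posts = defaultdict(list)     # community -> locally stored posts
        self.followers = defaultdict(set)  # community -> subscribed instances
                                           # (only meaningful on the home instance)

    def subscribe(self, community, home):
        """Copy existing posts from the home instance, then register for updates."""
        self.posts[community] = list(home.posts[community])
        home.followers[community].add(self)

    def submit(self, community, home, text):
        """Send a new post to the authoritative instance and wait for its reply."""
        return home.accept(community, text)

    def accept(self, community, text):
        """Home instance stores the post, then pushes a copy to every follower."""
        self.posts[community].append(text)
        for follower in self.followers[community]:
            follower.posts[community].append(text)
        return len(self.followers[community])  # number of deliveries made


home = Instance("lemmy.world")
home.posts["selfhosted"].append("welcome")

mine = Instance("myawersomeinstance.com")
other = Instance("another.example")
mine.subscribe("selfhosted", home)
other.subscribe("selfhosted", home)

deliveries = mine.submit("selfhosted", home, "hello")
print(deliveries, mine.posts["selfhosted"])
```

One post from one user costs the home instance one delivery per subscribed instance, and every subscriber ends up storing its own copy.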

The problem is the scaling is whack. Okay, you can have 5000 federated servers with users subscribing to [email protected], but that means lemmy.world needs to update 5000 servers per post, the post takes 5000x the storage, and ALL 5000 servers contact lemmy.world to get the new good stuff.
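Back-of-the-envelope numbers for that. The 5000 servers come from the example above; the posts per day and the post size are figures I made up purely to have something to multiply.

```python
def fan_out_cost(subscribed_servers, posts_per_day, post_size_bytes):
    """Deliveries the home instance makes per day, and bytes stored network-wide."""
    deliveries = subscribed_servers * posts_per_day
    replicated_bytes = deliveries * post_size_bytes
    return deliveries, replicated_bytes


# Assumed workload: 1000 posts a day at 2 KiB each (illustrative only).
deliveries, replicated_bytes = fan_out_cost(5000, 1000, 2048)
print(f"{deliveries:,} deliveries/day, {replicated_bytes / 1e9:.2f} GB replicated/day")
```

Under those assumptions the home instance makes five million outbound deliveries a day for a single community, before counting comments and votes.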

Frankly, it's a scaling nightmare. As for a different approach, lemmy.world could sign its updates with a private key, and other instances could verify them with the matching public key and fetch new data from each other. That would also allow more relaxed caching, since re-fetching the data would generally cost less. Right now you need aggressive caching, because you don't want lemmy.world to keel over and die from every server on the planet asking for the latest and greatest posts all the time.
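A toy sketch of the signing idea, so it's clear why a copy fetched from a peer can still be trusted. This uses textbook RSA built on two well-known Mersenne primes with only the standard library. It is deliberately insecure and only illustrates the mechanism; a real system would use a vetted crypto library. All names here are hypothetical.

```python
import hashlib

# Toy key material: textbook RSA from two known Mersenne primes. NOT secure.
P, Q = 2**127 - 1, 2**521 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, kept by the home instance


def digest(payload: bytes) -> int:
    """SHA-256 of the payload as an integer (always smaller than N here)."""
    return int.from_bytes(hashlib.sha256(payload).digest(), "big")


def sign(payload: bytes) -> int:
    """Home instance signs the update with its private key."""
    return pow(digest(payload), D, N)


def verify(payload: bytes, signature: int) -> bool:
    """Any instance checks the update using only the public key (N, E)."""
    return pow(signature, E, N) == digest(payload)


post = b"new post in the community"
sig = sign(post)

# A peer relays (post, sig). The receiver never has to contact the home instance.
relayed_ok = verify(post, sig)
tampered_ok = verify(b"new post in the community, edited by a relay", sig)
print(relayed_ok, tampered_ok)
```

Because a relay cannot forge a valid signature for altered content, instances could pull updates from whichever peer is closest or least loaded and still know the data came from the authority.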

[–] [email protected] 3 points 2 years ago* (last edited 2 years ago)

Thanks for the in-depth write-up! I haven't looked too far into the docs or the subscription model, but is this a fault on Lemmy's end, or is it a function of how ActivityPub handles federated communication? (I'm very new to ActivityPub and federation, and am just now reading through the ActivityPub docs.)

I do like your idea of distributed replication via keys, much better than what I had brainstormed.

Edit: yeah, it does look like it's a function of ActivityPub. I wonder if there's a more scalable federation protocol out there.

[–] [email protected] 2 points 2 years ago (1 children)

Could lemmy.world put a load balancer in front and use that to direct requests to different instances of lemmy.world? Not sure if that question is dumb; I'm not a technical guy.

[–] [email protected] 3 points 2 years ago

It's not dumb at all, and it's a common scaling technique. But the software needs to support it, and I have no idea if Lemmy has support for running multiple instances for one server.
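For anyone wondering what the technique looks like, here is a minimal round-robin sketch. The backend names are placeholders, and this assumes the application can run as several interchangeable copies sharing the same data, which I can't confirm for Lemmy.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next backend in turn."""

    def __init__(self, backends):
        self._backends = cycle(backends)

    def route(self, request):
        # Returns which backend should handle this request.
        return next(self._backends), request


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical app servers
assigned = [lb.route(f"GET /post/{i}")[0] for i in range(5)]
print(assigned)
```

Every copy behind the balancer has to see the same posts and sessions, which is the part the software needs to support.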

[–] TinfoilBeanieTech 1 points 2 years ago

Seeing Lemmy groan under an influx of new users, still a much smaller number than established centralized apps handle, made me start wondering how it would scale to numbers a couple of orders of magnitude larger. I’ve only started diving into the code and architecture, but I’m worried that as the number of instances grows they’ve got an N² connection problem going, with every instance potentially talking to every other one. This is not a simple problem to fix for a federated system, but it’s got to be addressed eventually.
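A rough way to size that worry, assuming the worst case where every instance exchanges activity with every other one (real federation graphs are probably sparser). The function names are just for illustration.

```python
from math import comb


def full_mesh_links(n):
    """Instance-to-instance pairs if everyone federates with everyone: n(n-1)/2."""
    return comb(n, 2)


def hub_links(n):
    """Links if all traffic goes through a single home instance."""
    return n - 1


for n in (10, 500, 5000):
    print(n, full_mesh_links(n), hub_links(n))
```

The pair count grows roughly with the square of the instance count, so a 10x increase in instances means about a 100x increase in potential connections.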

[–] [email protected] 17 points 2 years ago (1 children)

What makes a distributed system good that Lemmy hasn't done? It seems like a pretty robust system to me, and the scaling issues seem to fall on the instance hosts themselves. With Reddit's experience, I don't see how there are issues.

[–] [email protected] -2 points 2 years ago

If there was an easy solution that balanced decent UX and performance, we'd have it by now!

[–] GhostCowboy76 2 points 2 years ago (1 children)
[–] TinfoilBeanieTech 1 points 2 years ago

I’m a Software Architect who has worked on some very large data sets and distributed systems. I’ve never used the title “Data Architect” but I meet the definition. So, yes.