this post was submitted on 29 Jun 2023
335 points (98.8% liked)

Lemmy.World Announcements


So I've been troubleshooting the federation issues with some other admins:

(Thanks for the help)

So what we see is that when too many federation workers run at the same time, they all slow down, causing them to time out and fail.

I had federation workers set to 200000. I've now lowered that to 8192, and set the activitypub logging to debug to get queue stats:

RUST_LOG="warn,lemmy_server=warn,lemmy_api=warn,lemmy_api_common=warn,lemmy_api_crud=warn,lemmy_apub=warn,lemmy_db_schema=warn,lemmy_db_views=warn,lemmy_db_views_actor=warn,lemmy_db_views_moderator=warn,lemmy_routes=warn,lemmy_utils=warn,lemmy_websocket=warn,activitypub_federation=debug"
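For reference, here is a minimal sketch (assuming the tracing / tracing-subscriber stack that Lemmy's logging is built on) of how a per-crate directive string like the one above gets turned into a log filter:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Sketch only: parse a RUST_LOG-style directive string into a filter.
    // Most crates stay at `warn`; only activitypub_federation is raised to
    // `debug`, which is what makes the activity-queue stats appear in the logs.
    // (Requires tracing-subscriber's "env-filter" feature.)
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("warn,activitypub_federation=debug"));

    tracing_subscriber::fmt().with_env_filter(filter).init();

    tracing::debug!("this event is filtered out, since the default level here is warn");
}
```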

Also, I saw that many workers were retrying deliveries to servers that are unreachable, so I've blocked some of those servers:

commallama.social,mayheminc.win,lemmy.name,lm.runnerd.net,frostbyrne.io,be-lemmy.org,lemmonade.marbledfennec.net,lemmy.sarcasticdeveloper.com,lemmy.kosapps.com,pawb.social,kbin.wageoffsite.com,lemmy.iswhereits.at,lemmy.easfrq.live,lemmy.friheter.com,lmy.rndmm.us,kbin.korgen.xyz

This gave good results: far fewer active workers, so fewer timeouts. (I see that timeouts start once there are more than about 3000 active workers.)

(If you own one of these servers, let me know once it's back up, so I can un-block it)

Now it's after midnight, so I'm going to bed. More troubleshooting will surely follow tomorrow and over the weekend.

Please let me know if you see improvements, or if you're still having many issues.

top 28 comments
[โ€“] phiresky 49 points 2 years ago* (last edited 2 years ago)

I want to say that with 0.18 the definition of federation_workers has changed massively due to the improved queue. As in, whatever is good in 0.17 is not necessarily good for 0.18.

On 0.18, it probably makes sense to have it around 100 to 10,000. Setting it to 0 is also an option (unlimited; that's the default). Anything much higher is probably a bad idea.

On 0.18, retry tasks are also split into a separate queue which should improve things in general.

0 might have perf issues, since every federation task is one task with the same scheduling priority as any other async task (like UI / user API requests). So if 10k federation tasks and 100 API requests are running, tokio will schedule the API requests with probability 100 / (10k + 100) (if everything is CPU-limited). (I think; I'm not 100% sure how tokio scheduling works.)
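A rough illustration of that scheduling point (not Lemmy code, just a sketch of federation and API work sharing one tokio runtime):

```rust
use tokio::task;

#[tokio::main]
async fn main() {
    // Both kinds of work are plain tokio tasks on the same runtime, so with
    // 10_000 of one and 100 of the other, an API task only gets roughly
    // 100 / 10_100 (about 1%) of the scheduler's attention when everything
    // is CPU-bound.
    for _ in 0..10_000 {
        task::spawn(async { /* simulated federation delivery */ });
    }
    for _ in 0..100 {
        task::spawn(async { /* simulated API request */ });
    }
}
```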

[โ€“] chaos 42 points 2 years ago (2 children)

Holy fuck, 200k workers!? I'm not familiar with Lemmy internals, but I've literally never seen any program run anything close to well at levels that high. Want some help from someone who is a DevOps engineer by day? I think I remember you saying you're a psql DBA professionally, so maybe my experience could help out?

[โ€“] ruud 19 points 2 years ago

Thanks, made a note of that

[โ€“] [email protected] 1 points 2 years ago* (last edited 2 years ago)

they should make a union

[โ€“] [email protected] 39 points 2 years ago (1 children)

Ideally, we can fix this in the software eventually (most likely it has already been improved a lot in 0.18.1 - we'll find out for sure when lemmy.world upgrades), but for now it really does seem that defederating offline servers will massively improve the success rate of federated posts and comments reaching other instances.

[โ€“] [email protected] 11 points 2 years ago (1 children)

Would un-linking them be sufficient?

[โ€“] sunaurus 12 points 2 years ago

Yep, anything that will get your instance to stop sending activities to an unresponsive instance will help (at least for sure on 0.17.4)

[โ€“] [email protected] 22 points 2 years ago (2 children)

Thanks for your work and sharing results!

I think that kbin and lemmy are going to ultimately have to record per-instance response time and back off on a given instance. Like, if another instance is failing or overloaded, it's going to have to reduce the frequency with which it attempts to communicate with that instance, to avoid having a ton of workers tied up trying to communicate with that instance.

[โ€“] [email protected] 8 points 2 years ago

I'd probably recommend exponential backoff with a low maximum retry count.
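For illustration, a minimal exponential-backoff sketch; the 10-second base and the 5-retry cap are arbitrary values, not anything taken from Lemmy:

```rust
use std::time::Duration;

/// Sketch: compute the retry delays for exponential backoff, doubling the
/// delay on each attempt up to a small maximum retry count.
fn backoff_delays(base: Duration, max_retries: u32) -> Vec<Duration> {
    (0..max_retries)
        .map(|attempt| base * 2u32.pow(attempt))
        .collect()
}

fn main() {
    // With a 10 s base and 5 retries: 10 s, 20 s, 40 s, 80 s, 160 s, then stop.
    for (i, delay) in backoff_delays(Duration::from_secs(10), 5).iter().enumerate() {
        println!("retry {} after {:?}", i + 1, delay);
    }
}
```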

[โ€“] NuclearArmWrestling 5 points 2 years ago

Ideally, multiple instances could band together and create something like a hub that they all push and pull from. It's a little more centralization, but would likely significantly reduce overall network and CPU consumption.

[โ€“] [email protected] 16 points 2 years ago (1 children)

Admins: FYI on lemmy logs:

Example of a federation message success (HTTP Response 200):

INFO actix_web::middleware::logger: 12.34.56.78 'POST /inbox HTTP/1.1' 200 0 '-' 'Lemmy/0.17.4; +https://remote.lemmy' 0.145673

and failure (HTTP Response 400):

INFO actix_web::middleware::logger: 12.34.56.78 'POST /inbox HTTP/1.1' 400 65 '-' 'Lemmy/0.17.4; +https://remote.lemmy' 0.145673

These are usually soon followed by a more verbose reason, such as "Http Signature is expired".

Lemmy is far better at logging INCOMING stuff than it is OUTGOING. You can grep for activity_queue to get a sense of whether there are issues. This is not good:

Target server https://lemmy.lemmy/inbox rejected https://lemmy.local/activities/announce/9852ff01-c768-484b-a38da-da021cd1333, aborting

Also there are stats indicating "pile-ups":

Activity queue stats: pending: 1, running: 1, retries: 706, dead: 0, complete: 12
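As a sketch of how those stats lines could be watched automatically (the parsing below just follows the example line above; it is not part of Lemmy):

```rust
/// Sketch: pull the retry count out of an "Activity queue stats" log line so
/// a monitoring script can alert when retries pile up.
fn retries_in(line: &str) -> Option<u64> {
    let rest = line.split("retries: ").nth(1)?;
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

fn main() {
    let line = "Activity queue stats: pending: 1, running: 1, retries: 706, dead: 0, complete: 12";
    if let Some(retries) = retries_in(line) {
        if retries > 100 {
            println!("federation retries piling up: {retries}");
        }
    }
}
```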

[โ€“] [email protected] 6 points 2 years ago

This should hopefully be fixed with 0.18.1-rc.4 which has this PR merged: https://github.com/LemmyNet/lemmy/pull/3379

[โ€“] dragontamer 12 points 2 years ago* (last edited 2 years ago) (3 children)

So as of right now, https://lemmy.ca still seems bugged.

These two show 0 comments here on lemmy.world, while comments clearly exist over at lemmy.ca.

The opposite here: I made a test post at [email protected], so lemmy.world thinks there is +1 comment. But the true instance at http://lemmy.ca/c/microcontroller sees 0 comments, so my comment fails to traverse the federation to lemmy.ca.

So both imports and exports to these communities on lemmy.ca seem bugged.


EDIT: I should note that I was under "Subscription Pending" for days. I decided to unsubscribe (erm... stop pending?) and then re-subscribe. I'm now stuck on "Subscription Pending" again.

[โ€“] mintiefresh 1 points 2 years ago

I'm having very similar issues on my lemmy.ca account as well.

[โ€“] mintiefresh 1 points 2 years ago

> The opposite here: I made a test post at [email protected], so lemmy.world thinks there is +1 comment. But the true instance at http://lemmy.ca/c/microcontroller sees 0 comments, so my comment fails to traverse the federation to lemmy.ca.

I am having similar issues with lemmy.ca as well. I have accounts on both and will just bounce back and forth until it's fixed before finding a landing spot.

[โ€“] [email protected] 1 points 2 years ago* (last edited 2 years ago)

I'm having very similar issues between my Lemmy.ca and Lemmy.world accounts as well.

I was actually trying to post this comment using .world, but it was lagging. So I switched over to my .ca account lol.

[โ€“] Magiwarriorx 12 points 2 years ago* (last edited 2 years ago) (1 children)

Posted this last night, but reposting for visibility:

To those experiencing federation issues with communities that aren't local: make sure to properly set your language in your profile! I thought my off-instance communities were having extremely slow federation, but the issue was I didn't have English as one of my profile languages.

[โ€“] ptrknvk 9 points 2 years ago (2 children)

I've read about it, and I don't understand. I have a list of the languages in the settings, but all of them are shown and I can't remove or add any.

[โ€“] [email protected] 15 points 2 years ago* (last edited 2 years ago)

The dialog is very bad.

To select the languages you want, hold Ctrl and click each language to highlight it.

With the languages you want highlighted, save the settings.

[โ€“] Magiwarriorx 12 points 2 years ago

The list is to select from, not what you have selected. If you click one it should highlight it blue. Then hit save at the bottom of the profile.

[โ€“] [email protected] 11 points 2 years ago* (last edited 2 years ago) (1 children)

It's only been a few minutes, but I'm seeing federation requests that aren't timing out in my nginx access log. Hopefully it keeps working.
Also, at least on my instance, lemmy.ml has completely broken: I'm not getting anything from it at all anymore. It dropped out at 13:52:22 and, aside from a couple of messages, it's been silent since then. It seems to be working on lemmy.world, so I'm not sure what's causing that.

[โ€“] [email protected] 2 points 2 years ago* (last edited 2 years ago) (1 children)

How are you monitoring this? Was it just a close look at the nginx log, or something else like Graylog?

[โ€“] [email protected] 2 points 2 years ago* (last edited 2 years ago)

I just grepped the nginx access log for the lemmy.world IP address and looked at the access times. You can see it's timing out if the response code is 400. Sadly, ~57% of the requests are timing out today. It seems to work for a bit and then time out for about 10 minutes, but at least some requests are coming through now; before, they had stopped completely.

lemmy.ml is also back in my logs. Yay.
lemmy.ml is almost perfect on the timeouts, so they must have managed to fix it.
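For anyone wanting to automate that check, here is a rough sketch; it assumes the default nginx combined log format, and the log path and peer IP below are placeholders:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

// Sketch of the same check done by hand with grep: count how many requests
// from one peer's IP got a 400 back.
fn main() -> std::io::Result<()> {
    let peer_ip = "12.34.56.78";
    let (mut total, mut failed) = (0u64, 0u64);

    for line in BufReader::new(File::open("/var/log/nginx/access.log")?).lines() {
        let line = line?;
        if !line.starts_with(peer_ip) {
            continue;
        }
        total += 1;
        // In the combined format the status code is the field right after the
        // quoted request, e.g. ... "POST /inbox HTTP/1.1" 400 65 ...
        if let Some(after_request) = line.split("\" ").nth(1) {
            if after_request.starts_with("400") {
                failed += 1;
            }
        }
    }

    println!("{failed} of {total} requests from {peer_ip} returned 400");
    Ok(())
}
```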

[โ€“] BitOneZero 6 points 2 years ago (1 children)

From what I've seen, there is a 10-second hard-coded timeout for HTTP, which seems too low for the kind of load going on, especially if the server is opening tons of connections to the same peer server.

[โ€“] [email protected] 3 points 2 years ago (1 children)
[โ€“] BitOneZero 3 points 2 years ago

That's for encryption signatures.

I'm talking about HTTP timeouts for connection:

https://github.com/LemmyNet/lemmy/blob/0f91759e4d1f7092ae23302ccb6426250a07dab2/src/lib.rs#L39

/// Max timeout for http requests
pub(crate) const REQWEST_TIMEOUT: Duration = Duration::from_secs(10);
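For illustration, roughly how a constant like that gets applied to a reqwest client (a sketch of the idea, not necessarily Lemmy's exact wiring):

```rust
use std::time::Duration;
use reqwest::Client;

// Sketch: a single crate-wide timeout applied to every outgoing request,
// including federation deliveries to other instances.
const REQWEST_TIMEOUT: Duration = Duration::from_secs(10);

fn build_client() -> reqwest::Result<Client> {
    Client::builder()
        // Any request still running after this long is aborted.
        .timeout(REQWEST_TIMEOUT)
        .build()
}

fn main() {
    let _client = build_client().expect("failed to build http client");
}
```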
[โ€“] Denuath 5 points 2 years ago

My feeling is that the update has improved the situation. For example, when I filter by Hot, significantly more recent posts appear than before.
