submitted 10 months ago by ruud to c/lemmyworld

So I've been troubleshooting the federation issues with some other admins:

(Thanks for the help)

So what we see is that when there are many federation workers running at the same time, they get too slow, causing them to time out and fail.

I had federation workers set to 200000. I've now lowered that to 8192, and set the activitypub logging to debug to get queue stats:

RUST_LOG="warn,lemmy_server=warn,lemmy_api=warn,lemmy_api_common=warn,lemmy_api_crud=warn,lemmy_apub=warn,lemmy_db_schema=warn,lemmy_db_views=warn,lemmy_db_views_actor=warn,lemmy_db_views_moderator=warn,lemmy_routes=warn,lemmy_utils=warn,lemmy_websocket=warn,activitypub_federation=debug"
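For anyone who wants to tweak that filter without hand-editing the long string, here is a minimal sketch in Python that assembles the same kind of RUST_LOG value. The module names are the ones from the post; the helper name build_rust_log is made up for illustration and is not part of Lemmy.

```python
def build_rust_log(default_level, module_levels):
    """Join a default level and per-module levels into one filter string.

    build_rust_log is a hypothetical helper, not a Lemmy or Rust API.
    """
    directives = [default_level]
    directives += [f"{module}={level}" for module, level in module_levels.items()]
    return ",".join(directives)


# Module names copied from the RUST_LOG value in the post.
lemmy_modules = [
    "lemmy_server", "lemmy_api", "lemmy_api_common", "lemmy_api_crud",
    "lemmy_apub", "lemmy_db_schema", "lemmy_db_views", "lemmy_db_views_actor",
    "lemmy_db_views_moderator", "lemmy_routes", "lemmy_utils", "lemmy_websocket",
]

# Keep every Lemmy module at "warn", then raise only the federation library.
levels = {name: "warn" for name in lemmy_modules}
levels["activitypub_federation"] = "debug"

rust_log = build_rust_log("warn", levels)
print(f'RUST_LOG="{rust_log}"')
```

Changing a single entry in levels (for example back to "warn" once the queue stats are no longer needed) regenerates the whole string.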

Also, I saw that there were many workers retrying deliveries to servers that are unreachable. So, I've blocked some of these servers:

commallama.social,mayheminc.win,lemmy.name,lm.runnerd.net,frostbyrne.io,be-lemmy.org,lemmonade.marbledfennec.net,lemmy.sarcasticdeveloper.com,lemmy.kosapps.com,pawb.social,kbin.wageoffsite.com,lemmy.iswhereits.at,lemmy.easfrq.live,lemmy.friheter.com,lmy.rndmm.us,kbin.korgen.xyz

This gave good results: far fewer active workers, and so fewer timeouts. (I see that timeouts start above 3000 active workers.)

(If you own one of these servers, let me know once it's back up, so I can un-block it)

Now it's after midnight so I'm going to bed. Surely more troubleshooting will follow tomorrow and over the weekend.

Please let me know if you see improvements, or have many issues still.

[-] phiresky 49 points 10 months ago* (last edited 10 months ago)

I want to say that with 0.18 the definition of federation_workers has changed massively due to the improved queue. As in, whatever is good in 0.17 is not necessarily good for 0.18.

On 0.18, it probably makes sense to have it around 100 to 10'000. Setting it to 0 is also an option (unlimited, that's the default). Anything much higher is probably a bad idea.

On 0.18, retry tasks are also split into a separate queue which should improve things in general.

Setting it to 0 might have performance issues, since every federation task is one async task with the same scheduling priority as any other async task (like UI / user API requests). So if 10k federation tasks and 100 API requests are running, tokio will schedule the API requests with probability 100 / (10k + 100), which is about 1% (if everything is CPU-limited). (I think; I'm not 100% sure how tokio scheduling works.)
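To put a number on that estimate, here is a small Python sketch of the arithmetic. It assumes the simplified model from the comment (every runnable task is equally likely to be picked next), which is not a description of how tokio's scheduler actually behaves; the function name api_share is made up for illustration.

```python
def api_share(federation_tasks, api_tasks):
    """Fraction of scheduling slots going to API requests.

    Assumes a uniform pick among all runnable tasks, which is a
    simplification and not tokio's real scheduling algorithm.
    """
    total = federation_tasks + api_tasks
    if total == 0:
        raise ValueError("no tasks are running")
    return api_tasks / total


# The example from the comment: 10k federation tasks, 100 API requests.
share = api_share(10_000, 100)
print(f"{share:.2%}")  # prints 0.99%
```

Under the same model, capping the workers at 100 would leave the API requests with 100 / (100 + 100) = 50% of the slots, which is the intuition behind not leaving the limit unlimited.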

this post was submitted on 29 Jun 2023
335 points (98.8% liked)
