Lemmy Project Priorities Observations

The data that long-running servers hold has been difficult to reproduce. Some of the data on these servers may even come from older versions, in patterns that testing on new code doesn't generate.
I've raised my voice loudly in the meta communities and on GitHub, and created new communities dedicated to these issues.
I feel like the performance problems have been ignored for over 30 days, even though there are a half-dozen fixes that one person could code in 5 to 10 hours of labor.
I've been developing client/server messaging apps professionally since 1984, and I firmly believe that Lemmy is currently suffering from a lack of testing by the developers and a lack of concern for data loss. A basic e-mail MTA in 1993 would send a "did not deliver" message back to the sender, but Lemmy just drops the delivery, and there is no mention of this in the release notes or the introduction on GitHub. I also find that the Lemmy developers do not like to "eat their own dog food" and actually use Lemmy communities to discuss the ongoing development and priorities of Lemmy coding. They are not testing the code and sampling the data very much, and I am posting here, using Lemmy code, as part of my own testing. I spent over 100 hours in June 2023 testing Lemmy's technical problems, especially performance and lost data delivery.
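For what it's worth, one way to spot silently dropped deliveries from the outside is to compare the newest posts in a community as seen from its home instance against what a federated instance actually received. This is only a rough sketch, assuming the v3 HTTP API and that `community_name` accepts the `name@domain` form; the instance URLs and community name are just examples, not a specific broken community.

```python
# Hedged sketch: spot-check whether recent posts on a home instance actually
# arrived at a remote instance. Assumes GET /api/v3/post/list and that
# `community_name` accepts "name@domain". Names/URLs below are examples only.
import requests

HOME = "https://lemmy.ml"               # instance where the community lives (example)
REMOTE = "https://sh.itjust.works"      # instance that should receive copies (example)
COMMUNITY = "lemmy_support"             # hypothetical community name

def recent_ap_ids(instance: str, community_name: str, limit: int = 20) -> set[str]:
    """Return the ActivityPub IDs of the newest posts visible on `instance`."""
    resp = requests.get(
        f"{instance}/api/v3/post/list",
        params={"community_name": community_name, "sort": "New", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return {p["post"]["ap_id"] for p in resp.json()["posts"]}

home_ids = recent_ap_ids(HOME, COMMUNITY)
remote_ids = recent_ap_ids(REMOTE, f"{COMMUNITY}@{HOME.removeprefix('https://')}")

missing = home_ids - remote_ids
print(f"{len(missing)} of {len(home_ids)} recent posts never arrived at {REMOTE}")
for ap_id in sorted(missing):
    print("  missing:", ap_id)
```

A probe like this at least turns "delivery was silently dropped" into something measurable, even without a bounce message from the server itself.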
I'll toss it into this echo chamber.
The bot cleanup work has some interesting numbers regarding data in the database: https://sh.itjust.works/post/1823812
lemmy.ml has a much wider range of dates on communities, posts, comments, and user accounts than anything new testing would generate. Even if you install a test server with the same quantity of data, the date patterns would come out very different from the organically grown lemmy.ml.
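One way to make that concrete is to dump the monthly distribution of post dates from both databases and compare them side by side. A rough sketch, assuming direct read access to both PostgreSQL databases and the standard Lemmy schema (a `post` table with a `published` timestamp column); the connection strings are placeholders:

```python
# Hedged sketch: compare how post creation dates are distributed on a
# long-running server versus a freshly seeded test server. Assumes direct
# PostgreSQL access and a `post` table with a `published` column.
import psycopg2

HISTOGRAM_SQL = """
    SELECT date_trunc('month', published) AS month, count(*) AS posts
    FROM post
    GROUP BY 1
    ORDER BY 1;
"""

def monthly_histogram(dsn: str) -> dict:
    """Return {month: post_count} for one database."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(HISTOGRAM_SQL)
        return {month: count for month, count in cur.fetchall()}

live = monthly_histogram("dbname=lemmy host=live-replica")   # placeholder DSN
test = monthly_histogram("dbname=lemmy host=test-server")    # placeholder DSN

for month in sorted(set(live) | set(test)):
    print(month.date(), "live:", live.get(month, 0), "test:", test.get(month, 0))
```

The same idea extends to comments and accounts; the point is that seeded test data tends to bunch up in a few recent months while the organic data spreads across years.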
All I know is that lemmy.ml errors out every single day during routine browsing, and I haven't seen a website throw this many errors in many, many years. Account deletes could also be causing these 2-to-4-minute periods of overload, even with the 0.18.3 fixes.
The live-server data not being reproducible on testing systems...
I have avoided running performance tests against live servers via the API because I didn't want to contribute to the server overloads that were already happening when I just visited the normal interface to read content.
Maybe it's time to go back to that approach and see if scheduled jobs can be detected. Can I tell when a server has updated rank based on the output of a post listing? Can I detect lag while PostgreSQL is writing that batch data?
Are PostgreSQL overloads happening at similar times on multiple servers when a ban with content removal or an account delete replicates?
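If I go down that road, the probe could be as simple as a loop that polls the Hot listing on a few instances, logs the response latency, and notes when the ordering flips, then I can look for spikes that line up across servers. A rough sketch under those assumptions (v3 API, example instance list and polling interval), not a finished tool:

```python
# Hedged sketch of the probing idea above: poll the post listing on several
# instances, record response latency, and flag when the Hot ordering changes,
# which may line up with scheduled rank-recalculation or federated delete
# batches. Assumes GET /api/v3/post/list; instances and interval are examples.
import time
import requests

INSTANCES = ["https://lemmy.ml", "https://lemmy.world", "https://sh.itjust.works"]
INTERVAL_SECONDS = 60
last_order: dict[str, list[int]] = {}

def probe(instance: str) -> tuple[float, list[int]]:
    """Return (latency in seconds, ordered post ids) for the Hot listing."""
    start = time.monotonic()
    resp = requests.get(
        f"{instance}/api/v3/post/list",
        params={"sort": "Hot", "limit": 20},
        timeout=30,
    )
    latency = time.monotonic() - start
    resp.raise_for_status()
    ids = [p["post"]["id"] for p in resp.json()["posts"]]
    return latency, ids

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for instance in INSTANCES:
        try:
            latency, ids = probe(instance)
        except requests.RequestException as err:
            print(f"{stamp} {instance} ERROR {err}")   # the daily errors, at least logged
            continue
        reordered = last_order.get(instance) not in (None, ids)
        last_order[instance] = ids
        print(f"{stamp} {instance} {latency:.2f}s hot-order-changed={reordered}")
    time.sleep(INTERVAL_SECONDS)
```

Kept to one small page every minute per instance so the probe itself doesn't add meaningfully to the load it is trying to observe.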
Unrelated, but I can never find the name of that community: its display name is "Lemmy App Dev" while the database key is "lemmydev", which is the one I remember.