Media Application Scaling

1

I think client/API self-awareness of performance problems, and of the cost of running a server, could be built into the design as an owner/operator choice.

A search engine should be able to see generic content, posts and listings, without per-user changes.

But Lemmy's SQL logic for PostgreSQL burdens every fetch of posts and comments with all kinds of user-specific customization. This kills caching potential when it's done that far back in the backend.

Page 3 of posts in [email protected] will be different for a user who has blocked a person appearing in that list. Right now, that burden is placed on PostgreSQL, which has to rewrite indexes on every INSERT and perform extra steps on every SELECT.

For massive scale on lower-cost hardware, I suggest moving the problem to a smarter client API that is self-aware of it: page 3 of a community, or of All sorted by hot/active/top-hour, would be the same for every user, and the client would take on the burden of fetching the person block list and doing the filtering. Alternatively, an intermediate front-end component of Lemmy could run on multiple servers, scale out, and do the filtering for that specific user.
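
To make the idea concrete, here is a minimal sketch of the client-side filtering in TypeScript. The `Post` and `BlockLists` shapes are hypothetical stand-ins, not the real lemmy-js-client types:

```ts
// Hypothetical shapes, not the real lemmy-js-client types.
interface Post {
  id: number;
  creatorId: number;
  communityId: number;
  nsfw: boolean;
}

interface BlockLists {
  personIds: Set<number>;
  communityIds: Set<number>;
}

// The server returns the same page 3 to every user, so it can be cached
// once; the client applies the per-user exclusions locally.
function filterPage(posts: Post[], blocks: BlockLists, hideNsfw: boolean): Post[] {
  return posts.filter(
    (p) =>
      !blocks.personIds.has(p.creatorId) &&
      !blocks.communityIds.has(p.communityId) &&
      (!hideNsfw || !p.nsfw)
  );
}
```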

Even paging itself is a cache issue: the page length is already variable per request. Eliminate that, fix the page size, encourage over-fetching of posts and comments, and have the client filter out duplicates. Or go further: fetch only the ID numbers of new/old items, and have a very smart client keep an ID listing of the entire page and fill in content.
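
A sketch of the over-fetch-and-dedupe loop, assuming a fixed server-side page size so every user hits the same cached pages; `fetchPage` and `isVisible` are hypothetical hooks:

```ts
// Keep pulling fixed-size, cacheable pages until enough visible items
// survive local filtering; duplicates across shifting pages are dropped.
async function fetchFiltered<T extends { id: number }>(
  fetchPage: (page: number) => Promise<T[]>,
  isVisible: (item: T) => boolean,
  wanted: number
): Promise<T[]> {
  const seen = new Set<number>();
  const out: T[] = [];
  for (let page = 1; out.length < wanted; page++) {
    const items = await fetchPage(page);
    if (items.length === 0) break; // ran out of content
    for (const item of items) {
      if (seen.has(item.id)) continue; // duplicate across pages
      seen.add(item.id);
      if (isVisible(item)) out.push(item);
    }
  }
  return out.slice(0, wanted);
}
```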

But certainly during heavy server load, when a server is on the verge of crashing from too much data, dropping personal exclusions of communities and persons from post/comment fetches could offload some work from PostgreSQL. Even NSFW filtering might fit into that.

Sorry about my language this morning, sloppy English.

2

Even if it starts out doing one-to-one HTTP connections, the way vote rows are replicated today, this would at least cut down the PostgreSQL storage of non-local instance votes in favor of just one single aggregate row per comment and post (overhead that already exists, even for non-local posts and comments).
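
As a sketch of the single-row idea, an incoming federated vote would bump an aggregate counter instead of inserting its own row. The shapes here are assumptions, and deduplicating repeat votes from the same remote actor would have to happen elsewhere (e.g. at the community's home instance):

```ts
// Hypothetical aggregate row: one per post/comment, instead of one row
// per (voter, target) pair for remote votes.
interface VoteAggregate {
  targetId: number;
  upvotes: number;
  downvotes: number;
}

// Fold an incoming federated vote into the aggregate counters.
function applyRemoteVote(agg: VoteAggregate, score: 1 | -1): VoteAggregate {
  return {
    ...agg,
    upvotes: agg.upvotes + (score === 1 ? 1 : 0),
    downvotes: agg.downvotes + (score === -1 ? 1 : 0),
  };
}
```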

The home instance of a community (which owns the post, which owns the comment) is the only place with the full history of activity, given that Lemmy 0.18.2 has no backfill procedure for activities that happened before an instance's first subscriber to a community.

3

There are also issues lurking in the accumulation of data. A move to a batch processing system should consider that some instance operators may only wish to retain, say, 60 days of fresh content, rather than the entire history of a community for search engines and local searching. The difference in performance is huge, which is part of why popular Lemmy servers have been crashing constantly: the amount of data in the tables entirely changes the performance characteristics.
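
As a sketch of such a retention job using node-postgres: the `post` table and `published` column are assumptions about the schema, and Lemmy has no job like this today:

```ts
import { Client } from "pg";

// Hypothetical retention batch job: delete posts older than the
// operator's retention window, in small batches to avoid long locks.
// Assumes comments go away via ON DELETE CASCADE or a prior pass.
async function pruneOldContent(retentionDays: number): Promise<void> {
  const db = new Client(); // connection settings come from PG* env vars
  await db.connect();
  try {
    const res = await db.query(
      `DELETE FROM post
        WHERE id IN (SELECT id FROM post
                      WHERE published < now() - make_interval(days => $1)
                      LIMIT 1000)`,
      [retentionDays]
    );
    console.log(`pruned ${res.rowCount} posts; rerun until it reports 0`);
  } finally {
    await db.end();
  }
}
```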

Right now, Lemmy has no concept of tiered storage, or of content being absent due to replication gaps or purge choices. Looking from the bottom up, at the API client before anything touches PostgreSQL, a smart caching layer could even proxy to the API of a peer instance and offer a virtual (cached) copy of the data for a listing or post. Such a design could intelligently do this for a small number of requests and avoid burdening PostgreSQL with storing a post from months or years ago that a few people have taken a recent interest in (or that a search engine wants to pull a copy of).
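
A rough sketch of that bottom-up layer: on a cache miss for old content, fetch a virtual copy from a peer instance instead of the local database. The endpoint path mirrors Lemmy's GET /api/v3/post but should be treated as an assumption here, and the in-process Map stands in for a real cache:

```ts
// Read-through cache keyed by post id, with a short TTL.
const cache = new Map<number, { body: unknown; expires: number }>();

async function getPost(id: number, peerHost: string): Promise<unknown> {
  const hit = cache.get(id);
  if (hit && hit.expires > Date.now()) return hit.body;

  // Old/absent content: serve a virtual copy from a peer instance
  // rather than permanently storing it in local PostgreSQL.
  const res = await fetch(`https://${peerHost}/api/v3/post?id=${id}`);
  if (!res.ok) throw new Error(`peer returned ${res.status}`);
  const body = await res.json();
  cache.set(id, { body, expires: Date.now() + 10 * 60 * 1000 }); // 10 min
  return body;
}
```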

4

When I see SQL select statements like this: https://lemmy.ml/comment/793612

I start to think about caching.

  1. Postings with 200 comments or fewer could maybe take an entirely different code path.

  2. Postings that are lively and being updated get a different code path. If it's a fresh topic, with comments going crazy every minute, the code may want to account for that.

  3. Over 200 comments, maybe just dump the whole comment set for that posting out of PostgreSQL into NoSQL or some cache system, then start processing it independently (see the sketch after this list).

  4. Some kind of site self-awareness that the server wants to run on a resource budget or is under major social-event load (such as a huge influx of users discussing an airplane crash)... and is willing to shed timely comment publishing to keep database activity sane. You might be willing to not read new (and edited/deleted) comments for that posting for 2 minutes, just to offload your database.

  5. There may be several front-end "rust code" servers of lemmy_server talking to a single PostgreSQL backend here, and the caching in point 3 above would be duplicated on those mid-tier servers.
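
A sketch of points 3 and 4 combined: comments for a busy posting are held in a cache outside PostgreSQL, and under heavy load the server stretches the staleness window to roughly 2 minutes instead of re-querying. `loadCommentsFromDb` and the load signal are hypothetical:

```ts
interface CommentCacheEntry {
  comments: unknown[];
  fetchedAt: number;
}

const commentCache = new Map<number, CommentCacheEntry>();

async function getComments(
  postId: number,
  loadCommentsFromDb: (postId: number) => Promise<unknown[]>,
  underHeavyLoad: boolean
): Promise<unknown[]> {
  // Normal staleness budget 15s; stretch to 2 minutes when shedding load.
  const maxAgeMs = underHeavyLoad ? 120_000 : 15_000;
  const entry = commentCache.get(postId);
  if (entry && Date.now() - entry.fetchedAt < maxAgeMs) {
    return entry.comments; // serve stale, skip PostgreSQL entirely
  }
  const comments = await loadCommentsFromDb(postId);
  commentCache.set(postId, { comments, fetchedAt: Date.now() });
  return comments;
}
```

Per point 5, each mid-tier lemmy_server process would hold its own copy of this cache unless it were moved into something shared.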

Kind of rambling, but it's on my mind.

5

Based on a lot of personal messaging-app development and SQL database experience back in the low-performance hardware days...

I speculate that the developers of Lemmy have not populated the database with large amounts of test data. It shows a lot of symptoms of a codebase that never had a test-data-generation app that would create, say, 1000 test user accounts, 50 test communities, stuff 1000 random postings into each test community, and then put 10 to 150 random comments of various lengths on each of those postings.
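
A sketch of that kind of generator; the `api` helpers stand in for whatever seeding mechanism (HTTP API or direct SQL) a test instance would expose, and are entirely hypothetical:

```ts
// Seed 1000 users, 50 communities, 1000 posts per community, and
// 10-150 random-length comments per post.
async function seed(api: {
  createUser: (name: string) => Promise<number>;
  createCommunity: (name: string) => Promise<number>;
  createPost: (userId: number, communityId: number, body: string) => Promise<number>;
  createComment: (userId: number, postId: number, body: string) => Promise<void>;
}): Promise<void> {
  const pick = <T>(xs: T[]) => xs[Math.floor(Math.random() * xs.length)];
  const rand = (lo: number, hi: number) => lo + Math.floor(Math.random() * (hi - lo + 1));

  const users: number[] = [];
  for (let i = 0; i < 1000; i++) users.push(await api.createUser(`test_user_${i}`));

  for (let c = 0; c < 50; c++) {
    const community = await api.createCommunity(`test_community_${c}`);
    for (let p = 0; p < 1000; p++) {
      const post = await api.createPost(pick(users), community, "x".repeat(rand(50, 2000)));
      for (let k = 0, n = rand(10, 150); k < n; k++) {
        await api.createComment(pick(users), post, "y".repeat(rand(10, 500)));
      }
    }
  }
}
```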

It looks to me more like it started with an empty SQL database and incrementally grew to a few thousand end-users creating a few posts a day, with a relatively small number of comments on those postings.

Growing pains are normal, but it may help to have an instance out there for performance testing of database query and insert performance... and even to invite end-users to hammer on the test instance at a scheduled time (a testing party) to find where the performance bottlenecks lie.

6

A major problem with a data-centered application like Lemmy is that you can run into cache misses pretty easily.

Denial-of-Service by Requesting Old Postings

It's a denial-of-service possibility that someone can run a web crawler or API crawler over your postings, making the server fetch posts and comments from 3 months ago. Intentional or otherwise, this busies up your server and crowds the database cache with old content.

On a Lemmy/Reddit-style media forum, changes to the data tend to slow way down after the first 48 hours of a posting, as the stream of new postings by end-users moves attention on. This favors having some concept of inactive vs. active posts, and choosing between pre-generated cached output and live database queries of real-time data accordingly.

For inactive postings, you may want caching that lives entirely outside the database and avoids touching it at all: for example, limiting the sort options offered to end-users on inactive posts and storing pre-fetched content directly in server disk files that can be loaded instead of assembled from the live database on every API/app query.
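
A sketch of that disk-file path: rendered output is written once for a quiet post and then served without touching PostgreSQL. The 48-hour threshold, cache directory, and `renderFromDatabase` helper are all assumptions:

```ts
import { promises as fs } from "fs";
import * as path from "path";

const CACHE_DIR = "/var/cache/lemmy-posts"; // hypothetical location
const INACTIVE_AFTER_MS = 48 * 60 * 60 * 1000;

async function servePost(
  postId: number,
  lastActivity: Date,
  renderFromDatabase: (id: number) => Promise<string>
): Promise<string> {
  const inactive = Date.now() - lastActivity.getTime() > INACTIVE_AFTER_MS;
  const file = path.join(CACHE_DIR, `${postId}.json`);

  if (inactive) {
    try {
      return await fs.readFile(file, "utf8"); // database never touched
    } catch {
      // first miss: fall through and populate the file below
    }
  }
  const body = await renderFromDatabase(postId);
  if (inactive) await fs.writeFile(file, body);
  return body;
}
```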

Major Social Events

As a server operator, you might desire fallback behavior during surges. September 11, 2001 was an example where major social media websites and even TV news websites all fell over due to everyone flocking to get news and comment on the event.

Do you want your social media system to show 'internal server error', or to more gracefully go into a read-only mode? In read-only mode you could probably still process logins and render the existing postings and comments, while not allowing votes and comments to flood into the database.
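
A sketch of that fallback as HTTP middleware; Express here is just for illustration, and the login path is an assumption about the API:

```ts
import express from "express";

// Operator-controlled switch; a load monitor could also flip it.
let readOnlyMode = false;
export function setReadOnly(on: boolean): void {
  readOnlyMode = on;
}

const app = express();

// Reads and logins keep working; other writes get a clear 503 instead
// of the whole site falling over.
app.use((req, res, next) => {
  const isWrite = req.method !== "GET";
  const isLogin = req.path === "/api/v3/user/login"; // assumed path
  if (readOnlyMode && isWrite && !isLogin) {
    res.status(503).json({ error: "read_only_mode" });
    return;
  }
  next();
});
```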

Part of the assumption is that a community project like Lemmy doesn't have a huge budget or the ability to spin up dozens of new servers the way Twitter, Reddit, or Facebook can... so building 'fallback' behavior switches into the code for owner/operators may matter more here than it does for the big well-funded players.

NOTE: I have autism and often struggle with my language; this needs some editing. Please excuse me if I'm being confusing or not providing clear enough examples. This is a starting point for expressing an idea rather than a well-written posting.