this post was submitted on 18 Aug 2023
1 points (100.0% liked)

Lemmy Project Priorities Observations


I've raised my voice loudly on meta communities and GitHub, and created the new [email protected] and [email protected] communities.

I feel like the performance problems have been ignored for over 30 days, when there are a half-dozen solutions that could be coded in 5 to 10 hours of labor by one person.

I've been developing client/server messaging apps professionally since 1984, and I firmly believe that Lemmy currently suffers from a lack of testing by the developers and a lack of concern for data loss. A basic e-mail MTA in 1993 would send a "did not deliver" message back to the sender, but Lemmy just drops delivery, and there is no mention of this in the release notes or the introduction on GitHub. I also find that the Lemmy developers do not like to "eat their own dog food" and actually use Lemmy communities to discuss the ongoing development and priorities of Lemmy coding. They are not testing the code and sampling the data very much, and I am posting here, using Lemmy code, as part of my personal testing! I spent over 100 hours in June 2023 testing Lemmy technical problems, especially performance and lost data delivery.

I'll toss it into this echo chamber.

founded 1 year ago
top 13 comments
[–] [email protected] 1 points 1 year ago (2 children)

If you read enough EXPLAIN ANALYZE output from PostgreSQL you start to get some ideas...

PostgreSQL makes estimates of queries before it dives into it. The same concept could be applied to Lemmy sorting orders and page numbers.

The whole project seems stuck on fringe cases. The reality is that the servers are running slow and overloading with routine fetches of Hot and Active post listings on page 1, 2, or 3. Sure, some of the apps have speed scrolling to get to page 5 quickly... but there is too much concern out there for how page 970 behaves with a 5-month-old posting... and the servers are crashing. Did I mention that the servers have been crashing nonstop since I came around in May?

Today I'm going to experiment with doing what PostgreSQL does internally before picking an INDEX or a TABLE scan... giving the query some hints about where to go. For every sort order, there is a natural way to approach routine conventional paging... the trick is how to apply that to both a single community and a blend of many communities.

[–] [email protected] 1 points 1 year ago (1 children)

Brainstorming....

We know the sort order and the page number we are loading before we start building any SQL statement at all. Regardless of any customized user preferences, these are foundational in terms of filtering out data.

For /c/All - page number is useful. It is when you get into /c/Subscribed with a blend of communities that the trouble begins.

There is also the behavior people might expect with "Top 1 hour" and running out of posts to read. On that specific choice, they are likely to accept it... but with Hot and Active....

So what exactly are Hot and Active... and how far back in time do they go?
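The "we already know the sort and page before building SQL" idea could be sketched roughly like this. Everything below is hypothetical: the `SORT_WINDOWS` values and the `published_cutoff_hint` helper are illustrative inventions, not Lemmy code. The returned window would become a `published > now() - window` filter on the post_aggregates query:

```python
from datetime import timedelta

# Hypothetical sketch (not Lemmy code): since the sort order and page number
# are known before any SQL is built, a time-window hint could bound the scan.
# The window sizes here are illustrative guesses.
SORT_WINDOWS = {
    "Hot": timedelta(days=2),
    "Active": timedelta(days=7),
    "New": None,  # an index on published already serves New; no hint needed
}

def published_cutoff_hint(sort: str, page: int):
    """Return how far back in time page N of this sort should scan,
    or None when no hint applies. Deeper pages widen the window."""
    window = SORT_WINDOWS.get(sort)
    if window is None:
        return None
    return window * (2 ** (page - 1))  # crude doubling per page
```

The doubling is just one monotonic widening scheme; the point is that early pages, which are the routine load, would never touch months-old rows.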

[–] [email protected] 1 points 1 year ago (1 children)

For post listing, these are the three that aren't simple to grasp:

      SortType::Active => query
        .then_order_by(post_aggregates::hot_rank_active.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::Hot => query
        .then_order_by(post_aggregates::hot_rank.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::Controversial => query.then_order_by(post_aggregates::controversy_rank.desc()),
[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (2 children)
    hot_rank: 1728,
    hot_rank_active: 1728,

  -- Note: 1728 is the result of the hot_rank function, with a score of 1, posted now
  -- hot_rank = 10000*log10(1 + 3)/Power(2, 1.8)
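That arithmetic can be checked outside PostgreSQL. A minimal Python port (assuming the hot_rank() SQL function quoted later in this thread: floor of 10000 * log10(greatest(1, score + 3)) / (hours + 2)^1.8):

```python
import math

def hot_rank(score, hours_since_published):
    # assumed Python port of Lemmy's hot_rank() SQL function
    if hours_since_published <= 0:
        return 0
    return math.floor(
        10000 * math.log10(max(1, score + 3))
        / (hours_since_published + 2) ** 1.8
    )

# Score of 1, posted "now" (a hair in the past so hours_diff > 0):
hot_rank(1, 1e-9)  # -> 1728, the default value noted above
```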
[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)
lemmy@lemmy_alpha LOG:  duration: 50.220 ms  execute : WITH batch AS (SELECT a.id
	               FROM post_aggregates a
	               WHERE a.published > $1 AND (a.hot_rank != 0 OR a.hot_rank_active != 0)
	               ORDER BY a.published
	               LIMIT $2
	               FOR UPDATE SKIP LOCKED)
	         UPDATE post_aggregates a SET hot_rank = hot_rank(a.score, a.published),
	         hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)
	             FROM batch WHERE a.id = batch.id RETURNING a.published;
	    
2023-08-18 09:00:34.578 MST [1877420] lemmy@lemmy_alpha DETAIL: 
 parameters: $1 = '2023-08-16 23:40:31.149267', $2 = '1000'

[–] [email protected] 1 points 1 year ago
CREATE FUNCTION public.hot_rank(score numeric, published timestamp without time zone) RETURNS integer
    LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE
    AS $$
DECLARE
    hours_diff numeric := EXTRACT(EPOCH FROM (timezone('utc', now()) - published)) / 3600;
BEGIN
    IF (hours_diff > 0) THEN
        RETURN floor(10000 * log(greatest (1, score + 3)) / power((hours_diff + 2), 1.8))::integer;
    ELSE
        RETURN 0;
    END IF;
END;
$$;

[–] [email protected] 1 points 1 year ago

Difference between hot_rank and hot_rank_active

SET hot_rank = hot_rank(a.score, a.published),
        hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)
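Both columns come from the same formula fed different timestamps, which is the whole difference: a fresh comment can revive hot_rank_active on a post whose hot_rank has already decayed. A rough Python illustration (assuming a faithful port of the hot_rank() SQL function shown earlier in the thread):

```python
import math

def hot_rank(score, hours_since):
    # assumed Python port of Lemmy's hot_rank() SQL function
    if hours_since <= 0:
        return 0
    return math.floor(10000 * math.log10(max(1, score + 3))
                      / (hours_since + 2) ** 1.8)

# A 3-day-old post with score 250, ranked from its published time:
from_published = hot_rank(250, 72)   # -> 10
# The same post ranked from a comment posted one hour ago:
from_comment = hot_rank(250, 1)      # -> 3326
```

So an old post with a busy comment thread sorts near the top of Active while having long since fallen out of Hot.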

[–] [email protected] 1 points 1 year ago

Brainstorming...

Every combination of community block and subscribe is too much....

But can we leverage community_aggregates? With all the overhead of updating it on every ++ of post and ++ of comment... can we just write a hint of where page 20 of each sort may fall?

Each community could set a dirty flag on community_aggregates update. Count alone can't be relied upon, because it also decrements.

The hierarchy is important to consider. Every comment is owned by a post, every post is owned by a community, every community is owned by an instance. For Reddit, the user is secondary, and their comments and posts traditionally can be linked to and read even if the user deletes their account. For Lemmy, removal of content was an early design priority, but I think that harms search-engine users and goes against the spirit of what a public forum is about (yes, everyone wants to turn it into a private playland, ignoring that hackers break into major websites with full-time security teams - and Lemmy instances being operated by unknown strangers isn't exactly something users should count on).

Ok, back on topic. With all the overhead currently in place to update community_aggregates on each new comment and post, it makes sense to try and filter out the cruft of 5000 communities that may be stale yet bog down every PostgreSQL query for posts from the most recent hour. I don't think I've seen a single mention by anyone analyzing the situation that lemmy.world having over 9,000 local communities is a huge factor - Lemmy was sort of built for "20 communities" in 2020.

[–] [email protected] 1 points 1 year ago
CREATE OR REPLACE FUNCTION post_aggregates_score ()
    RETURNS TRIGGER
    LANGUAGE plpgsql
    AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        UPDATE
            post_aggregates pa
        SET
            score = score + NEW.score,
            upvotes = CASE WHEN NEW.score = 1 THEN
                upvotes + 1
            ELSE
                upvotes
            END,
            downvotes = CASE WHEN NEW.score = - 1 THEN
                downvotes + 1
            ELSE
                downvotes
            END,
            controversy_rank = controversy_rank (pa.upvotes + CASE WHEN NEW.score = 1 THEN
                    1
                ELSE
                    0
                END::numeric, pa.downvotes + CASE WHEN NEW.score = - 1 THEN
                    1
                ELSE
                    0
                END::numeric)
        WHERE
            pa.post_id = NEW.post_id;
    ELSIF (TG_OP = 'DELETE') THEN
        -- Join to post because that post may not exist anymore
        UPDATE
            post_aggregates pa
        SET
            score = score - OLD.score,
            upvotes = CASE WHEN OLD.score = 1 THEN
                upvotes - 1
            ELSE
                upvotes
            END,
            downvotes = CASE WHEN OLD.score = - 1 THEN
                downvotes - 1
            ELSE
                downvotes
            END,
            controversy_rank = controversy_rank (pa.upvotes - CASE WHEN OLD.score = 1 THEN
                    1
                ELSE
                    0
                END::numeric, pa.downvotes - CASE WHEN OLD.score = - 1 THEN
                    1
                ELSE
                    0
                END::numeric)
        FROM
            post p
        WHERE
            pa.post_id = p.id
            AND pa.post_id = OLD.post_id;
    END IF;
    RETURN NULL;
END
$$;
[–] [email protected] 1 points 1 year ago

It is interesting to ponder why Reddit had a 100-subreddit limit on multi-reddits and a similar 100-subreddit limit on /r/All blocking, while the subscribed home page had no such limit. What performance problem would a limit of 100 solve if you had to render home pages with 2500 subreddits anyway?

[–] [email protected] 1 points 1 year ago
 hot_rank_now_minus1second_score_250 |        current_now         
-------------------------------------+----------------------------
                                6899 | 2023-08-18 19:02:54.807922
(1 row)

 hot_rank_now_minus1day_score_250 |        current_now         
----------------------------------+----------------------------
                               68 | 2023-08-17 19:02:55.808032
(1 row)

 hot_rank_now_minus2day_score_250 |        current_now         
----------------------------------+----------------------------
                               21 | 2023-08-16 19:02:55.808132
(1 row)

 hot_rank_now_minus3day_score_250 |        current_now        
----------------------------------+---------------------------
                               10 | 2023-08-15 19:02:55.80823
(1 row)

 hot_rank_now_minus6day_score_250 |        current_now         
----------------------------------+----------------------------
                                3 | 2023-08-12 19:02:55.808326
(1 row)

 hot_rank_now_minus9day_score_250 |        current_now         
----------------------------------+----------------------------
                                1 | 2023-08-09 19:02:55.808419
(1 row)

 hot_rank_now_minus12day_score_250 |        current_now         
-----------------------------------+----------------------------
                                 0 | 2023-08-06 19:02:55.808512
(1 row)

 hot_rank_now_minus22day_score_250 |        current_now         
-----------------------------------+----------------------------
                                 0 | 2023-07-27 19:02:55.808606
(1 row)

from executing this:

-- when does hot_rank go flat like old soda?

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '1 second')) AS hot_rank_now_minus1second_score_250, timezone('utc', now() - interval '1 second') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '1 day')) AS hot_rank_now_minus1day_score_250, timezone('utc', now() - interval '1 day') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '2 day')) AS hot_rank_now_minus2day_score_250, timezone('utc', now() - interval '2 day') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '3 day')) AS hot_rank_now_minus3day_score_250, timezone('utc', now() - interval '3 day') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '6 day')) AS hot_rank_now_minus6day_score_250, timezone('utc', now() - interval '6 day') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '9 day')) AS hot_rank_now_minus9day_score_250, timezone('utc', now() - interval '9 day') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '12 day')) AS hot_rank_now_minus12day_score_250, timezone('utc', now() - interval '12 day') AS current_now;

SELECT hot_rank(250::numeric, timezone('utc', now() - interval '22 day')) AS hot_rank_now_minus22day_score_250, timezone('utc', now() - interval '22 day') AS current_now;
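The table above can be reproduced without a database. Running an assumed Python port of hot_rank() (matching the SQL function quoted earlier in the thread) over the same time offsets gives the same decay curve:

```python
import math

def hot_rank(score, hours_since):
    # assumed Python port of the hot_rank() SQL function shown earlier
    if hours_since <= 0:
        return 0
    return math.floor(10000 * math.log10(max(1, score + 3))
                      / (hours_since + 2) ** 1.8)

# same offsets as the SELECTs above, in hours
offsets = [1 / 3600, 24, 48, 72, 144, 216, 288, 528]
ranks = [hot_rank(250, h) for h in offsets]
# -> [6899, 68, 21, 10, 3, 1, 0, 0]: flat as old soda by day 12
```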
[–] [email protected] 1 points 1 year ago (1 children)
    query = match options.sort.unwrap_or(SortType::Hot) {
      SortType::Active => query
        .then_order_by(post_aggregates::hot_rank_active.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::Hot => query
        .then_order_by(post_aggregates::hot_rank.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::Controversial => query.then_order_by(post_aggregates::controversy_rank.desc()),
      SortType::New => query.then_order_by(post_aggregates::published.desc()),
      SortType::Old => query.then_order_by(post_aggregates::published.asc()),
      SortType::NewComments => query.then_order_by(post_aggregates::newest_comment_time.desc()),
      SortType::MostComments => query
        .then_order_by(post_aggregates::comments.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopAll => query
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopYear => query
        .filter(post_aggregates::published.gt(now - 1.years()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopMonth => query
        .filter(post_aggregates::published.gt(now - 1.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopWeek => query
        .filter(post_aggregates::published.gt(now - 1.weeks()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopDay => query
        .filter(post_aggregates::published.gt(now - 1.days()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopHour => query
        .filter(post_aggregates::published.gt(now - 1.hours()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopSixHour => query
        .filter(post_aggregates::published.gt(now - 6.hours()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopTwelveHour => query
        .filter(post_aggregates::published.gt(now - 12.hours()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopThreeMonths => query
        .filter(post_aggregates::published.gt(now - 3.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopSixMonths => query
        .filter(post_aggregates::published.gt(now - 6.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopNineMonths => query
        .filter(post_aggregates::published.gt(now - 9.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
    };
[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

With performance enhancement in mind....

When listing a single community, the filtering logic should be more open-ended, especially if the community has little content.

Changes:

  1. Controversial should be consistent with Hot and Active: when it runs out of ranked data, it falls back to published DESC.
  2. This makes "published DESC" a consistent sorting pattern for every situation except "Old" and "NewComments".
    query = match options.sort.unwrap_or(SortType::Hot) {
      SortType::Active => query
        .then_order_by(post_aggregates::hot_rank_active.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::Hot => query
        .then_order_by(post_aggregates::hot_rank.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::Controversial => query
        .then_order_by(post_aggregates::controversy_rank.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::New => query.then_order_by(post_aggregates::published.desc()),
      SortType::Old => query.then_order_by(post_aggregates::published.asc()),
      SortType::NewComments => query.then_order_by(post_aggregates::newest_comment_time.desc()),
      SortType::MostComments => query
        .then_order_by(post_aggregates::comments.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopAll => query
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopYear => query
        .filter(post_aggregates::published.gt(now - 1.years()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopMonth => query
        .filter(post_aggregates::published.gt(now - 1.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopWeek => query
        .filter(post_aggregates::published.gt(now - 1.weeks()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopDay => query
        .filter(post_aggregates::published.gt(now - 1.days()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopHour => query
        .filter(post_aggregates::published.gt(now - 1.hours()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopSixHour => query
        .filter(post_aggregates::published.gt(now - 6.hours()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopTwelveHour => query
        .filter(post_aggregates::published.gt(now - 12.hours()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopThreeMonths => query
        .filter(post_aggregates::published.gt(now - 3.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopSixMonths => query
        .filter(post_aggregates::published.gt(now - 6.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
      SortType::TopNineMonths => query
        .filter(post_aggregates::published.gt(now - 9.months()))
        .then_order_by(post_aggregates::score.desc())
        .then_order_by(post_aggregates::published.desc()),
    };