About Lemmit


About the lemmit.online service and its software.


Got questions, complaints, suggestions? This is the place.

founded 1 year ago
1
7
submitted 1 year ago* (last edited 11 months ago) by [email protected] to c/[email protected]
 
 

In the short time since this instance and bot launched, I've been seeing the same questions resurface multiple times. This is totally understandable, since the concept of a Fediverse is still new to most (myself included), and this server is not like the others.

Q: What is Lemmit?

A: Lemmit is a Lemmy instance specifically designed for archiving Reddit content. Users can request new subreddits to be included in the archiving process by posting in the [email protected] community. It is powered by an open-source Python bot, which periodically checks the request list, adds new requests to the queue, and continuously monitors the Hot feed of those subs for new posts to cross-post here.

Q: Does it synchronize comments?

A: No, that would be impossible. Considering there are thousands of posts already on Lemmit, many of them having at least several hundred comments on Reddit, often buried in deep layers, it simply wouldn't be feasible to index those for more than a few posts, let alone keep them up to date.

Unfortunately, this means that archiving certain subreddits, such as Ask Historians/Men/Women/Hyperintelligentshadesofthecolourblue-type subs, is going to be rather pointless.

Q: Can it send comments back to Reddit?

A: No, it cannot. The purpose is to help bootstrap the Lemmy platform, not to serve as a bridge between the two networks. Also, see the answer about synchronizing comments.

Q: Can I request any subreddit?

A: Technically, yes. However, as the list of subs grows, the time it takes to update all of them will also increase. I do not have strict guidelines in place for this, so I'm relying on your common sense (hoooo boy). At some point, I will probably have to either stop accepting new requests or disable scraping for very low-traffic communities.

Q: Does this use the API? Will it keep working after July 1st?

A: Nope, it uses a combination of the public feed and scraping old.reddit.com. So, as long as those are still available, it will continue working. And even if they close those sources, there will probably be new ways to achieve the same effect. "Content, eh, finds a way."

Q: This is spam, can you stop?

A: First of all, I apologise for the inconvenience. All you have to do is block @[email protected], and none of its posts will ever show up on your instance. If you don't want anyone else on your server to be exposed to this bot/instance, you should convince your admin to defederate from lemmit.online. Since there are no other users on here, there will be no harm done.

Obviously I could stop, because running this server and software is only ever going to cost me time and money. But for the reasons listed above, I still think this server is a useful addition to the lemmyverse at this time. But I'm looking forward to the day where I can turn the bot off because it's no longer needed.

Q: What started this?

A: Okay, nobody asked this, but I'm going to tell you anyway. After Reddit made it clear that they are effectively killing third-party apps and implementing plenty of other anti-end user decisions, I realized that I would either have to accept not being able to access my time-wasting content or have to do so in a rather uncomfortable way (either through the official app or old.reddit.com for as long as they'll allow it to exist).

Being a stubborn developer, naturally, I chose option C: Have my own Reddit. With blackjack, and hookers. This way, I would still be able to access my beloved content without being beholden to Reddit's mood swings and abusive relationship tendencies.

Besides that, I also know that Content is King. So, in order to counter the network effect (no users because no content, no content because no users), I figured it would be better to have some inorganic content to bootstrap the adoption of Lemmy.

Q: Are NSFW subreddits allowed?

A: Absolutely. Like I said: Blackjack and hookers.

Q: My request isn't picked up by the bot!

A: That isn't a question. But yeah, the process isn't flawless yet. I'm trying to iron out all the bugs as I encounter them. In the meantime, feel free to re-request the subreddit by making a second post. No harm done.

Q: No new posts are showing up at all on Lemmit

A: If no posts are appearing on the Lemmit Frontpage (sorted by NEW), it's possible that the bot has crashed or is stuck on something. Since no software is flawless, this sometimes happens. I usually fix this as soon as I'm aware, and I'm happy to say that these kinds of fatal errors are becoming less and less frequent. However, they may still occur, and as a human who needs sleep and has other responsibilities, I'm not always able to fix them immediately.

Q: Posts aren't showing up on my instance, what's up?

A: Due to the spammy nature of the bot, some server admins choose to block this server, and that is completely understandable. So first of all, make sure to check the instances link in the footer of your home server. If Lemmit is on the Blocked Instances list, you're out of luck.

When you have verified that Lemmit is not blocked on your instance, try unsubscribing, waiting a little, and then re-subscribing. That tends to fix things.

2
 
 
3
 
 

I'd like to hear some feedback on this, or approach vectors.

Right now the bot is rather spammy. I was hoping that by using Reddit's Hot feed, it would have some level of quality control (I know, right?). Unfortunately, it seems that in most cases, it will just return anything that's new. The downside of this is that a lot of garbage gets through, and the bot spends a lot of time scraping the underlying page to get the details.

I propose to only archive reddit posts that have a karma score of 5 or higher. In case of subs that hide the karma scores of posts for a certain time, they'd have to be at least 2 hours old, so that the Reddit moderators can weed out garbage on our behalf.
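The proposed filter could look roughly like this. This is only a sketch of the rule as described above, not code from the bot: the function name `should_archive` and the convention that a hidden score is passed as `None` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_SCORE = 5                          # "karma score of 5 or higher"
MIN_AGE_HIDDEN = timedelta(hours=2)    # "at least 2 hours old" when scores are hidden


def should_archive(score: Optional[int], created: datetime, now: datetime) -> bool:
    """Decide whether a Reddit post passes the proposed quality filter.

    `score` is None when the subreddit still hides the post's score
    (an assumption about how a scraper would represent that case).
    """
    if score is None:
        # Score not visible yet: fall back to the age requirement.
        return now - created >= MIN_AGE_HIDDEN
    return score >= MIN_SCORE


if __name__ == "__main__":
    now = datetime(2023, 7, 1, 12, 0, tzinfo=timezone.utc)
    print(should_archive(7, now - timedelta(minutes=10), now))
    print(should_archive(None, now - timedelta(minutes=10), now))
```

Treating the hidden-score case separately keeps young posts in those subs from being rejected just because no number is visible yet.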

Do you folks have any thoughts on this?

Secondly, I want to put sticky comments on each community, with links to native Lemmy communities that cover the same subject. For this I would need some kind of API, or a master list of... oh, I see sub.rehab has just the thing I need. So expect that somewhere this week :).

4
 
 

So I replied to a comment in a thread the bot posted over in one of the television communities, but I noticed tonight when I was viewing that community in Memmy all of the threads the bot was posting appeared to be empty. Just the title, and the info about where the original post came from and who posted it, but otherwise, pretty useless. I wondered if it was a mistake from the bot or the client I was using, so I tried it over on wefwef, and I saw the links there.

So I guess I don’t know if there’s a bug in how Memmy displays the content, or a bug with how the bot posts, or if it’s just an inconsistency or whatever, but I thought someone might want to know to make some changes.

5
1
submitted 11 months ago* (last edited 11 months ago) by [email protected] to c/[email protected]
 
 

See you on the other side!


So the update is done, but the bot was offline for 6 hours, and needed to catch up.

Unfortunately, another update slipped through, which switched the default feed from www.reddit.com to old.reddit.com, which has the side effect of changing all the urls in the posts as well. On one hand this is great, because new reddit sucks. On the other hand, this is terrible, because for every post the bot encounters, it checks if it already exists on lemmit... based on the url.

So for every post the bot encountered, it went like "old.reddit.com/r/blabla/123? Haven't seen that one yet, there's a www.reddit.com/r/blabla/123, but that must be something completely different, let's post it again!"
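One way to avoid this class of duplicate is to compare URLs in a canonical form rather than as raw strings. The sketch below is an illustration, not the fix that was actually deployed; `canonical_reddit_url` and the set of host names are assumptions for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Host names treated as "the same Reddit" (assumed list for this sketch).
REDDIT_HOSTS = {"reddit.com", "www.reddit.com", "old.reddit.com"}


def canonical_reddit_url(url: str) -> str:
    """Map www/old Reddit URLs for the same post onto one comparable string."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in REDDIT_HOSTS:
        return url  # leave non-Reddit links untouched
    # Use one fixed host, drop the trailing slash, query string and fragment.
    return urlunsplit(("https", "old.reddit.com", parts.path.rstrip("/"), "", ""))


if __name__ == "__main__":
    seen = {canonical_reddit_url("https://www.reddit.com/r/blabla/123")}
    candidate = canonical_reddit_url("https://old.reddit.com/r/blabla/123/")
    print(candidate in seen)  # the old.reddit.com link is recognised as a repost
```

With a check like this, switching the default feed between hosts would no longer change the key used to detect existing posts.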

This also meant that the bot took over a minute and a half to update each community because it takes a couple of seconds per post. When I went to bed last night, I figured it was just posting a lot of content because it had so much catching up to do. But this morning I figured something was off because it still hadn't caught up.

Anyway, the fix is out now. Sorry for all the duplicates. I need coffee now.

6
 
 

ChatGPT, write a post for the stuff that I have in my head and want to get out as an update.

Hmm. No brain implant yet. Guess I'll have to write this the hard way.

Syncing update

It has been an eventful week. I successfully deployed the initial version of smarter content syncing, and have made some adjustments to the algorithm since then. Most notably, communities with only 1 subscriber (the bot) will no longer receive updates, and communities with fewer than 5 subscribers or with a low posting frequency will only be updated twice a day. Furthermore, for the highest update priority (every 10 minutes), a community must have a minimum of 50 subscribers. Implementation details can be found in the decide_interval() method over here.
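For readers who don't want to dig through the source, the rules above can be sketched as follows. This is not the real decide_interval() method: the name `decide_interval_sketch`, the "low posting frequency" cutoff of 1 post per day, and the single 120-minute fallback for everything in between are assumptions, and 50 subscribers is treated here as sufficient for the top tier although the post only states it as a minimum.

```python
from typing import Optional

MIN_SUBSCRIBERS_TOP_TIER = 50  # from the post
SMALL_COMMUNITY = 5            # from the post
LOW_POSTS_PER_DAY = 1.0        # assumption: the post gives no number
TOP_INTERVAL = 10              # minutes, from the post
TWICE_A_DAY = 720              # minutes, "twice a day"
MIDDLE_INTERVAL = 120          # placeholder for the unspecified middle tiers


def decide_interval_sketch(subscribers: int, posts_per_day: float) -> Optional[int]:
    """Return the sync interval in minutes, or None for 'do not sync'."""
    if subscribers <= 1:
        # Only the bot is subscribed: no updates at all.
        return None
    if subscribers < SMALL_COMMUNITY or posts_per_day < LOW_POSTS_PER_DAY:
        return TWICE_A_DAY
    if subscribers >= MIN_SUBSCRIBERS_TOP_TIER:
        return TOP_INTERVAL
    return MIDDLE_INTERVAL


if __name__ == "__main__":
    for subs, ppd in [(1, 40.0), (3, 40.0), (20, 0.2), (20, 12.0), (80, 12.0)]:
        print(subs, ppd, decide_interval_sketch(subs, ppd))
```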

Being a developer is fun

Meanwhile... Damnit, bot is stuck again.

2023-07-08 10:13:39,945 - utils.syncer - INFO - Scraping subreddit: bustynaturals. Last time  2:30:48 ago, interval 120 minutes
2023-07-08 10:13:40,653 - utils.syncer - INFO - 'latina bodies are the best' at https://www.reddit.com/r/BustyNaturals/comments/14twww8/latina_bodies_are_the_best/ updated: 2023-07-08 07:14:13+00:00
2023-07-08 10:13:45,324 - utils.syncer - ERROR - Error trying to retrieve post details, try again in a bit; Couldn't retrieve post detail page
2023-07-08 10:13:46,333 - utils.syncer - INFO - Scraping subreddit: bustynaturals. Last time  2:30:54 ago, interval 120 minutes
2023-07-08 10:13:48,581 - utils.syncer - INFO - 'latina bodies are the best' at https://www.reddit.com/r/BustyNaturals/comments/14twww8/latina_bodies_are_the_best/ updated: 2023-07-08 07:14:13+00:00
2023-07-08 10:13:51,227 - utils.syncer - ERROR - Error trying to retrieve post details, try again in a bit; Couldn't retrieve post detail page
...

1 bugfix and deployment later:

2023-07-08 10:46:42,836 - utils.syncer - INFO - Scraping subreddit: bustynaturals. Last time  3:03:51 ago, interval 120 minutes
2023-07-08 10:46:43,573 - utils.syncer - INFO - 'latina bodies are the best' at https://www.reddit.com/r/BustyNaturals/comments/14twww8/latina_bodies_are_the_best/ updated: 2023-07-08 07:14:13+00:00
2023-07-08 10:46:48,327 - utils.syncer - ERROR - Couldn't find post on https://old.reddit.com/r/BustyNaturals/comments/14told8/latina_bodies_are_the_best/, skipping.
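The logs suggest the bot kept retrying the same post forever, and that after the fix it skips a post whose detail page cannot be found. A minimal sketch of that "skip instead of loop" behaviour is shown below; `sync_posts` and the `fetch_details` callback are hypothetical names, and this is not the actual patch.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def sync_posts(
    urls: Iterable[str],
    fetch_details: Callable[[str], Optional[dict]],
) -> Tuple[List[dict], List[str]]:
    """Fetch details for each post once; skip posts that cannot be retrieved.

    `fetch_details` is a stand-in for the scraper and returns None on failure.
    """
    synced: List[dict] = []
    skipped: List[str] = []
    for url in urls:
        details = fetch_details(url)
        if details is None:
            # Do not restart the whole subreddit: note the failure and move on.
            skipped.append(url)
            continue
        synced.append(details)
    return synced, skipped


if __name__ == "__main__":
    pages = {"https://old.reddit.com/r/a/1": {"title": "first"}}
    ok, bad = sync_posts(
        ["https://old.reddit.com/r/a/1", "https://old.reddit.com/r/a/2"],
        pages.get,
    )
    print(ok, bad)
```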

Defederation

Meanwhile, the folks at https://lemmy.world reached out to me to tell me they're defederating Lemmit. They are not fond of the high volume of posts made by the bot, and the fact that there are now (quick check) 462 communities on this server all being moderated by a single person. They have already received a couple of complaints about spam, and it didn't help that some requests for NSFW subreddits were not marked as NSFW. Occasionally, those subreddits had explicit thumbnails that appeared in the 'All feed' without warning.

I had a good talk with the LemmyWorld admin, wherein they explained their point of view, and I explained mine. I understand their decision to disassociate with Lemmit, and appreciate their attempt to contact me. Other instances like Beehaw, and some smaller ones have also reached the same decision.

This does mean that you will no longer be able to get new community updates on those servers. So make sure to check the blocked instances list on your home server if you were subscribed to Lemmit. At the same time I have removed all the subscriptions of users from those servers, in order to not affect the sync priority mentioned above. This does mean that if LemmyWorld, Beehaw, etc. ever decide to connect to Lemmit again (however unlikely), you will need to un- and re-subscribe from there.

Meanwhile, I've added a feature in the bot that will remove request posts for NSFW subreddits, if the post itself is not marked for NSFW. This should prevent explicit thumbnails showing up where they are not wanted.

Server growth

Last night I got an alert from my server monitoring that the disk is 80% full. Unfortunately, the disk is only 60 GB, so that doesn't leave much room for expansion. On the bright side, a good chunk of that is from Lemmy's very verbose logging (like, 4 GB a day, which gets cleaned up daily), so it should last throughout the weekend if I tune that down. Furthermore, most of the storage growth is from pictrs, the image upload part of Lemmy, and that can utilize an S3 bucket, rather than using the VM's storage like it is now. Using an S3 bucket offers a cost-efficient solution for expanding storage. Initial estimates indicate a monthly cost of around $5 for 1000 GB of storage, which should be sufficient for a while *fingers crossed*.

In the early days of Lemmit (literally, as the server is less than a month old) image uploads were limited to a default setting, which was something around 40 megabytes. That did add up quickly (thanks to half-minute porn gifs), and so I had to limit the max filesize to 1 MB, and later 0.5 MB. Once the server has switched to S3 storage, I can probably up that limit a little, although not too much.

Finally, Lemmy v0.18.1 has been released, and it contains even more performance boosts compared to v0.18.0, so if there's time left this weekend (and I can verify the Lemmit Bot is compatible), I will probably perform the upgrade.

7
 
 

In particular, posts to NSFW videos hosted on v.redd.it don't work on the www version. The links take you to the comments page, which blocks NSFW content, and nags you to go to the app.

old.reddit.com links just work without logging in.

8
 
 

See the bot in action here!

My instance running Leddit

Click here for a more detailed explanation about the bot's purpose

This bot is intended to be self-hosted. Unfortunately, I can't operate a public instance that takes subreddit requests because of how long syncing comments takes. For comparison, Lemmit takes 21 minutes to sync all of the subreddits on this instance using the old system, but Leddit takes the same amount of time to sync 3 subreddits with around 500k subscribers each once an hour. Smart syncing is planned, but it won't decrease the amount of time taken to sync big and active subreddits.

If you need help setting up an instance, feel free to ask questions in this thread or on the Leddit instance's community.

9
 
 

I want to follow many of the NSFW subreddit-communities from this account and probably request more, but the communities page doesn't show NSFW when not logged in (making an account on lemmit itself just to see the list seems like overkill and could be confusing about what's logged in where, if it's even allowed)

10
4
submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 
 

Okay, this one took me a bit longer than I planned (mostly due to sql fun and trying to use integers as minutes, WEEEE!).

Backdrop: Last week I disabled the mirroring of a couple of subreddits to the database, because they were initially requested but nobody subscribed to them. At the same time, the bot was just crawling in a loop, starting at todayilearned, ending at latestsubreddit. As more subreddits were requested, this loop took longer and longer (21 minutes before I rolled out this update). This wasn't sustainable.

So here's the new situation. The more popular a community is, the more often it will be updated. In this case popular means a mixture between number of subscribers and the amount of posts it receives per day (Link to relevant snippet of source code).

In short, the most popular subs will be synced every 10 minutes, the next tiers every 30, 60, and 120 minutes, and communities with either no posts per day or no subscribers (other than the bot) will only be synced every 12 hours. I hope this will hit a good distribution of updates vs popularity, but it will most likely be refined at some point in the future.

Speaking of distribution, we now have over 300 communities on this server 🥳, and their update intervals are spread out as such:

  • Every 10 minutes: 22
  • Every 30 minutes: 39
  • Every 60 minutes: 55
  • Every 120 minutes: 143
  • Every 720 minutes: 44
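With per-community intervals like these, each pass only needs to pick the communities whose interval has elapsed since their last sync. The following is a rough sketch of that check, assuming the last sync time and the interval in minutes are available per community; `is_due` and `due_communities` are illustrative names, not functions from the bot.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def is_due(last_synced: datetime, interval_minutes: int, now: datetime) -> bool:
    """True when at least `interval_minutes` have passed since the last sync."""
    # Convert the integer interval to a timedelta before comparing.
    return now - last_synced >= timedelta(minutes=interval_minutes)


def due_communities(
    last_synced: Dict[str, datetime],
    intervals: Dict[str, int],
    now: datetime,
) -> List[str]:
    """Return the names of communities that should be synced in this pass."""
    return [
        name
        for name, synced_at in last_synced.items()
        if is_due(synced_at, intervals[name], now)
    ]


if __name__ == "__main__":
    now = datetime(2023, 7, 8, 10, 0, 0)
    last = {"busy": now - timedelta(minutes=12), "quiet": now - timedelta(hours=3)}
    print(due_communities(last, {"busy": 10, "quiet": 720}, now))
```

Doing the minutes-to-duration conversion in one place avoids mixing raw integers with timestamps.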

With this update running live (I started typing after I deployed it, and it has now gotten through the backlog of 'abandoned' subs), I'm going to step back from feature development for a few days. Any bugs that cause the bot to crash will of course continue to be addressed.

Have a blast!

11
 
 

Just as a disclaimer: I'm not complaining, it's great that all this exists at all :)

I'm just checking if this may be a bug or not: if I compare https://www.reddit.com/r/DotA2/new/ and https://lemmit.online/c/dota2?dataType=Post&page=1&sort=New then the bot is definitely skipping some posts.

For example, between "NothingToSay is called 'responsibility god' in CN Dota2 community." and "TIP: Medusa doesn’t reduce magic damage to her mana when BKB is up" there are 3 other posts on reddit directly, which are missing on lemmit.online

I'm just hoping this can be fixed since this bot makes populating the "real" dota2 community I moderate much easier, but some posts I want to cross-post are missing so I need to do shit manually.

12
 
 

Have you considered doing something similar for Mastodon, to allow interacting with toots within Lemmy UI? I know the opposite is possible and Kbin also has some kind of integration, but that doesn't seem to fully work at the moment either.

13
 
 

For example, it'd be nice for anyone that looks at/finds https://lemmit.online/c/dota2 to also find that https://discuss.tchncs.de/c/dota2 or https://lemmy.world/c/dota2 is an actual community that corresponds to that with user content, not bot content.

I'm sure there's lots of equivalents for other communities as well where that would make sense.

14
 
 

Before, it was running on the cheapest model (1 core / 1GB mem / 30GB storage) at $12/month. The machine was running pretty low on memory, causing it to start swapping, which in turn caused the cpu to get too busy, and everything to slow down.

Now it has a whopping 2GB of memory, and things seem to have calmed down - cpu is back to around 10-15% usage, and swap is down to 0. Happy times all around.

Because of the amount of subs being archived, it now takes about 15 minutes between updates for each sub (was 18 before I updated the VM).

I'm planning to build some kind of scoring system, based on the number of posts per subreddit (per day?), and the number of subscribers on the lemmy community. That way communities with few subscribers or that don't see many posts per day will only be updated once per hour.

At the same time, I feel that subs like AskReddit, OutOfTheLoop and other "question-based" subreddits shouldn't be archived by Lemmit. In my opinion those kinds of posts are useless without their answers, but please let me know if you disagree.

15
 
 

[email protected] doesn't match up with r/perchance. Its last post was 5d ago.

16
 
 
  • Fixed a bug where posts would not be submitted because the title didn't contain long enough words.
  • Fixed a bug where posts would not be submitted because the url was too long.
  • Fixed a bug where posts would not be submitted when it was linking to a /user subreddit.
  • Fixed a bug where the bot would think Every Post Everywhere was a subreddit request, and would reply to it.
  • Fixed a bug where the bot would crash without recovering whenever something went wrong during new subreddit requests.

A fruitful day all in all, I'd say.

17
 
 

That the replies-everywhere-bug was just because I forgot to include a variable in the bot deployment? 🤦

18
 
 

Long story short: I messed up with the domain registration for this instance, and never replied to a mandatory email. The domain name (lemmit.online) got suspended, causing disconnects all over the fediverse.

I fixed it as soon as I found out, but it will probably take a few more hours for the issues to be fully fixed.

So ehm. Whoops. Hope this explains and fixes the federation issues we've been having today.

19
 
 

If Lemmit is federated, doesn't that mean that all of its crossposts of reddit content should show up in everyone's "All" feed (for example, on Lemmy.world)?

20
 
 

When I went to https://feddit.de/c/[email protected] I noticed the sidebar links to /c/about - which doesn't exist on feddit.de

I'd suggest changing the link to https://lemmit.online/c/about so people from other instances can find it more easily.

21
 
 

Current reply:

I'll get right on that. Check out /c/[email protected]!

Proposed reply:

I'll get right on that. Check out /c/[email protected]!

Click here to fetch this community for your lemmy instance if you get a 404 error with the link above.

22
23
 
 

Most importantly that the bot no longer crashes (and does nothing all night while I sleep 😛) when trying to create a community that has already been requested.

Furthermore mostly making the code prettier and adding tests.

24
 
 

In card view all you see is the bot header and none of the actual content. Maybe it should go at the bottom and just prepend something short like [Lemmit Bot]

This bit.. This is an automated archive made by the Lemmit Bot. The original was posted on /r/ayaneo by /u/HystericalBanana on 2023-06-17 21:43:04+00:00.

25
 
 

Hey, I'm working on a fork for the Lemmit bot that cross-posts the comments in a thread into Lemmy, while I removed subreddit requests so the bot only checks a few hard-coded subreddits and doesn't get rate-limited. The subreddits I want to copy are discussion-heavy and the comments are often in nested threads. While I can simplify the number of requests by fetching the RSS feed for posts and iterating through each post's RSS feed for comments, this causes the comment thread's structure to be lost as the feed returns all of the comments in the thread without any tags to identify parent and child comments. Scraping each individual comment seems too inefficient, especially in large threads with thousands of comments. Would anyone happen to know of the best way to retrieve comments while preserving the thread structure and not using the API, if that is possible?
