this post was submitted on 24 Jun 2023

52 points (100.0% liked)

Is Lemmy search-engine unfriendly? (vlemmy.net)

submitted 1 year ago by [email protected] to c/[email protected]

Any post or community can be reached through a theoretically limitless number of instances, which also means a theoretically limitless number of URLs.

Will this keep Lemmy from ever reaching the mainstream? If I type any topic into Google, I get a Reddit thread that deals with it. Can something like that ever happen for Lemmy?

[–] [email protected] 22 points 1 year ago (1 children)

I think if canonicals are applied correctly, it should not be an issue?

[–] [email protected] 6 points 1 year ago (1 children)

I think you're right. Looking at the HTML source for this page, I don't see a canonical tag, though. Maybe they haven't added it yet? Or I missed it.

[–] marsara9 2 points 1 year ago (2 children)

Would the canonical tag make any sense for Lemmy? The problem is that when you search for something, your preferred site / URL is your own instance. So the canonical would be different for every user?

[–] [email protected] 4 points 1 year ago

Yeah, I think that's one of the user experience issues we're facing. Setting the canonical as the original server makes the most sense, but that would mean if you find something interesting via a search engine you have to figure out how to get it to show up on your home instance.

Like for me, since I run my own instance for myself and one other person so far, I have to find interesting communities manually. It's really annoying. Though, looking at Lemmy v0.18 release notes, a lot of new devs have made contributions and I'm sure more will help in the future. One improvement from yesterday's release is visiting a remote community on your home server will pull the community rather than returning a 404. I think changes like that are big first steps towards improving this specific aspect of the user experience.
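To illustrate the remote-community lookup: Lemmy addresses a remote community from your home instance as `/c/name@remote-domain`. The helper below is only a sketch, not part of Lemmy. `home_community_url` is a hypothetical name, and it assumes the source URL has the usual `https://<domain>/c/<name>` shape.

```python
from urllib.parse import urlparse


def home_community_url(source_url, home_domain):
    """Rewrite a community URL so it opens on the reader's home instance.

    Assumes source_url looks like https://<domain>/c/<name>; anything else
    raises ValueError rather than guessing.
    """
    parsed = urlparse(source_url)
    parts = [p for p in parsed.path.split("/") if p]
    if not parsed.netloc or len(parts) != 2 or parts[0] != "c":
        raise ValueError(f"not a community URL: {source_url!r}")
    name = parts[1]
    if "@" in name:
        # Already a remote community as seen from another instance:
        # keep the domain it originally came from.
        name, origin = name.split("@", 1)
    else:
        origin = parsed.netloc
    if origin == home_domain:
        # The community lives on the home instance, so no @domain suffix.
        return f"https://{home_domain}/c/{name}"
    return f"https://{home_domain}/c/{name}@{origin}"
```

For example, `home_community_url("https://lemmy.ml/c/lemmy", "example.social")` gives a link that opens the community through `example.social`.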

[–] [email protected] 4 points 1 year ago

The canonical makes sense for the search engine (e.g. Google). I would put the canonical on the source instance.

That leaves open the question of what happens if the source instance disappears…
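To make the canonical idea concrete: every instance that serves a copy of a post would put a `<link rel="canonical">` in the page's `<head>` pointing at the post's URL on the source instance, so a crawler can fold all the copies into one result. A minimal sketch follows. `canonical_link_tag` is a made-up helper, and it assumes each post record carries its original URL in an `ap_id` field (my understanding of the Lemmy API, not something confirmed in this thread).

```python
from html import escape


def canonical_link_tag(post):
    """Build the canonical <link> tag for a post, pointing at its source instance.

    Assumes post["ap_id"] holds the post's original URL (an assumption).
    """
    source_url = post["ap_id"]
    # Escape the URL so it is safe inside a double-quoted HTML attribute.
    return f'<link rel="canonical" href="{escape(source_url, quote=True)}">'
```

This says nothing about the disappearing-source case; if the source instance goes away, the canonical URL would simply stop resolving.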

[–] fubo 16 points 1 year ago (2 children)

Currently it appears that a non-logged-in user (try an incognito window!) will only see posts on a particular server's local communities. So a search engine bot crawling multiple Lemmy servers will only see duplicates if they've been explicitly crossposted.

[–] [email protected] 7 points 1 year ago* (last edited 1 year ago)

No, you can definitely browse "all" while logged out. It just defaults to local.

[–] [email protected] 4 points 1 year ago

Yes, I'm sure it won't take too long for search engines to cotton on and start indexing Fediverse stuff. After all, it's going to be getting linked from other sites, which will cause the spider to head to the source URL and start having a gander.

[–] [email protected] 11 points 1 year ago (1 children)

This is in fact my biggest worry about Lemmy's future. People need to be able to search for stuff, and I currently don't see how.

[–] marsara9 44 points 1 year ago (5 children)

I'm doing tests in the next couple days. But I'm trying to build a search engine specifically for Lemmy.

It should in theory work similar-ish to Google / Bing.
You can filter by instance, community or author.
It only indexes Lemmy posts and it won't keep duplicates.
It'll also open any link you find in your instance.
You'll be able to self host it and point it to any instance you want as well.

I'm hoping I can open it to the public in a week or so.
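The filtering and de-duplication described above could look roughly like this. This is not the project's actual code. The record fields (`ap_id`, `instance`, `community`, `author`) and both function names are assumptions made for illustration, with `ap_id` standing in for the post's original URL.

```python
def dedupe_posts(posts):
    """Keep the first copy of each post, keyed by its original URL."""
    seen = set()
    unique = []
    for post in posts:
        if post["ap_id"] not in seen:
            seen.add(post["ap_id"])
            unique.append(post)
    return unique


def filter_posts(posts, instance=None, community=None, author=None):
    """Return posts matching every filter that was supplied (None means any)."""
    wanted = {"instance": instance, "community": community, "author": author}
    return [
        post
        for post in posts
        if all(
            value is None or post.get(field) == value
            for field, value in wanted.items()
        )
    ]
```

Keying on the original URL means a post mirrored to many instances is stored once, whichever instance it was crawled from.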

[–] [email protected] 3 points 1 year ago (1 children)

Cool! How does it technically work? Does it fetch all titles (and maybe the body and comments) via the api from each instance or do you set up your own private instance and tap into the instance database?

[–] marsara9 7 points 1 year ago* (last edited 1 year ago) (1 children)

I'm using the public API to grab every post / comment and then I essentially replace the content with only the unique words. Then when you go to search it just looks for any post or comment, in my database, that has the words you typed in. Finally I sort based on the number of upvotes.
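Here is a small sketch of that unique-words approach, using an in-memory list in place of a real database. All names are hypothetical, and it assumes a match requires every query word to be present (the description could also be read as "any word").

```python
import re


def unique_words(text):
    """Lower-case the text and reduce it to its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(items):
    """Keep only what search needs: an id, the unique words, and the upvotes."""
    return [
        {
            "id": item["id"],
            "words": unique_words(item["text"]),
            "upvotes": item["upvotes"],
        }
        for item in items
    ]


def search(index, query):
    """Return entries containing every query word, most upvoted first."""
    wanted = unique_words(query)
    if not wanted:
        return []
    hits = [entry for entry in index if wanted <= entry["words"]]
    return sorted(hits, key=lambda entry: entry["upvotes"], reverse=True)
```

Because only the word set survives, the index can answer "which posts mention these words" but cannot reproduce the original text.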

Right now it only crawls a specific instance that you point it to. But as long as that instance is federated it /should/ get everything. Eventually I plan on using that instance's list of federated instances to scan everything and lighten the load on any one particular instance.
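A rough sketch of the two pieces such a crawler needs: a paged post-listing URL, and the list of domains an instance federates with. The endpoint paths (`/api/v3/post/list`, `/api/v3/federated_instances`), the query parameters, and the response shape reflect my understanding of Lemmy's v3 HTTP API and should be checked against the API docs for the version you run. The function names are hypothetical, and no network calls are made here.

```python
from urllib.parse import urlencode


def post_list_url(instance, page, limit=50):
    """Build the URL for one page of posts from an instance (assumed endpoint)."""
    query = urlencode({"type_": "All", "sort": "New", "page": page, "limit": limit})
    return f"https://{instance}/api/v3/post/list?{query}"


def linked_domains(federated_response):
    """Extract domains from a parsed federated_instances response.

    Handles entries that are plain strings or objects with a "domain" key,
    since the shape appears to differ between Lemmy versions (assumption).
    """
    federated = federated_response.get("federated_instances") or {}
    domains = []
    for entry in federated.get("linked") or []:
        domain = entry if isinstance(entry, str) else entry.get("domain")
        if domain:
            domains.append(domain)
    return domains
```

Crawling each linked domain for its own local posts, rather than asking one instance for everything, is what would spread the load.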

Edit: I thought about tapping into the existing database, but it is geared more towards serving content than searching. The database I'm building is searchable, but I drop so much of the original data that it's useless for serving content.

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago) (1 children)

Now I'm curious what your stack is? Are you using an elastic database?

[–] marsara9 8 points 1 year ago

HTML + JavaScript frontend. Rust backend with a postgres database.

It'll be open sourced once I can get the MVP ready.

[–] [email protected] 3 points 1 year ago (1 children)

Would it be possible to also integrate kbin?

[–] marsara9 1 points 1 year ago

I originally wanted to include the entire fediverse but ActivityPub might have some limitations.

Mainly, it appears to be incredibly chatty. For it to work, my server, which right now is just a Raspberry Pi, would need to handle requests for every single action, or Activity, that any user takes. Long story short, every action someone takes on the fediverse creates approximately one network request for every instance in the fediverse. With approximately 10,000 instances... you now see the problem.

Now, a ray of light in all of that is that I assume Kbin has a public API as well, and I should be able to create an interface for that. But it's not on my radar for the first release.
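The arithmetic behind that, as a quick sketch. The 10,000 instances figure is the estimate above; the actions-per-day number is a made-up placeholder, not a measurement, and the model (one request per action per instance) is the simplification described here.

```python
SECONDS_PER_DAY = 86_400


def total_delivery_requests(actions, instances):
    """Requests across the network if each action is sent once to every instance."""
    return actions * instances


def inbound_requests_per_second(actions_per_day):
    """Average load on a single server that receives one request per action."""
    return actions_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    instances = 10_000           # estimate quoted above
    actions_per_day = 1_000_000  # hypothetical placeholder
    print(total_delivery_requests(actions_per_day, instances))
    print(round(inbound_requests_per_second(actions_per_day), 1))
```

The network-wide total grows with the number of instances, while a single listening server still has to keep up with every action taken anywhere.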

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago) (1 children)

Please make sure that you're only indexing Lemmy communities and Kbin magazines (i.e. not microblogs)

In the wider fediverse, there is an actual expectation of privacy beyond "well it's technically possible to scrape everything so we may as well give up". Several people (with reasons of innocent naivete & explicit and blatant malice alike) have tried making fediverse search engines, but all of them are either dead or blocked.

Lemmy/Kbin is in a unique position where global search does make some sense to have, due to it being a public forum focused on topics (and not people), but there is a very real chance that assholes could use an "unbounded" fediverse search engine to find vulnerable people (quite a few of them specifically fleeing to the fediverse to avoid that kind of problem) and harass them.

[–] [email protected] 2 points 1 year ago (1 children)

The concept of privacy within today's Fediverse is asinine, and everyone should be pointing that out at every opportunity. Doing otherwise, making believe that some sort of code of conduct or public shame cycle is somehow going to keep people safe, is ridiculous and even more dangerous than a public search engine. By not talking, very loudly, about just how trivial it is to gather this data and how impossible it is to remove it, we're sticking our heads in the sand, and there will be people who suffer as a result.

[–] [email protected] 1 points 1 year ago

Feel free to go do that then

[–] muaveri 1 points 1 year ago

sounds promising, can't wait to test it

[–] [email protected] 9 points 1 year ago (1 children)

It's probably the search engine that is unfriendly to Lemmy and others.

[–] [email protected] 3 points 1 year ago

I like your philosophical view of the matter.

[–] T156 7 points 1 year ago* (last edited 1 year ago) (1 children)

I think so, at least compared to more centralised sites, since there's no index or aggregator that concentrates that information in a place that can be easily accessible.

Unlike with a place like Reddit, I can't just stick "Lemmy" or "site:lemmy.ml" on the end of my query and expect to get all the information across all the instances. At best, I would be able to search what's contained within an individual instance, which makes searching more trouble, since you have to know the server and community you're looking for first.

[–] [email protected] 3 points 1 year ago

What's contained within an individual instance is whatever's viewable from that instance, though, as remote content is mirrored locally. So, any early instance that is well federated and subscribed to a large number of remote communities should work well as a search target.

So site:lemmy.ml actually shouldn't end up being too bad of a query.
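If one instance turns out not to be enough, the same trick extends to a handful of well-federated ones by OR-ing several `site:` filters together. A tiny sketch; `site_query` is a hypothetical helper, and which instances to list is up to you.

```python
def site_query(terms, instances):
    """Append site: filters for one or more instances to a search query."""
    if not instances:
        return terms
    sites = " OR ".join(f"site:{domain}" for domain in instances)
    if len(instances) > 1:
        # Group the alternatives so OR does not swallow the search terms.
        sites = f"({sites})"
    return f"{terms} {sites}"
```

With a single instance this just produces the familiar `topic site:lemmy.ml` form.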
