this post was submitted on 15 Jun 2023
54 points (98.2% liked)

Lemmy.World Announcements


One of the most unforgivable things about reddit is how pathetic the search engine is, considering the amount of free, top-notch information captured there; you need google +reddit to get at it. What can we do to make federated alternatives self-searchable?

top 14 comments
[–] orionstein 16 points 2 years ago (1 children)

This is going to be even worse than reddit search, unfortunately. There's no easy way to make a search like this scale, even for the small number of instances we know about. Considering there are tons of instances out there and there will probably be more in the future, these problems are going to crop up a lot more. It's actually much easier to search in one centralized location, however the reddit search actually ended up being implemented.

[–] [email protected] 3 points 2 years ago (1 children)

It does become very fragmented. A post on my single-user server is going to be low down the rankings compared to the same post on a subreddit with the weight of the reddit domain name behind it. I'm also not entirely sure if/how content here gets indexed, especially when it appears under different federated domains. Content discovery is very different in a distributed world.

[–] MalReynolds 2 points 2 years ago

Perhaps not, Google et al. will likely grab it all anyway, perhaps we can be forward facing. Actually, while they're privacy invading, the benefit of keeping freely given info still stands, and may, in the long term, prevail. One may hope...

[–] MalReynolds 8 points 2 years ago (1 children)

Discord, for example, means all useful information is captured by Discord, never to be searched by plebs. IRC is usually ephemeral. Most web search has been diluted by SEO and content farms to the point of uselessness. Perhaps we can think about next-gen search right now. A point of hope is things like gigabrain, which, it would seem, use LLMs to 'cut through the noise' but also to summarize and collate. That seems like a useful way forward if distributed. Happy to look into it myself, but would like to hear others' input. (Pleasantly, people were commenting before I finished.)

[–] lenninscjay 6 points 2 years ago (1 children)

Don't know how to help but agree on how important search is. Which might be even harder to do given federation.

Also upvote for firefly user name

[–] [email protected] 5 points 2 years ago

Eventually I hope lemmy.directory will be great for this purpose. It's a Lemmy instance configured to pick up every Lemmy community it can find.

[–] [email protected] 3 points 2 years ago

I work for a small company that runs a website with lots of information and our search has always sucked. We tried several tweaks and free solutions - the final decision was to pay for search which is what we did and it is awesome now, but expensive. A major company like Reddit should be able to figure it out, but search is harder than most people realize. Google just makes it look easy.

[–] [email protected] 2 points 2 years ago (1 children)

Simplest implementation is that an instance searches its own content while sending requests to federated instances and merging their results in with its own based on whatever method the instance admins want (whether it puts its own results at the top, or treats them as one set, or whatever). That could cause a lot of traffic and has a load of latency while your search spreads out hop by hop, to the instances that yours is federated with, to the ones they're federated with, etc. Plus you'd need a mechanism to stop instances from sending a search to an instance that's already got it, to avoid hammering instances that have multiple federation paths to yours. Not an easy problem.
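A minimal sketch of the fan-out described above, in Python, assuming a toy in-memory model rather than anything in Lemmy's actual codebase. The `Instance` class, `seen_queries` set and `max_hops` limit are all hypothetical names made up for illustration. Each search carries a query ID, and an instance that has already handled that ID returns nothing, which is one way to avoid hammering instances reachable over several federation paths.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical in-memory stand-in for a federated instance."""
    name: str
    posts: list
    peers: list = field(default_factory=list, repr=False)
    seen_queries: set = field(default_factory=set, repr=False)

    def search(self, term, query_id=None, max_hops=2):
        """Return (instance_name, post) hits from this instance and its peers."""
        query_id = query_id or uuid.uuid4().hex
        # Ignore a query this instance has already handled, so an instance
        # reachable over several federation paths is only searched once.
        if query_id in self.seen_queries:
            return []
        self.seen_queries.add(query_id)

        # Local results go first; an admin could merge differently.
        hits = [(self.name, p) for p in self.posts if term.lower() in p.lower()]
        if max_hops > 0:
            for peer in self.peers:
                hits.extend(peer.search(term, query_id, max_hops - 1))
        return hits
```

This version walks peers one after another, so the latency problem is still there; a real implementation would presumably send requests in parallel with timeouts. The hop limit is also crude: an instance first reached at the last allowed hop will not forward the query, even if a shorter path to it exists.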

You might be able to do some kind of index publication where an instance publishes the most notable posts for other instances to include in their indexes, so that when you search it could show you results from among hot posts elsewhere in the fediverse - not an exhaustive list, but a search within posts that are getting attention.

There's also other stuff I'd be tempted to experiment with, like using some kind of TF-IDF ranking to choose what counts as "most notable", rather than just activity or view count, so that posts that are particularly relevant to certain topics could be publicised. An instance could even choose to filter that, so for example an instance who chooses to focus on tech topics could publicise highly-relevant tech posts but filter out politics keywords even when a post gets high relevance scores, so that political discussion on that instance is less visible, even when searched for.
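As a rough illustration of the TF-IDF idea, here is a small Python sketch, assuming posts are plain strings and that the instance admin supplies its own topic and blocked keyword lists. The function name `notable_posts` and its parameters are hypothetical, and the scoring is the plain textbook form (term frequency times log of inverse document frequency), not a tuned ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def notable_posts(posts, topic_terms, blocked_terms=(), top_n=2):
    """Rank posts by summed TF-IDF of topic_terms, skipping blocked posts.

    topic_terms and blocked_terms are expected to be lowercase tokens.
    """
    docs = [tokenize(p) for p in posts]
    n_docs = len(docs)
    # Document frequency: in how many posts each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    scored = []
    for post, doc in zip(posts, docs):
        # Instance-level filter: posts with a blocked keyword are never publicised.
        if not doc or any(t in doc for t in blocked_terms):
            continue
        counts = Counter(doc)
        score = sum(
            (counts[t] / len(doc)) * math.log(n_docs / df[t])
            for t in topic_terms
            if df[t]
        )
        if score > 0:
            scored.append((score, post))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored[:top_n]]
```

A tech-focused instance could call this with `topic_terms=["rust", "linux", "kernel"]` and `blocked_terms=["politics"]` to pick what it publishes to other instances' indexes.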

[–] MalReynolds 1 points 2 years ago (1 children)

Thank you for applying solid thought. What there would you consider actionable? As in, could likely be coded (for free).

[–] [email protected] 1 points 2 years ago

Any of that could be done; there's some parts that are more challenging but there are certainly harder things that have been solved by open-source software. I know almost nothing about how Lemmy's innards are built though, so I couldn't hazard a guess as to how much effort any of it would take. Some of it could possibly be achieved through separate services that you could host alongside a Lemmy instance, or entirely on their own, while other parts would really work best as features within Lemmy's own codebase.

[–] [email protected] 2 points 2 years ago

In the past I normally used Pushshift to search Reddit due to how poor the search engine was. I think it was only very recently that they finally added comment searching.

[–] [email protected] 1 points 2 years ago* (last edited 2 years ago)

I've posted about this before as it relates to mod tools.^1^

The search part isn't all that difficult; there are open source search engines that make it easy enough for admins to configure a decent search feature. The more difficult issue is aggregating the data from all our instances into a single source that we can query with those existing search engine tools.

I am going to spend some time this weekend working on a proof of concept for a search engine for mod tools. Big picture solution is:

  1. Instance admins regularly dump anonymized (i.e. no PII) post and comment data to a public source (possibly torrent, possibly sftp)
  2. Other instance admins download each other's data and feed it into their search db (e.g. Elasticsearch)
  3. Mods & users create tools using this data
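
Steps 1 and 2 could look roughly like this in Python. It is only a sketch under assumptions: the record fields (`id`, `author`, `email`, `ip`, `body`) are made-up names, not Lemmy's schema, and a plain in-memory inverted index stands in for a real search db such as Elasticsearch.

```python
import json
from collections import defaultdict

# Assumed field names that would count as PII in this toy record format.
PII_FIELDS = {"author", "email", "ip"}


def anonymize(record):
    """Return a copy of the record with the assumed PII fields dropped."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


def dump_jsonl(records):
    """Step 1: serialise anonymized records, one JSON object per line."""
    return "\n".join(json.dumps(anonymize(r), sort_keys=True) for r in records)


def build_index(jsonl_dumps):
    """Step 2: merge dumps from several instances into one inverted index."""
    index = defaultdict(set)
    for dump in jsonl_dumps:
        for line in dump.splitlines():
            doc = json.loads(line)
            for word in doc["body"].lower().split():
                index[word].add(doc["id"])
    return index
```

Dropping fields is the bare minimum; usernames or other identifying details inside the post body would need separate handling before a dump is made public.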

BTW: this isn't a novel idea:

  • This is what pushshift is for reddit (check out their FAQ/wiki). We're missing mod tools big time, and searching/aggregating is a huge part of mod tools.
  • Up until recently, like last week, Stack Exchange provided a regular dump of their data to the Internet Archive for posterity's sake

EDIT: Linked my OG post on the subject ^1^