this post was submitted on 04 Feb 2024
432 points (89.1% liked)
memes
If you look at the repo, they give thanks to:
"The commoncrawl organization for crawling the web and making the dataset readily available. Even though we have our own crawler now, commoncrawl has been a huge help in the early stages of development."
I can't find anything that says how much of the index comes from Common Crawl (CC) and how much comes from their own crawler. If a decent share of it is CC, keep in mind that the dataset was originally meant for researchers and the like, so it's not the best resource in the world for a search index: https://commoncrawl.org/
That being said, it's always good to see an independent search engine take on the massive task of actually building its own index rather than becoming a proxy.