this post was submitted on 05 Mar 2024

163 points (95.0% liked)


In an age of LLMs, is it time to reconsider human-edited web directories? (aus.social)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]

80 comments

In an age of LLMs, is it time to reconsider human-edited web directories?

Back in the early-to-mid '90s, one of the main ways of finding anything on the web was to browse through a web directory.

These directories generally had a list of categories on their front page. News/Sport/Entertainment/Arts/Technology/Fashion/etc.

Each of those categories had subcategories, and sub-subcategories that you clicked through until you got to a list of websites. These lists were maintained by actual humans.

Typically, these directories also had a limited web search that would crawl through the pages of websites listed in the directory.

Lycos, Excite, and of course Yahoo all offered web directories of this sort.

(EDIT: I initially also mentioned AltaVista. It did offer a web directory by the late '90s, but this was something it tacked on much later.)

By the late '90s, the standard narrative goes, the web got too big to index websites manually.

Google promised the world its algorithms would weed out the spam automatically.

And for a time, it worked.

But then SEO and SEM became a multi-billion-dollar industry. The spambots proliferated. Google itself began promoting its own content and advertisers above search results.

And now with LLMs, the industrial-scale spamming of the web is likely to grow exponentially.

My question is, if a lot of the web is turning to crap, do we even want to search the entire web anymore?

Do we really want to search every single website on the web?

Or just those that aren't filled with LLM-generated SEO spam?

Or just those that don't feature 200 tracking scripts, and passive-aggressive privacy warnings, and paywalls, and popovers, and newsletters, and increasingly obnoxious banner ads, and dark patterns to prevent you cancelling your "free trial" subscription?

At some point, does it become more desirable to go back to search engines that only crawl pages on human-curated lists of trustworthy, quality websites?

And is it time to begin considering what a modern version of those early web directories might look like?

@degoogle #tech #google #web #internet #LLM #LLMs #enshittification #technology #search #SearchEngines #SEO #SEM

top 50 comments


[–] [email protected] 24 points 1 year ago (1 children)

Lycos, Excite, AltaVista, and of course Yahoo all were originally web directories of this sort.

Both Wikipedia and my own memory disagree with you about Lycos and AltaVista. I'm pretty sure they both started as search engines. Maybe they briefly dabbled in being "portals".

[–] [email protected] 4 points 1 year ago (1 children)

@bsammon And this Archive.org capture of Lycos.com from 1998 contradicts your memory: https://web.archive.org/web/19980109165410/http://lycos.com/

See those links under "WEB GUIDES: Pick a guide, then explore the Web!"?

See the links below that say Autos/Business/Money/Careers/News/Computers/People/Education/Shopping/Entertainment/Space/Sci-Fi/Fashion/Sports/Games/Government/Travel/Health/Kids

That's exactly what I'm referring to.

Here's the page where you submitted your website to Lycos: https://web.archive.org/web/19980131124504/http://lycos.com/addasite.html

As far as the early search engines went, some were more sophisticated than others, and they improved over time. Some simply crawled the webpages on the sites in the directory, others

But yes, Lycos was definitely an example of the type of web directory I described.

[–] [email protected] 6 points 1 year ago

1998 isn't "originally" when Lycos started in 1994. That 1998 snapshot would be their "portal" era, I'd imagine.

And the page where you submitted your website to Lycos -- that's no different than what Google used to have. It just submitted your website to the spider. There's no indication in that snapshot that suggests that it would get your site added to a curated web-directory.

Those late 90's web-portal sites were a pale imitation of the web indices that Yahoo, and later DMoz/ODP were at their peak. I imagine that the Lycos portal, for example, was only managed/edited by a small handful of Lycos employees, and they were moving as fast as they could in the direction of charging websites for being listed in their portal/directory. The portal fad may have died out before they got many companies to pony up for listings.

I think in the Lycos and AltaVista cases, they were both search engines originally (mid 90s) and then jumped on the "portal" bandwagon in the late 90s with half-assed efforts that don't deserve to be held up as examples of something we might want to recreate.

Yahoo and DMoz/ODP are the only two instances I am aware of that had a significant (like, numbered in the thousands) number of websites listed, and a good level of depth.

[–] [email protected] 20 points 1 year ago (1 children)

Main problems are:

1. Link rot
2. Sneakily inserted sponsored links

[–] [email protected] 6 points 1 year ago

@Moonrise2473 @ajsadauskas
3. Infinitely growing list of categories.
4. Mis-categorisation

I remember learning HTML (4.0) and reading that you should put info in a <meta> tag about the categories your page fits in, and that this would help search engines. Did it also help web directories?
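For anyone who never saw one, the tag being remembered here is presumably `<meta name="keywords">`, where the page author listed comma-separated topics. Below is a minimal sketch, using only Python's standard library, of how a crawler or directory tool could read those self-declared categories. The sample page is invented for illustration, and nothing in this thread says whether any particular directory actually relied on the tag.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Split on commas and drop empty entries.
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


def extract_keywords(html):
    """Return the keywords a page declares about itself (may be empty)."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical page, made up for this example.
SAMPLE_PAGE = """<html><head>
<title>Freshwater aquarium care</title>
<meta name="keywords" content="aquariums, fishkeeping, pets">
</head><body></body></html>"""

print(extract_keywords(SAMPLE_PAGE))
```

The obvious weakness is that the categories are whatever the page author claims, which ties back to the mis-categorisation problem listed above.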

[–] [email protected] 14 points 1 year ago (2 children)

@ajsadauskas @degoogle I guess the problem though is how you make sure they are actually maintained by a human acting in good faith. The way community Facebook groups meant to be for this kinda thing get spammed by likely fake businesses doesn’t give me hope

[–] [email protected] 4 points 1 year ago (1 children)

@joannaholman @degoogle Good point.

If it were run as a private company, I think the solution might be just to pay actual humans as employees.

If it's a community-run project, the challenge would be to come up with a robust moderation system...

[–] [email protected] 2 points 1 year ago

@ajsadauskas @joannaholman @degoogle maybe a mix of wikipedia and search engine would be nice. WikiSearch?

[–] [email protected] 2 points 1 year ago

I suppose any measures at all would cut out a massive number of spam pages already.

[–] [email protected] 12 points 1 year ago

@ajsadauskas @degoogle What we need to do is re-visit the GnuPG philosophy of building rings of trust. If one emerges with enough people proven to provide quality aggregators/summarizers then we can start to depend on that, or those.

[–] [email protected] 10 points 1 year ago

@ajsadauskas @degoogle Webrings! Bring back Webrings!

[–] [email protected] 9 points 1 year ago (1 children)

I'd argue that link aggregators like Lemmy (from which I'm posting o/) are the new world version of that. Link aggregators are human-edited web directories; humans post links and other humans vote whether those links are relevant to the "category" (community) they're in. The main difference is that it's an open communal effort with implicit trust rather than closed groups of permitted editors.

[–] SomeKindaName 7 points 1 year ago (2 children)

The problem is bots

[–] Potatos_are_not_friends 3 points 1 year ago* (last edited 1 year ago)

I'm saddened to say that one of my jobs in 2014 was to build bots for a company. And the first thing they did was use them to spam social media with links, with more bots replying to the original bot to appear more human and give it "authority" and "social proof". That practice boosted sales dramatically.

Some of the bot libraries that are openly available, along with AI, make the things I did look like child's play.


[–] [email protected] 9 points 1 year ago (1 children)

Reddit and Lemmy are supposed to be what you want: link aggregators.

We're supposed to link to sites and pages and people vote on how good they are in the context of the sub community topic.

Of course, then Ron Paul happened, and now it's just memes and Yank politics so... maybe deploy Lemmy and turn off comments.

[–] [email protected] 3 points 1 year ago (1 children)

I think you are mostly right, except Lemmy and reddit are not organized.


[–] [email protected] 9 points 1 year ago (8 children)

I used them and contributed to links as well - it was quite a rush to see a contribution accepted because it felt like you were adding to the great summary of the Internet. At least until the size of the Internet made it impossible to create a user-submitted, centrally-approved index of the Net. And so that all went away.

What seemed like a better approach was social bookmarking, like del.icio.us, where everyone added, tagged and shared bookmarks. The tagging basically crowd-sourced the categorisation and meant you could browse, search and follow links by tags or by the users. It created a folksonomy (thanks for the reminder Wikipedia) and, crucially, provided context to Web content (I think we're still talking about the Semantic Web to some degree but perhaps AI is doing this better). Then after a long series of takeovers, it all went away. The spirit lives on in Pinterest and Flipboard to some degree but as this was all about links it was getting at the raw bones of the Internet.
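To make the tagging idea concrete, here is a rough sketch of the data structure behind that kind of social bookmarking: every user attaches free-form tags to a URL, and the index can then be browsed by tag or by user. The class name, method names and URLs are all invented for illustration; this is not how del.icio.us was actually implemented.

```python
from collections import defaultdict


class BookmarkStore:
    """Toy folksonomy index: free-form tags, shared across users."""

    def __init__(self):
        self.by_tag = defaultdict(set)   # tag -> set of URLs
        self.by_user = defaultdict(set)  # user -> set of URLs

    def add(self, user, url, tags):
        """Record that `user` bookmarked `url` under the given tags."""
        self.by_user[user].add(url)
        for tag in tags:
            # Lower-case tags so "Aquarium" and "aquarium" merge.
            self.by_tag[tag.lower()].add(url)

    def with_tags(self, *tags):
        """URLs that carry every one of the given tags, sorted."""
        sets = [self.by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*sets)) if sets else []

    def bookmarks_of(self, user):
        """Everything one user has bookmarked, sorted."""
        return sorted(self.by_user.get(user, set()))


# Hypothetical users and URLs.
store = BookmarkStore()
store.add("alice", "https://example.org/tanks", ["aquarium", "howto"])
store.add("bob", "https://example.org/tanks", ["Aquarium", "fish"])
store.add("bob", "https://example.net/css", ["howto", "web"])

print(store.with_tags("aquarium", "howto"))
print(store.bookmarks_of("bob"))
```

Nobody designs the category tree up front; the categories are simply whatever tags people end up using, which is the crowd-sourcing part.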

I've been using Postmarks, a single-user social bookmarking tool, but it isn't really the same as del.icio.us, because part of what made it work was the easy discoverability and sharing of other people's links. So what we need is, as I named my implementation of Postmarks, Relicious: pretty much del.icio.us but done Fediverse style, so you sign up to instances with other people (possibly run on shared interests or region, so you could have a body modification instance or a German one, for example) and get bookmarking. If it works and people find it useful, a FOSS Fediverse implementation would be very difficult to make go away.

[–] [email protected] 3 points 1 year ago (1 children)

Pinboard and TinyGem come to mind.

[–] [email protected] 3 points 1 year ago

Oh indeed there are services out there that do something similar to Delicious, but I put a lot into that site only for it all to disappear due to the whims of some corporate overlord and I am not doing that again. What I am looking for is an easy Fediverse solution so my data is never lost again. Postmarks is definitely getting there but as a single-user service it isn't quite what I am looking for.

[–] [email protected] 2 points 1 year ago (1 children)

@Emperor
This this this! Some kind of service that would sit alongside a fedi instance and serve as a community directory.
@ajsadauskas


[–] _WC 8 points 1 year ago

I had this exact thought earlier today. Either curated directories, or a ground-up, vetted search engine that only pulls from pre-screened sources.

[–] merthyr1831 7 points 1 year ago

This is how it's gonna go. We'll get human-curated search results, before someone "innovates" by mildly automating the process until someone "innovates" again by using AI to automate it further. Time is a circle.

[–] [email protected] 6 points 1 year ago (1 children)

@ajsadauskas @degoogle i love this idea, i'm going to start my own web directory.

[–] [email protected] 4 points 1 year ago

Do it!

Then federate it.

[–] [email protected] 5 points 1 year ago* (last edited 1 year ago)

@ajsadauskas @degoogle I actually contributed to one! I was a writer at LookSmart for four years; we manually created categories and added websites to them, with short descriptive reviews. Though an algorithm listed more sites below our selections, we could force the top result, e.g. we'd make sure the most relevant website was the first result of a search on that topic. Old-skool now, but had better results in some ways.

[–] [email protected] 5 points 1 year ago (2 children)

@ajsadauskas @degoogle Since I run a small directory this is a fascinating conversation to me.

There is a place for small human edited directories along with search engines like Wiby and Searchmysite which have human review before websites are entered. Also of note: Marginalia search.

I don't see a need for huge directories like the old Yahoo, Looksmart and ODP directories. But directories that serve a niche ignored by Google are useful.

[–] [email protected] 2 points 1 year ago

@bradenslen @ajsadauskas @degoogle looksmart! There's a blast from the past.

As a very early internet user (suburbia.org.au- look it up, and who ran it) and a database guy, what I learnt very early is that any search engine needed users who knew how to write highly selective queries to get highly specific results.

Google - despite everything - can still be used as a useful tool - if you are a skilled user.

I am still surprised that you are not taught how to perform critical internet searching in primary school. It is as important as the three Rs


[–] [email protected] 5 points 1 year ago

And is it time to begin considering what a modern version of those early web directories might look like?

Something like fmhy.net?

[–] [email protected] 4 points 1 year ago

The tale of the internet has been curation, and I would describe it a little differently.

First we had handmade lists of websites (Yahoo directory, or we had a list of websites literally written in pen in a notebook saying "yahoo.com" and "disney.com").

Then it was bot-assisted search engines like Google.

Then there was so much content we didn't even know where to start with Google, so we had web rings, then forums, then social media to recommend where to go. Then substack style email newsletters from your chosen taste makers are a half-step further curated from there.

If that is all getting spammed out of existence, I think the next step is an AI filter, you tell the AI what you like and it sifts through the garbage for you.

The reasons we moved past each step are still there, we can't go back, but we can fight fire with fire.

[–] [email protected] 4 points 1 year ago (2 children)

@ajsadauskas @degoogle definitely something to be thinking about. More and more I’m using my followed hashtags, mastodon lists, and links to resources other people provide rather than just finding useful things in search results. But the big gap is still when I want to find quality info on a new topic. Cannot trust any of the damn results searching for how and how often to clean my kid’s new aquarium, for example. So much LLM and SEO crap info.


[–] [email protected] 4 points 1 year ago

@ajsadauskas Back when, UW Madison hosted an outfit called The Internet Scout Project that was in the curation business for web resources. The decaying state of search (alternatively the growth of web resources intended to serve interests other than their visitors') has me thinking it would be good to work with public libraries to convene and host this sort of thing.

Librarianship is the right sort of ethos for it, and libraries are infrastructure for human-mediated discoverability.

@degoogle

[–] [email protected] 4 points 1 year ago (1 children)

@ajsadauskas @degoogle I used to be one of those human editors. I was the editor of Scotland.org from about 1994 to about 1997, back in the days when it was exactly one of those hierarchical web directories – with the intention of indexing every website based in Scotland.

[–] [email protected] 4 points 1 year ago

@ajsadauskas @degoogle having said that, the patents on Google's PageRank algorithm have now all expired, and a distributed, co-op operated search engine would now be possible. Yes, there would be trust issues, and you'd need to build fairly sophisticated filters to identify and exclude crap sites, but it might nevertheless be interesting and useful.
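For reference, the core of PageRank as originally published is small enough to sketch in a few lines. The version below is a plain power iteration over a toy link graph with the commonly quoted damping factor of 0.85. It is an illustration of the published idea only, not what Google runs today, and the page names are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    # Collect every page, including ones that are only linked to.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: two pages point at "hub", "hub" points at "a".
toy_web = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The hard part of a co-op search engine would be everything around this: crawling, deciding who to trust, and filtering out link farms built to game exactly this kind of score.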

[–] [email protected] 3 points 1 year ago (1 children)

@ajsadauskas @degoogle

Yes to all. For a while I've been de facto using a minuscule subset of the web. My gateway to other, relevant websites is via human-to-human recommendations, primarily in a place like this.


[–] [email protected] 3 points 1 year ago (5 children)

@ajsadauskas @degoogle

It looks like there's a couple projects to continue the directory DMOZ. I hope they're sharing work with each other!


[–] [email protected] 3 points 1 year ago

@ajsadauskas @degoogle DMOZ was once an important part of the internet, but it too suffered from abuse and manipulation for traffic.

For many DMOZ was the entry point to the web. Whatever you were looking for, you started there.

Google changed that, first for the better, then for the worse.

[–] [email protected] 2 points 1 year ago

And now with LLMs, the industrial-scale spamming of the web is likely to grow exponentially.

True, but these things can also be used by us, to curate/maintain a high quality link collection. However, I'm not sure 'pages' will be read by humans in 5 years, so I have a feeling we won't need such a collection anymore. Well, not for humans but probably for our individual LLMs.

[–] [email protected] 2 points 1 year ago

@ajsadauskas @degoogle hopefully they don't look like Dmoz, because i still have unpleasant flashbacks of that dark time 😋

[–] [email protected] 2 points 1 year ago

Just to add to your list of steps and consequences: I also think academic studies about information retrieval, indexing and crawling became less popular. Aspiring students are hearing the message that those studies and fields of work will become obsolete once AI does all that.

[–] [email protected] 2 points 1 year ago

I remember a time when you could buy a paper magazine every other week with curated lists of links on various topics. There were ads, but just paper ads :)

[–] [email protected] 2 points 1 year ago (1 children)

@ajsadauskas @degoogle
I've already seen new webrings forming.

Or maybe that was old webrings updating?


[–] [email protected] 2 points 1 year ago

@ajsadauskas @degoogle a bit of history of Yahoo here, started as a web directory https://www.wired.com/1996/05/indexweb/

[–] [email protected] 2 points 1 year ago

What's to say we won't have AI-curated lists and directories? That way we don't have to deal with link rot and the like. I think the issue is the algorithms used for search. We need better ones, better AI, not more frivolous human labor.

[–] [email protected] 2 points 1 year ago

@ajsadauskas @degoogle Curation is elation.

[–] [email protected] 2 points 1 year ago (1 children)

@ajsadauskas @degoogle I mean we could still use all modern tools. I'm self-hosting a SearXNG instance, and there is currently an ever-growing block list for AI-generated websites that I regularly import to keep up to date. You could also run it as an allow list: block all websites by default and allow sites gradually.
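The block list versus allow list distinction can be sketched as a simple filter over result URLs. This is a generic illustration in plain Python, not SearXNG's own configuration or plugin API, and the domains in the example are made up.

```python
from urllib.parse import urlsplit


def host_matches(host, domains):
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_results(urls, blocked=frozenset(), allowed=None):
    """Drop result URLs by hostname.

    blocked: domains that are always removed (block list mode).
    allowed: if given, only these domains survive (allow list mode).
    """
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if allowed is not None and not host_matches(host, allowed):
            continue  # allow list mode: unknown sites are dropped
        if host_matches(host, blocked):
            continue  # block list mode: known-bad sites are dropped
        kept.append(url)
    return kept


# Hypothetical search results and lists.
results = [
    "https://good-forum.example/thread/1",
    "https://www.spam-farm.example/copied-answer",
    "https://unknown-site.example/page",
]

print(filter_results(results, blocked={"spam-farm.example"}))
print(filter_results(results, allowed={"good-forum.example"}))
```

With a block list, the unknown site still gets through; with an allow list it does not. That difference is the whole trade-off: a block list has to keep up with every new spam site, while an allow list has to keep up with every new good one.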

[–] [email protected] 3 points 1 year ago

@ajsadauskas @degoogle I started that because it bothered me that you couldn't just report a website to DuckDuckGo that obviously was a Stack Overflow scraper. This problem has persisted for as long as Reddit and Stack Overflow have been a thing. Why are there no measures from search engines to get a handle on it?

I never understood that.

[–] [email protected] 2 points 1 year ago (1 children)

@ajsadauskas @degoogle So, classic mid-90s Yahoo. Or LookSmart, which was initially curated by Reader's Digest.

