this post was submitted on 21 Jan 2025
159 points (99.4% liked)

Fuck AI

1781 readers
354 users here now

"We did it, Patrick! We made a technological breakthrough!"

A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.

founded 10 months ago
all 16 comments
[–] [email protected] 46 points 1 week ago (1 children)

Oh hell yeah.

Months ago I was brainstorming something almost identical to this concept: use a reverse proxy to serve pre-generated AI slop to AI crawler user agents while serving the real content to everyone else. Looks like someone did exactly that, and now I can just deploy it. Fantastic.
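The reverse-proxy part can be as small as a user-agent map. This is a generic nginx sketch of the idea, not iocaine's documented setup; the ports and the exact bot list are assumptions.

```nginx
# Sketch only: route known AI crawler user agents to a local
# garbage generator, everyone else to the real site.
map $http_user_agent $backend {
    default      "http://127.0.0.1:8080";  # real content
    ~*GPTBot     "http://127.0.0.1:9000";  # garbage generator
    ~*CCBot      "http://127.0.0.1:9000";
    ~*ClaudeBot  "http://127.0.0.1:9000";
}

server {
    listen 80;
    location / {
        proxy_pass $backend;
    }
}
```

Anything that spoofs a browser user agent slips through this, so a real deployment would combine it with other signals.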

[–] [email protected] 8 points 1 week ago (1 children)

AI slop is actually better than random data, because it creates a feedback loop, which is more destructive.

[–] [email protected] 4 points 1 week ago (1 children)

If you use natural text to train model A, and then use model A's output to train model B, model B's output will be worse than model A's. The quality degrades with each generation, but it happens over generations of models. So random data is worse than AI slop, because random data already starts at the lowest possible quality for AI training.

[–] [email protected] 1 points 1 week ago

Yes, but random data might be easier to detect in the first place, and could then be filtered.

[–] [email protected] 33 points 1 week ago

Poison the AI. I'm all for it.

[–] [email protected] 13 points 1 week ago

Why is no one talking about the fact that the demo is clearly using the Bee Movie script to power the Markov chain generation?

This thing spits out some gold:

Honey, it changes people.

I'm taking aim at the baby.

[–] arken 11 points 1 week ago (1 children)

Would this interfere with legitimate crawlers as well, such as the Internet Archive?

[–] RememberTheApollo_ 1 points 1 week ago (1 children)

Could you list specific crawlers to be automatically blocked by the iocaine site?

[–] [email protected] 11 points 1 week ago (1 children)

So it's like nightshade for LLMs?

[–] [email protected] 14 points 1 week ago

Better, actually. This feeds the crawler a potentially infinite amount of nonsense data. If not caught, this will fill up whatever storage medium is used. Since the data is generated using Markov chains, any LLM trained on it will learn to disregard context that goes farther back than one word, which would be disastrous for the quality of any output the LLM produces.

Technically, it would be possible for a single page using iocaine to completely ruin an LLM, whereas with Nightshade you'd have to poison quite a number of images. On the other hand, iocaine's text is easily detected by a human, while Nightshade is designed to be unnoticeable to humans.

[–] [email protected] 5 points 1 week ago (1 children)

I'm not sure I fully understand.

This generates garbage if it thinks the client making the request is an AI crawler. That much I get.

What I don't understand is when it talks about trapping the crawler. What does that mean?

[–] [email protected] 23 points 1 week ago

Simply put, a crawler reads a site, notes all the links on it, then reads all of those pages, notes all the links there, reads those, and so on. A site running iocaine only ever links to internal, randomly generated resources, which in turn link only to other randomly generated resources, trapping any crawler that has no properly configured exit condition.
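The trap described above can be sketched in a few lines with the standard library. This is an illustration of the mechanism, not iocaine's implementation: every response is invented on the fly and links only to more invented internal paths.

```python
# Sketch of a crawler trap: every page links only to fresh random
# internal URLs, so a naive breadth-first crawler never runs out.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

def random_path():
    """An 8-letter lowercase path that almost certainly doesn't repeat."""
    return "/" + "".join(random.choices(string.ascii_lowercase, k=8))

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Five links per page -> the frontier grows 5x per crawl depth.
        links = "".join(
            f'<a href="{random_path()}">more</a> ' for _ in range(5)
        )
        body = f"<html><body><p>garbage</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

# To run the trap:
# HTTPServer(("127.0.0.1", 8000), TrapHandler).serve_forever()
```

A well-behaved crawler caps its depth or per-domain page budget; the trap only punishes crawlers that don't.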

[–] [email protected] 3 points 1 week ago

How hard would it be for a sufficiently sophisticated bot to detect the intention here and add the domain to a shared blacklist? I would imagine not too difficult. Good idea, though. The start of something potentially great.

[–] [email protected] 2 points 1 week ago (1 children)

Don't these crawlers save some kind of metadata before fully committing pages to their databases? Surely they'd be able to see that a specific domain served nothing but garbage (and/or that it's so "basic"), and then blacklist the domain or purge the data? Or are the AI crawlers even dumber than I'd imagine?

[–] Hackworth 5 points 1 week ago* (last edited 1 week ago)

I'd be surprised if anything crawled from a site using iocaine actually made it into an LLM training set. GPT-3's initial crawl of 45 terabytes was filtered down to the 570 GB it was actually trained on. So yeah, there's a lot of filtering/processing that takes place between crawl and train. Then again, they seem to have failed entirely to clean the Reddit data they fed into Gemini, so /shrug
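The crawl-to-train filtering mentioned above could plausibly catch order-1 Markov text with very cheap statistics. This is a hypothetical heuristic for illustration, not any lab's real pipeline: chain output recycles its training bigrams, so the same word pairs repeat far more often than in natural prose of the same length.

```python
from collections import Counter

def looks_generated(text, max_repeat_ratio=0.2):
    """Flag text whose word-pair (bigram) repetition rate is suspiciously high.

    Heuristic assumption: Markov-chain output of any length reuses a small
    pool of bigrams, while natural prose keeps introducing new ones.
    """
    words = text.lower().split()
    if len(words) < 2:
        return False
    bigrams = Counter(zip(words, words[1:]))
    repeats = sum(count - 1 for count in bigrams.values())
    return repeats / (len(words) - 1) > max_repeat_ratio
```

Real pipelines reportedly use much heavier signals (model perplexity, dedup, classifiers), so this is only meant to show how little it might take to spot the simplest version of the trick.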