this post was submitted on 13 Jan 2024
Fediverse
Radical and altogether stupid idea (but a fun thought) is this:
Were lemmy to have a certain percentage of AI content seamlessly incorporated into its corpus of text, it would become useless for training LLMs on (see this paper for more technical details on the effects of training LLMs on their own outputs, a phenomenon called "model collapse").
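For anyone who wants to see the "training on your own outputs" effect without an LLM, here is a toy sketch. It is not the paper's experiment: the "model" is just a Gaussian that gets refit to its own samples each generation, and the function name, sample size and generation count are all made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Toy stand-in for model collapse (hypothetical helper, not from the paper).

    Each generation draws samples from the current Gaussian "model" and
    refits the mean and standard deviation to those samples alone, so
    every generation learns only from the previous generation's output.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        # Maximum-likelihood estimate of the spread. With small samples it
        # tends to underestimate, and the errors compound across generations.
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 50, 100, 200):
        print(f"generation {generation:3d}: sigma = {history[generation]:.3g}")
```

The fitted spread drifts towards zero over the generations, i.e. the tails of the original distribution get forgotten. A real LLM is obviously far more complicated, so treat this as intuition only.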
In effect this would sort of "poison the well", though given that we all drink the water, the hope would be that our tolerance for a mild amount of AI corruption would be higher than an LLM creator's.
This poisoning approach amusingly benefits from being a thing that could be advertised heavily, basically saying "lemmy is useless for training LLMs, don't bother with it".
Now I must say that personally I don't really think this is a sensible or viable strategy, and I think the well is already poisoned in this regard (there is already a non-negligible amount of LLM-sourced content on lemmy). But yes, a fun approach to consider: trading integrity for privacy.
Those "@-@ tailed jackrabbits" in your link made me laugh. Emoticons in species names? Why not?
I think that we could minimise the loss of integrity if the data is "contained" in a way that your typical user wouldn't see it but bots would still retrieve it for model training.
And we don't need to restrict ourselves to using LLM-sourced data for that. Model collapse boils down to the amount of garbage piling up over time; if we use plain garbage we can make it even worse, as long as the garbage isn't detected as such.
Yeah as an ecologist that same thing made me giggle. I suppose why not the lesser-spotted 🍆warbler :P
In terms of exposing it only to bots, that is the frustration: unless you make it seamless, it becomes kinda trivial to mitigate. The approach I'd take to mitigate it is to adapt a lemmy client that already does the filtering, or to reverse-engineer the deciding element of the app. Similarly, if you use garbage then it needs to look enough like normal words to be hard to classify as AI generated.
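To show how trivial that mitigation can be, here is a minimal sketch. It assumes (purely hypothetically, this is not how lemmy renders anything) that the bot-only text is hidden from users with a `hidden` attribute or an inline `display:none` style; the class and function names are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collects only the text a human reader would see (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden element
        self.chunks = []

    @staticmethod
    def _is_hidden(attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return "hidden" in attrs or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden element, count every nested element too.
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with hidden elements dropped and whitespace collapsed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    page = '<p>Real comment.<span style="display: none">zxqv garbage</span> More text.</p>'
    print(visible_text(page))
```

A scraper that does this gets exactly what the users see, which is why the poison has to sit in the visible text to have any effect. Hiding via external stylesheets or scripts would take more work to strip, but a headless browser handles that too.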
The funny thing is that LLMs are not actually much good at telling whether something is AI generated; you need to train another model to do that, but to train that model you need good sources of non-corrupt data. Also, the whole point of generative AI language models is that they are actively trying to pass that test by design, so it becomes an arms race that they can never really win!
Man, what a shitshow generative AI is