this post was submitted on 14 Jul 2023
122 points (93.0% liked)

Technology


This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.


 

Shit in -> shit out 📤

kromem 25 points 1 year ago* (last edited 1 year ago)

I think people may be confused about what this is saying, so an example might help.

Remember when Stable Diffusion first came out and you could spot AI-generated images as if they killed your father and should be prepared to die?

Had those six+ digit monstrosities been fed back into training the model, you'd have quickly ended up with models generating images with even worse hands from hell.

But looking at a study like this and worrying about AI-generated content poisoning the Internet for future training is probably overblown.

That's because AI content doesn't just find its way onto the web directly, the way it does in this (and the Stanford) study. Often a human is selecting from multiple outputs to decide what to post, and even when output is posted directly, humans are voting content up or down based on perceived quality.

So practically, if models were being trained recursively on popular online content that had been generated by AI, that content wouldn't be the kind that pushes a model to overfit on spider hands, striped faces, misshapen pupils, repetitive text, broken links, or any number of other issues generative AI currently has.

Because human review of generated content is expensive, this and the previous paper don't replicate the circumstances that real-world recursive training on a mixed human and AI Internet would involve, and the issues that arose will likely be significantly muted outside the lab.
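To make the filtering argument concrete, here's a toy sketch in Python. It is not the method from either paper. The `simulate` function, the `amplification` factor (how much a retrained model exaggerates the defects it saw), and the `filter_recall` value (the share of defective outputs humans weed out before posting) are all made-up assumptions for illustration, and the model is deterministic expected-value arithmetic rather than real training.

```python
def simulate(generations, defect_rate, amplification, filter_recall):
    """Track a hypothetical model's defect rate across recursive training rounds.

    Toy assumptions (not from the papers):
      - each round, the model is retrained only on its own posted outputs;
      - humans remove a `filter_recall` fraction of defective outputs first;
      - the retrained model's defect rate is `amplification` times the
        defect share of what it was trained on, capped at 1.0.
    """
    history = [defect_rate]
    for _ in range(generations):
        defective = defect_rate * (1.0 - filter_recall)  # defects that slip through
        clean = 1.0 - defect_rate                        # good outputs, all posted
        posted = defective + clean
        if posted == 0:
            # Nothing survived review, so there is nothing new to train on.
            history.append(defect_rate)
            continue
        defect_rate = min(1.0, amplification * defective / posted)
        history.append(defect_rate)
    return history


# Lab-style setup: every output goes straight back into training.
unfiltered = simulate(generations=10, defect_rate=0.1,
                      amplification=1.5, filter_recall=0.0)

# Web-style setup: humans catch 90% of the six-fingered outputs.
filtered = simulate(generations=10, defect_rate=0.1,
                    amplification=1.5, filter_recall=0.9)

print("unfiltered:", [round(r, 3) for r in unfiltered])
print("filtered:  ", [round(r, 3) for r in filtered])
```

With these invented numbers the unfiltered run climbs until every output is defective, while the filtered run trends toward zero. The exact values mean nothing; the point is only that the direction flips once a selection step sits between generation and training.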

TL;DR: Humans filtering out six-fingered outputs (and similar) as time goes on is a critical piece of the practical infrastructure that isn't being represented, and this is simply a cautionary tale against directly piping too much synthetic data back into training.