this post was submitted on 04 Sep 2024
185 points (96.0% liked)

ChatGPT

Unofficial ChatGPT community to discuss anything ChatGPT

founded 1 year ago
all 30 comments
[–] [email protected] 48 points 2 months ago (4 children)

57% of all content is AI generated?? Hard to believe tbh.

[–] [email protected] 37 points 2 months ago (1 children)

Are we maybe talking about 57% of newly created content? Because I also have a very hard time believing that LLM generated content already surpassed the entire last few decades of accumulated content on the internet.

[–] [email protected] 15 points 2 months ago* (last edited 2 months ago)

I'm too dumb to understand the paper, but it doesn't feel unlikely that this is a misinterpretation.

What I've figured out:

  • They're exclusively looking at text.
  • Translations are an important factor. Lots of English content is taken and (badly) machine-translated into other languages to grift ad money.

What I can't quite figure out:

  • Do they only look at translated content?
  • Is their dataset actually representative of the whole web?

The actual quote from the paper is:

Of the 6.38B sentences in our 2.19B translation tuples, 3.63B (57.1%) are in multi-way parallel (3+ languages) tuples

And "multi-way parallel" means translated into multiple languages:

The more languages a sentence has been translated into (“Multi-way Parallelism”)

But yeah, I have no idea what their "translation tuples" actually contain. They seem to do some deduplication of sentences, too. In general, quoting that 57.1% without any of the context feels like a massive oversimplification.
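
Going only by the quote above, the figure looks like a share of sentences that sit in tuples covering 3+ languages, out of all sentences in translation tuples. Here is a toy sketch of that kind of calculation. The data, the dict-per-tuple layout, and the name `multiway_share` are all made up for illustration; this is not the paper's code or dataset.

```python
# Hypothetical toy data: each dict is one "translation tuple", mapping a
# language code to that language's version of the sentence.
# None of this comes from the paper.
translation_tuples = [
    {"en": "Hello", "de": "Hallo"},
    {"en": "Thanks", "fr": "Merci", "es": "Gracias"},
    {"en": "Good night", "de": "Gute Nacht", "fr": "Bonne nuit", "it": "Buonanotte"},
    {"en": "Yes", "es": "Sí"},
]


def multiway_share(tuples, min_languages=3):
    """Fraction of sentences that sit in tuples with at least min_languages languages."""
    # Denominator: every sentence that appears in any translation tuple.
    total = sum(len(t) for t in tuples)
    # Numerator: sentences in tuples that cover min_languages or more languages.
    multiway = sum(len(t) for t in tuples if len(t) >= min_languages)
    return multiway / total if total else 0.0


share = multiway_share(translation_tuples)
print(f"{share:.1%} of sentences are in multi-way parallel tuples")
# prints "63.6% of sentences are in multi-way parallel tuples" (7 of 11)
```

Note that the denominator only counts sentences that have a translation at all. Untranslated text never enters the calculation, so a number like this says nothing about "all content on the web".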

[–] chonglibloodsport 6 points 2 months ago

I think if you include scraped/plagiarized SEO spam “content” then I totally believe it. The amount of that crap flooding the internet is staggering. Search is just becoming more and more useless every day.

[–] [email protected] 27 points 2 months ago* (last edited 2 months ago) (1 children)

If current copyright law dies at the hand of AI then so be it.
Cause it desperately needs to die.

[–] [email protected] 28 points 2 months ago* (last edited 2 months ago) (2 children)

Not like this. Not like this.

Independent creators need some sort of protection from giant corporations.

[–] [email protected] 17 points 2 months ago

Copyright isn't meant to help independent creators. At least not small ones. You have to pursue legal action against people to enforce it. Small creators do not have the money for that.

[–] pennomi 3 points 2 months ago

Small creators have far more to gain than lose by loosening copyright regulations. Hell, I know multiple artists whose primary source of income is illegal fanart.

[–] CarbonAlpine 23 points 2 months ago (1 children)
[–] [email protected] 2 points 2 months ago
[–] gmtom 16 points 2 months ago (2 children)

Am I stupid or are the two statements in the title completely unrelated?

[–] [email protected] 4 points 2 months ago

Ironically this seems like an AI post lmao

[–] elbarto777 4 points 2 months ago

You are not stupid.

[–] [email protected] 14 points 2 months ago

AI trainers curate the data they use for training. We've gone past the phase where people just dump Common Crawl onto a neural net and tell it "figure that out somehow!" That worked back when we had no idea what we were doing or what would produce passable results; nowadays we know what produces better results. "Model collapse" has been known as a potential problem for years. The studies demonstrating it use unrealistic training methodologies to force it to extremes, whereas real training works to avoid it.

And finally, that "57% of content is AI-generated!" headline that's been breathlessly spamming all the feeds? Grossly misleading, of course. The actual study found that 57% of the translated content in their sample had been translated into three or more languages, which they interpreted as meaning it had been AI-translated.

People are so eager to click on "AI sucks and is dying!" headlines.

[–] theRealBassist 5 points 2 months ago

The headline has literally nothing to do with the paper it is citing.

The paper is specifically looking at machine translation, not generation.

Nowhere does it state that 57% of content is AI generated.

[–] TommySoda 5 points 2 months ago

Well, if plagiarizing the entire internet is required, the least these models could do is provide sources. Hold AI to the same standards as everything else. AI can't create anything new, so it might as well tell me where it's coming from. That way, if an AI model is telling me to eat rocks, I can at least see that it's from a Reddit post. And if it creates an image based on a prompt, I'd like to know which images it used to generate the "new" image. With AI sucking up and regurgitating more and more of the internet, it's only going to get more difficult to determine what's factual information.

Hopefully this will be resolved before someone that isn't very tech savvy gets hurt or worse because they decided to trust Google's AI search. People like your parents or grandparents that barely understand how a smartphone works and are now being fed potentially dangerous information with no sources of where it came from.

[–] [email protected] 3 points 2 months ago (1 children)

if it's impossible to train AI without abusing copyright why do I see job postings looking for writers to train AI 🤔

[–] TommySoda 5 points 2 months ago

Because these models sucked up everything else and they are still hungry. When they said there would be new jobs, working for AI wasn't quite what I had imagined. I thought it was supposed to be the other way around.

[–] [email protected] 2 points 2 months ago* (last edited 2 months ago)