Technology

Leaked Documents Show Nvidia Scraping ‘A Human Lifetime’ of Videos Per Day to Train AI (www.404media.co)

submitted 5 months ago by [email protected] to c/technology

72 comments

https://archive.is/2024.08.05-162750/https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/

[–] [email protected] 2 points 5 months ago (1 children)

I'm saying using code for training is a different issue than copyright infringement. I edited my post above to better lay out my position.

[–] trollbearpig -1 points 5 months ago* (last edited 5 months ago) (1 children)

And that's the whole point of my comment, did you even read it? To summarize, there is currently a loophole in the law that allows these bullshit arguments about it being different than straight up copying shit (though this hasn't been litigated yet, so it's not clear whether these arguments are actually valid). This means that while a person reading my AGPL code and copying it (without following the license) is 100% illegal, doing the same through an LLM may be legal. So open source licenses can be bypassed by first training an LLM on the code and then extracting the code from the LLM. This is terrible for open source, and in general for anyone who wants to make a living from creating copyrighted work. So we should close this loophole, and I'm glad there is a push to close it through better laws, even if these laws are coming from Disney, Sony, and all those awful companies.

So again, what's the point you are trying to make here? That we shouldn't make these laws stronger to prevent this bullshit? I honestly don't understand what you are trying to argue here. Nothing you have said has anything to do with this conversation.

[–] [email protected] 2 points 5 months ago (1 children)

That we already have laws that protect against copyright infringement (which seem like they would still apply whether or not the code was spit out by an LLM), and no more should be made. That training on public data is fine.

[–] trollbearpig -1 points 5 months ago (1 children)

Any arguments to defend your position? I'm giving you a very clear example of the awful consequences of following that path. And the same applies to any creative work. You are just being dismissive without proposing any real solution. Do better, man.

[–] [email protected] 2 points 5 months ago

The EFF link I posted above provides evidence. Again, here's a quote from part of it:

The process of machine learning for generative AI art is like how humans learn—studying other works—it is just done at a massive scale. Huge swaths of data (images, videos, and other copyrighted works) are analyzed and broken into their factual elements where billions of images, for example, could be distilled into billions of bytes, sometimes as small as less than one byte of information per image. In many instances, the process cannot be reversed because too little information is kept to faithfully recreate a copy of the original work.

As I mentioned before, Copilot, at least, helps people avoid copyright infringement by notifying them if their code is similar to public code. The solution I'm proposing is no new laws, just enforcing the ones we have. Most of the laws being proposed look like attempts at regulatory capture to me.