this post was submitted on 09 Jul 2023
Technology
This is exactly the problem. Months ago I read that AI could have free access to all public source code on GitHub without respecting its licenses.
So many developers have decided to abandon GitHub for alternatives, not realizing that in the end AI training can just as easily access their public repos on other platforms.
What should be done is to regulate this training. That is not convenient for companies, though, because the more data the AI ingests, the more its knowledge expands and "helps" the people who ask for information.
Is it practically feasible to regulate the training? Is it even necessary? Perhaps it would be better to regulate the output instead.
It will be hard to know whether any particular GET request is ultimately used to train an AI or to train a human. By contrast, it's currently easy to see whether a particular output is plagiarized (for example with https://plagiarismdetector.net/), and that is also much easier to enforce. We don't need to care whether or how any particular model plagiarized work. We can just check whether plagiarized work was produced.
That could be implemented directly in the software, so that it doesn't even output plagiarized material. The legal framework around it is also clear and fairly established. Instead of creating regulations around training, we can use the existing regulations around the human who tries to disseminate copyrighted work.
That's also consistent with how we enforce copyright for humans. There's no law against looking at other people's work and memorizing entire sections. It's also generally legal to reproduce other people's work (e.g. for backups). It only potentially becomes illegal if someone distributes it, and it's only plagiarism if they claim it as their own.
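To make the output-side check above a bit more concrete, here is a rough sketch of what "don't output plagiarized material" could look like in software. It is only an illustration: the function names, the 5-word window and the 0.5 threshold are all made up for this example, and a real detector would need something far more robust than exact word n-gram matching.

```python
def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate, reference, n=5):
    """Fraction of the candidate's n-grams that also appear in the reference."""
    cand = ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(cand & ngrams(reference, n)) / len(cand)


def filter_output(candidate, protected_works, n=5, threshold=0.5):
    """Return the candidate text, or None if it overlaps too much
    with any of the protected works (hypothetical policy)."""
    for work in protected_works:
        if overlap_ratio(candidate, work, n) >= threshold:
            return None
    return candidate


# Hypothetical usage: the protected text and the candidates are invented.
protected = ["the quick brown fox jumps over the lazy dog near the river bank"]
copied = "the quick brown fox jumps over the lazy dog"
original = "a completely different sentence about training models on public data"

blocked = filter_output(copied, protected)
allowed = filter_output(original, protected)
```

Note that this only looks at what comes out, not at how the model was trained, which is the whole point of regulating the output side.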
This makes perfect sense. Why aren’t they going about it this way then?
My best guess is that maybe they just see OpenAI being very successful and want a piece of that pie? Cause if someone produces something via ChatGPT (let’s say for a book) and uses it, what are the chances they made any significant amount of money that you can sue for?
It's hard to guess what the internal motivation is for these particular people.
Right now it's hard to know who is disseminating AI-generated material. Some people are explicit when they post it but others aren't. The AI companies are easily identified, and there's at least the perception that regulating them can solve the problem of copyright infringement at the source. I doubt that's true. More and more actors are able to train AI models and some of them aren't even under US jurisdiction.
I predict that we'll eventually have people vying to get their work used as training data. Think about what that means. If you write something and an AI is trained on it, the AI considers it "true". Going forward when people send prompts to that model it will return a response based on what it considers "true". Clever people can and will use that to influence public opinion. Consider how effective it's been to manipulate public thought with existing information technologies. Now imagine large segments of the population relying on AIs as trusted advisors for their daily lives and how effective it would be to influence the training of those AIs.