this post was submitted on 26 Jul 2023
868 points (96.5% liked)

Technology


Thousands of authors demand payment from AI companies for use of copyrighted works

Thousands of published authors are requesting payment from tech companies for the use of their copyrighted works in training artificial intelligence tools, marking the latest intellectual property critique to target AI development.

[–] joe 15 points 1 year ago (3 children)

All this copyright/AI stuff is so silly and a transparent money grab.

They're not worried that people are going to ask the LLM to spit out their book; they're worried that they will no longer be needed because an LLM can write a book for free. (I'm not sure this is feasible right now, but maybe one day?) They're trying to strangle the technology in the courts to protect their income. That is never going to work.

Notably, there is no "right to control who gets trained on the work" aspect of copyright law. Obviously.

[–] DandomRude 14 points 1 year ago

There is nothing silly about that. It's a fundamental question about using content of any kind to train artificial intelligence that affects way more than just writers.

[–] FlyingSquid 4 points 1 year ago (1 children)

I seriously doubt Sarah Silverman is suing OpenAI because she's worried ChatGPT will one day be funnier than she is. She just doesn't want it ripping off her work.

[–] joe 2 points 1 year ago (1 children)

What do you mean when you say "ripping off her work"? What do you think an LLM does, exactly?

[–] FlyingSquid 2 points 1 year ago (2 children)

In her case, taking elements of her book and regurgitating them back to her. Which sounds a lot like they could be pirating her book for training purposes to me.

[–] [email protected] 3 points 1 year ago (1 children)

How do you know they didn't just buy the book?

[–] FlyingSquid 0 points 1 year ago

Again, that's not relevant.

[–] joe 2 points 1 year ago (1 children)

Quoting someone's book is not "ripping off" the work.

[–] FlyingSquid 2 points 1 year ago (1 children)

How is it able to quote the book? Magic?

[–] joe 4 points 1 year ago (1 children)

So you're saying that as long as they buy 1 copy of the book, it's all good?

[–] FlyingSquid 1 points 1 year ago (1 children)

No, I'm not saying that. If she's right and it can spit out any part of her book when asked (and someone else showed that it does that with Harry Potter), it's plagiarism. They are profiting off of her book without compensating her. Which is a form of ripping someone off. I'm not sure what the confusion here is. If I buy someone's book, that doesn't give me the right to put it all online for free.

[–] joe 3 points 1 year ago (1 children)

It's not plagiarism if it says it's her book, lol.

What are your feelings on public libraries? And does it spit out the entire book, or just excerpts?

[–] FlyingSquid 1 points 1 year ago (1 children)

I don't think you understand what plagiarism is. When you profit off of someone else's work, you're plagiarizing. Libraries do not profit off of anything. OpenAI, however, is a for-profit endeavor.

[–] joe 1 points 1 year ago (1 children)

plagiarizing

This is taking someone's work and passing it off as your own. Did you not do a simple Google search when there was some doubt about the definition, like I just did?

[–] FlyingSquid 2 points 1 year ago (1 children)
[–] joe 0 points 1 year ago* (last edited 1 year ago) (1 children)

Did you read that?

Plagiarism can happen intentionally or unintentionally when a person uses another person's ideas or words without citing the original source. Here are four common forms of plagiarism:

  • Copying another person's words without using quotation marks or referencing the original source
  • Copying an author's words without using quotation marks but using accurate footnotes to the original source
  • Paraphrasing an author's ideas without including a reference to the original source
  • Rearranging an author's exact words, even if there is a footnote to the original source

Oh no, I plagiarized! lol

[–] FlyingSquid 1 points 1 year ago (1 children)

Copying another person’s words without using quotation marks or referencing the original source

ChatGPT can do that.

Copying an author’s words without using quotation marks but using accurate footnotes to the original source

ChatGPT can do that.

Paraphrasing an author’s ideas without including a reference to the original source

ChatGPT can do that.

Rearranging an author’s exact words, even if there is a footnote to the original source

ChatGPT can do that.

[–] joe -1 points 1 year ago (1 children)
[–] FlyingSquid 2 points 1 year ago (1 children)

Yes, can. It is capable of doing all those things and, again, if she is correct, will do so if prompted.

[–] joe 0 points 1 year ago (1 children)

I think this is nonsense, but you're saying the issue is that it doesn't use quotes when someone asks it to quote a passage from her book? Is that true?

[–] FlyingSquid 2 points 1 year ago (1 children)

Nope, again, the issue is that it can regurgitate the entire book if prompted. Why you think that's legal is beyond me. What if it had video? Should it be allowed to spit out all of Oppenheimer if prompted?

[–] joe 0 points 1 year ago* (last edited 1 year ago)

Can it?

Prompt:

quote me a passage from a popular Sarah Silverman book

Response:

As an AI language model, my knowledge was last updated in September 2021, and I do not have access to real-time data, including specific passages from books released after that date. Up until my last update, Sarah Silverman, the American comedian, actress, and writer, had not released a book.

However, Sarah Silverman has written a memoir titled "The Bedwetter: Stories of Courage, Redemption, and Pee," which was published in 2010. Since I don't have the contents of the book available to me directly, I can't provide a specific passage from it.

If you're interested in reading something from the book, I recommend checking it out from your local library or bookstore. Her memoir contains personal anecdotes and humorous stories, reflecting her unique comedic style and life experiences.

Edit: ChatGPT-3.5, if that matters.

[–] [email protected] 2 points 1 year ago (1 children)

Designing and marketing a system to plagiarize works en masse? That's the cash grab.

[–] joe 4 points 1 year ago (1 children)

Can you elaborate on this concept of a LLM "plagiarizing"? What do you mean when you say that?

[–] [email protected] 0 points 1 year ago (1 children)

What I mean is that it is a statistical model used to generate things by combining parts of extant works. Everything that it "creates" is a piece of something that already exists, often without the author's consent. Just because it is done at a massive scale doesn't make it less so. It's basically just a tracer.

Not saying that the tech isn't amazing or likely a component of future AI but, it's really just being used commercially to rip people off and worsen the human condition for profit.

[–] joe 3 points 1 year ago (1 children)

Everything that it “creates” is a piece of something that already exists, often without the author’s consent

This describes all art. Nothing is created in a vacuum.

[–] [email protected] 0 points 1 year ago (1 children)

No, it really doesn't, nor does it function like human cognition. Take this example:

Say I, personally, decide that I want to make a sci-fi show. I don't want to come up with ideas, so I try to do something that I know works. I take the scripts of Star Trek: The Search for Spock, Alien, and Earth Girls Are Easy and feed them into a database, separating the words into individual data entries with some grammatical classification. Then, using this database, I generate a script, averaging the length of the films, with every word chosen based upon its frequency of occurrence in the films (or randomized, if there's a tie). I go straight into production with "Star Alien: The Girls Are Spock". I am immediately sued by Disney, Lionsgate, and Paramount for trademark and copyright infringement, even though I basically just used a small LLM.

You are right that nothing is created in a vacuum. However, plagiarism is still plagiarism, even if it is using a technically sophisticated LLM plagiarism engine.
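The generation scheme described above (every word drawn independently, weighted by how often it occurred in the source scripts) can be sketched in a few lines of Python. This is a toy illustration of the thought experiment, with placeholder stand-in text, not how a real LLM works:

```python
import random
from collections import Counter

def train(scripts):
    """Count how often each word appears across the source scripts."""
    counts = Counter()
    for text in scripts:
        counts.update(text.lower().split())
    return counts

def generate(counts, length, seed=None):
    """Draw `length` words independently, weighted by source frequency."""
    rng = random.Random(seed)
    words = list(counts)
    weights = [counts[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=length))

# Placeholder stand-ins for the film scripts in the example.
scripts = ["the search for spock", "in space no one can hear you scream"]
counts = train(scripts)
print(generate(counts, 8, seed=1))
```

Every word in the output comes straight from the sources, which is the crux of the infringement claim in the example.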

[–] joe 1 points 1 year ago (1 children)

ChatGPT doesn't have direct access to the material it's trained on. Go ask it to quote a book to you.

[–] [email protected] 2 points 1 year ago (1 children)

That really doesn't make an appreciable difference. It doesn't need direct access to source data, if it's already been transferred into statistical data.

[–] joe -1 points 1 year ago (1 children)

It does rule out "plagiarism", however, since it means it can't pull directly from any training material.

I should have asked earlier: what do you think plagiarism is?

[–] [email protected] 2 points 1 year ago (1 children)

It really doesn't. The data is just tokenized and encoded into the model (with additional metadata).

If I take the following:

Three blind mice, three blind mice
See how they run, see how they run

And encode it based upon frequency:

1: {"word": "three", "qty": 2}
2: {"word": "blind", "qty": 2}
3: {"word": "mice", "qty": 2}
4: {"word": "see", "qty": 2}
5: {"word": "how", "qty": 2}
6: {"word": "they", "qty": 2}
7: {"word": "run", "qty": 2}

The original data is still present, just not in its original form. If I were then to use the data to generate a rhyme and claim authorship, I would both be lying and committing plagiarism, which is the act of attempting to pass someone else's work off as your own.
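That frequency table can be built in a couple of lines of Python (a toy mirror of the encoding in the example, not an actual LLM tokenizer):

```python
from collections import Counter

rhyme = ("three blind mice three blind mice "
         "see how they run see how they run")

# Number each distinct word in order of first appearance,
# mirroring the numbered entries in the example.
encoded = {
    i + 1: {"word": word, "qty": qty}
    for i, (word, qty) in enumerate(Counter(rhyme.split()).items())
}
print(encoded[1])  # {'word': 'three', 'qty': 2}
```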

Out of curiosity, do you currently make, or intend to make, money using LLMs? I ask because I'm wondering if this is an example of Upton Sinclair's statement "It is difficult to get a man to understand something when his salary depends on his not understanding it."

[–] joe 0 points 1 year ago* (last edited 1 year ago) (1 children)

That's not how LLMs work, and no, I have no financial skin in the game. My field is software QA; I can't nail down whether it would affect me or not, because I could imagine it going either way. I do know that it doesn't matter-- legislation is not going to stop this-- it's not even going to do much to slow it down.

What about you? I find that most of the hysteria around LLMs comes from people whose jobs are on the line. Does that accurately describe you?

Edit: typos

[–] [email protected] 2 points 1 year ago (1 children)

It is not literally how they work, no, but it is an oversimplified approximation. Data is encoded into mathematical functions in neural network nodes, but it is still encoded data, in the same way that an MP3 and a WAV of a song are both still the song; the neural network is the medium.

Just because the data is stored in a different, possibly more efficient manner doesn't mean that it is not there for all intents and purposes (I suppose one could argue that it has been transformed into metadata, but if the model can reconstruct the work verbatim, that argument seems like a fallacy). Nor is it within the fair-use exemptions of most IP laws to use others' copyrighted, trademarked, or copyleft data to power a commercial product in ways contrary to its licensing terms.

As for my job, well, yes, I do have some anxieties in that area, but as a software engineer focused on automation, tooling, and security, I suspect that my position is fairly secure. I would hope yours is too, both for yourself and for overall software quality. Likely there will be more demand for both of our skillsets with the CRA.

[–] joe 0 points 1 year ago (1 children)

Data is encoded into mathematical functions in neural network nodes but, it is still encoded data in the same way that an MP3 and WAV of a song are both still the song; the neural network is the medium.

Here: https://www.understandingai.org/p/large-language-models-explained-with

It's not plagiarism by any definition of the word that makes sense. While the analogy may not be literal, it is perfectly analogous to suggest that learning new words from a Harry Potter book means any book you write going forward plagiarizes JK Rowling; the training data helps map the words in the model-- it's never used as a blueprint when predicting what word comes next in any given scenario.

It's even further from copyright infringement-- there is no limited right granted that allows an IP holder to say how that IP can be processed. That's just not a thing. You'd have just as much of a leg to stand on if you suggested that Stephen King had the right to prevent people from reading his books in a room with green walls. You can't just make up new rights. And trademark law is totally insane; I don't know why you even mention it, since it doesn't even have the same goals as the others.
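To make "predicting what word comes next" concrete, here is a toy bigram predictor in Python. It is vastly simpler than a real LLM (which uses a neural network, not a lookup table), but it shows the sense in which training text shapes word statistics rather than being consulted verbatim:

```python
import random
from collections import Counter, defaultdict

def fit_bigrams(text):
    """Record how often each word follows each preceding word."""
    model = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def predict_next(model, word, seed=None):
    """Sample the next word in proportion to how often it followed `word`."""
    rng = random.Random(seed)
    followers = model[word]
    choices = list(followers)
    weights = [followers[w] for w in choices]
    return rng.choices(choices, weights=weights, k=1)[0]

model = fit_bigrams("the cat sat on the mat and the cat slept")
print(predict_next(model, "the", seed=0))  # one of: cat, mat
```

After training, the model holds only counts like {"the": {"cat": 2, "mat": 1}}; generation samples from those counts rather than re-reading the original sentence.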

as a software engineer

I am not so sure that this specific role is in any way secure, myself. You may come to the same conclusion after reading that link I provided-- pay attention to how rapidly the LLMs are growing in complexity. I do not wish for anyone to lose their financial security, even a stranger like you, but I can't help but look at the available information and come to that conclusion.

[–] [email protected] 1 points 1 year ago (1 children)

there is no limited right granted that allows a IP holder to say how that IP can be processed.

There very much is. Literally all intellectual property law concerns how intellectual property may or may not be used and licensed. For example, one may not record and sell a cover of a song that is in copyright without explicit permission in the form of a mechanical license. In our industry, one may not use code that is covered by a GNU GPL license without fulfilling the source code distribution requirements (see: IBM RedHat drama).

The training data is what gives the LLM its value in the problematic situations, so it is very clear that the material is a key component of the business plan and commercial use. This is not an educational, parody, or other exempt fair-use activity. This means that if any data used for training is not licensed appropriately, such use is a clear violation of intellectual property laws, even if not explicitly covered, since the technology did not exist when those laws were written.

I am not so sure that this specific role is in any way secure, myself. You may come to the same conclusion after reading that link I provided-- pay attention to how rapidly the LLMs are growing in complexity. I do not wish for anyone to lose their financial security, even a stranger like you, but I can't help but look at the available information and come to that conclusion.

I do agree that there are software engineering jobs at risk in the short term, due to management's desire to cut labor while riding the hype train, as well as US taxation on R&D. But, given the widespread failures found when companies have replaced engineers and others, I have been expecting a wave of desperate re-hiring to occur 1-3 years after the layoffs. The particular segment that I'm involved in is generally considered high-ROI, so it is likely less vulnerable (but there's no guarantee).

I don't see how QA could sanely be replaced, though; from my experience, it's already frequently under-funded. And, as I mentioned, for all the bad in the CRA drafts, one of the positives is that QA-related work is going to be mandatory for software and devices sold in the EU market.

[–] joe 1 points 1 year ago (1 children)

Sorry about the late reply-- I try my best to stay mostly disconnected from the internet on the weekends.

Literally all intellectual property law concerns how intellectual property may or may not be used and licensed.

True, but no IP law gives the IP holder the power you're trying to give them. That is what I'm saying. It would require the law to be changed. There is no aspect of IP law that says you aren't allowed to use a text to train anyone, let alone an LLM.

The training data is what gives the LLM value in the problematic situations so, it is very clear that the material is a key component in the business plan and commercial use.

This does not matter. If I read a book on Six Sigma business practices and then use that knowledge to better structure my business and increase my profits, I don't owe the author of the book anything. You're, again, trying very hard to give away your own rights in order to stick it to LLMs. I'm positive IP rights holders would love this new right you want to give them. Perhaps reconsider the implications, though: simply making money off of the information found in a book does not give the author rights to that money.

Let me ask you this. If you have an epub of a book on your computer and you select it and press Ctrl-C, Ctrl-V-- have you violated copyright laws? You've made a copy, after all.

[–] [email protected] 1 points 1 year ago

No worries! Definitely important to have healthy relationships with device usage.

Your statements on the rights of IP owners seem to imply that the vast majority of open-source licenses are meaningless. They also seem to parallel the legal cases brought by the family of Henrietta Lacks, whose cells were cultured without her consent and have been used extensively in research and pharmaceutical development, bringing in significant profits, while neither Ms. Lacks nor her family saw a dime.

Not 1:1, as the Lacks case involves human subjects and biomedical research, but it certainly rhymes, as there is the same lack of consent and the same unshared profits.

Let me ask you this. If you have a epub of a book on your computer and you select it and press Ctrl-C, Ctrl-V-- have you violated copyright laws? You've made a copy, after all

Depending on the use, possibly. If I intended to sell copies of it, almost definitely. Likewise if I intended to create derivative art that did not fall within the bounds of fair use, without attributing credit or obtaining a license from the holder of the copyright.

Speaking from an ethical, rather than purely legal perspective, profiting off of training an LLM or similar neural network on someone else's work, in a manner that competes with the source work, without their permission or giving them a share of the proceeds, is hard to imagine as ethical in any manner that does not involve extraordinary mental gymnastics.

On the other hand, I would not see anything wrong with doing so for one's personal enjoyment, if there is no harm done to the IP owner.