this post was submitted on 30 Dec 2023
20 points (85.7% liked)

Journalism

For all things related to the industry and practice of journalism


In the end, though, the crux of this lawsuit is the same as all the others. It’s a false belief that reading something (whether by human or machine) somehow implicates copyright. This is false. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real world problems.

Part of the Times complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. Common Crawl is a fantastic resource run by some great people (though the lawsuit here attacks them).
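For context on what Common Crawl actually exposes: it publishes each crawl as an archive plus a queryable public index, which is how researchers find captures of a given site. A minimal sketch of building an index query URL follows; the endpoint pattern and the `CC-MAIN-2023-50` crawl label are assumptions based on Common Crawl's public index service, not details from the article.

```python
# Sketch: build a query URL for Common Crawl's public CDX-style index.
# Endpoint pattern and crawl label are assumptions, not from the article.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://index.commoncrawl.org"

def cc_index_query(crawl: str, url_pattern: str) -> str:
    """Return an index query URL listing captures matching url_pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{CDX_ENDPOINT}/{crawl}-index?{params}"

if __name__ == "__main__":
    # Ask which pages under nytimes.com a given crawl captured.
    print(cc_index_query("CC-MAIN-2023-50", "nytimes.com/*"))
```

Fetching that URL (e.g. with `requests`) returns one JSON record per capture, pointing into the underlying archive files.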

But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.

(Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise I imagine that they would be a defendant in this case as well).
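For what it's worth, the exclusion mechanism here is mundane: Common Crawl's crawler identifies itself as CCBot and respects robots.txt, so a publisher can opt out with a two-line rule. A generic sketch (not the Times' actual file):

```
# robots.txt at the site root: blocks Common Crawl's crawler site-wide
User-agent: CCBot
Disallow: /
```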

Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not a right limited by copyright. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are now rights that simply do not exist.

top 7 comments
[–] FireTower 6 points 10 months ago (2 children)

The author, in my opinion, misrepresents the stance of the NY Times here.

It’s a false belief that reading something (whether by human or machine) somehow implicates copyright.

The Times' issue isn't just that someone (or something) is reading its materials. The Times takes issue with a group intentionally collecting large amounts of its data (in this case articles) en masse, with the intention of packaging it into a product distributed to third parties engaged in commercial activities, all without paying a licensing fee. The Times fears that this damages the potential market for its past and future articles.

In essence, the Times fears that Common Crawl is acting as a fence for other groups to infringe on its copyrighted works.

Factors of Fair Use:

  1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.
  2. The nature of the copyrighted work.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
  4. The effect of the use upon the potential market for or value of the copyrighted work.
[–] PastaGorgonzola 5 points 10 months ago (1 children)

I see what you mean, but I thought copyright was a protection against copying something (even with some modifications).

Techdirt traditionally has a very clear view on copyright and its restrictions, so I am familiar with their bias. Their argument here boils down to the difference between copying something and learning from something. If reading something and learning from it is copyright infringement, any educational institution should be very worried, because that's exactly what goes on there.

I do understand the difference between a student reading dozens or hundreds of NYT articles (for free in the library) and a computer program doing the same, but for orders of magnitude more articles. So I'm curious to see how this is going to turn out.

[–] FireTower 4 points 10 months ago

You raise an interesting issue with learning. I would say that as humans we have the capacity to add creative input to our works, whereas a program can only restructure and regurgitate the information entered into it.

And to be clear, I don't oppose the creation of AI in general. I just think that creatives, especially independent artists, deserve to be justly compensated if they choose to allow AI to train on their works.

[–] littlebluespark -3 points 10 months ago

The Times articles are not "packed into a product", FFS. How is this so hard for people to grasp? The simple act of parsing data changes it. If digesting media is theft, then every single meme is piracy, and every person who's ever been to a museum, watched a play, or a movie, or read a book, is guilty of "stealing" copyrighted material every single time they've done so.

It is genuinely mind-boggling how so many find this basic, crystal-clear concept so fucking challenging to grasp.

[–] LemmyIsFantastic 5 points 10 months ago* (last edited 10 months ago) (1 children)

Holy Christ, this. This is what people are missing. All of these suits being brought against AI boil down to this, and unless the law changes (not implying it should either way), these suits are dumb.

Me using your public works and deriving my own, machine-helped or not, has never been something copyright protects against.

[–] FireTower 3 points 10 months ago* (last edited 10 months ago) (1 children)

That depends on the nature of the derivative and the license the original work was made under. Fair use is an exception to copyright law, and its applicability depends on several factors.

If you publish a photo under a non-commercial use license, the NY Times can't just publish a cropped black-and-white version of it in their paper without arranging a deal with you.

But someone else could write a blog post critiquing your photo, and show the photo in the process.

The contention is around whether AI tools meet the fair use standard.

[–] littlebluespark -2 points 10 months ago* (last edited 10 months ago)

Your initial example is poorly constructed as it implies that, much like republishing a cropped section of an original photo, AI is "generating" its results by merely stitching quotes together. That could not be further from the truth, and perpetuating that misconception is irresponsible and unhelpful.

A more accurate analogy would describe an original photograph as one in a volume compiled to refine specific visual details (the building structure and style of a specific place, the fashion of an era, a photography style, etc.) to better enable the LLM's text-to-graphic mechanics.

edit: thanks to the silent cowardly anon for the downvote seconds after posting. jog on, li'l sweaty thing, jog on.