this post was submitted on 12 Jun 2023
14 points (100.0% liked)

Experienced Devs

cross-posted from: https://lemmy.world/post/76533

One of the arguments made for Reddit's API changes is that Reddit is now the go-to place for LLM training data (e.g. for ChatGPT).

https://www.reddit.com/r/reddit/comments/145bram/addressing_the_community_about_changes_to_our_api/jnk9izp/?context=3

I haven't seen a whole lot of discussion around this and would like to hear people's opinions. Are you concerned about your posts being used for LLM training? Do you not care? Do you prefer that your comments are available to train open source LLMs?

(I will post my personal opinion in a comment so it can be up/down voted separately)

[email protected] 7 points 1 year ago (1 children)

I do not want my content to contribute to a proprietary LLM that will make billions for a large tech company without giving back to the community. Unfortunately, I think the fediverse will have a harder time countering large-scale data harvesting than a centralized service like Reddit.

On the other hand, I don't mind open source, privacy-respecting LLMs (is that a thing for LLMs?) using my content.

FearTheCron 1 points 1 year ago

I am also wary of big tech companies using my comment history for their LLMs. However, I worry that the tech companies will scrape the data anyway and that Reddit's API pricing just locks out the open source LLMs. There are a few of those; here are a couple that I have played with:

https://github.com/nomic-ai/gpt4all

https://github.com/ggerganov/llama.cpp

Some projects even try to preserve privacy, though I think that is more a matter of what extra training data you give them and the queries you issue.

https://github.com/imartinez/privateGPT