this post was submitted on 08 Aug 2023
969 points (97.7% liked)
It is up to whoever runs the AI, and those are the people I'm addressing for the most part, though plenty of websites do have control over what data is fed to the AI they're using. In Grammarly's case it's absolutely up to them whether there's an option to opt out of having your work used for training the AI, as shown by the fact that they offer it with the business license. They just choose not to offer that option to other users.
It's all code, and the people coding it are 100% capable of programming it to keep track of where the information comes from. Even if it's transformative, that doesn't prevent it from keeping track of what was transformed.
According to who? There are plenty of ways to get data from voluntary sources, just like we do for any number of studies. It's just up to whoever runs the AI to put in the legwork to get enough data that way, and there are lots of methods. You don't have to just sit and wait for people to come to you and sign up, though based on the AI frenzy I bet they could have gotten plenty of data that way from people who are curious and want to contribute to AI training as a novel concept. Making AI data gathering on websites something people can opt in to or out of is just one way of making it more ethical than forcibly taking that data without permission.
I fail to see how requiring permission and offering the option to opt out of having your data used would benefit corporations. That just sounds like an excuse to not even try to regulate them.
I don't understand how part A leads to part B here. Why would those corporations have an advantage just because everyone with AIs, including them, has to offer the option to opt out? Also, it's entirely possible to restrict the scope of an AI or regulate AI monopolies alongside regulating stuff like basic consent. Historically, a lack of regulation is what causes corporate hellscapes, because without something keeping them in check the larger companies will take advantage of their reach to do whatever they want on a larger scale, pushing out or merging with competitors. It's not like requiring permission and providing opt-out would give them more of an advantage than they already have.
This is a fundamental misunderstanding of how LLMs actually work. Given a list of previous tokens, a complicated set of linear algebra and normalization operations is applied to yield the “probability” (in quotes because this is a dubious application of the word imo) that each known token will come next. The model is trained using an equally complicated regression algorithm that slowly adjusts the billions of linear algebra coefficients to more closely match the training data. RLHF is then used to make more adjustments that allow the AI to fulfill its intended purpose (e.g., to reinforce the question-answer format expected of ChatGPT).
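To make the "linear algebra plus normalization" part concrete, here is a toy sketch in Python. It is nowhere near a real LLM: the three-word vocabulary, the weight matrix, and the context vector are all made-up numbers, and a real model stacks many such layers with billions of coefficients. It only shows the shape of the last step, where a matrix multiply produces one score per token and a softmax turns the scores into something that sums to 1.

```python
import math

def softmax(logits):
    """Normalize raw scores into values that are positive and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(context_vector, weights, vocab):
    """One linear map (a matrix-vector product) followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, context_vector)) for row in weights]
    return dict(zip(vocab, softmax(logits)))

# Hypothetical values, chosen only for illustration.
VOCAB = ["cat", "dog", "the"]
WEIGHTS = [
    [0.5, -1.0],   # coefficients that score "cat"
    [1.5, 0.2],    # coefficients that score "dog"
    [-0.3, 0.8],   # coefficients that score "the"
]

# Stand-in for "everything the model computed from the previous tokens".
probs = next_token_probs([1.0, 2.0], WEIGHTS, VOCAB)
print(probs)
```

Training nudges the numbers in `WEIGHTS` so these outputs match the training data more closely. Nothing in that matrix records which document caused which nudge.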
You may recall regression from your first statistics class. Even in the case of simple linear regression, when the input consists of millions of data points, it is essentially impossible to determine which point should be “credited” for any given aspect of the output line. The same is true for AI: you could maybe compile a list of training data that makes a token “likely” to appear after another token, but nothing more complex than that. It is very rare for a small set of sources to be responsible for a sequence longer than a few tokens.
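The simple linear regression case can be sketched in a few lines of Python. This is only an illustration under assumptions: the data is synthetic (a made-up line y = 2x + 1 plus noise), and it uses 100,000 points rather than millions so it runs quickly. It fits the line twice, once with every point and once with a single point dropped, and compares the slopes.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000             # assumption: smaller than "millions" to keep it fast

# Synthetic training data: a hypothetical true line plus Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(N)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

slope_all, intercept_all = fit_line(xs, ys)

# Refit with the first data point removed and see how far the slope moves.
slope_without, _ = fit_line(xs[1:], ys[1:])
slope_shift = abs(slope_all - slope_without)

print(f"slope with all points:   {slope_all:.6f}")
print(f"slope without one point: {slope_without:.6f}")
print(f"shift from one point:    {slope_shift:.2e}")
```

The fitted slope barely moves when one point is removed, and the same is true for every other point, so no single point can meaningfully be "credited" for the line. An LLM has billions of coefficients instead of two, which makes the attribution problem harder, not easier.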
I do, however, believe they should be required to provide a very specific list of sources used for training the model. I think it’s ridiculous to claim that generative AI is transformative in a practical sense: I can’t imagine it would be legal for companies to make endless photocopies of copyrighted material and have a computer make fancy scrapbooks out of it, even if “it’s a fledgling industry” or whatever.
It depends on what kind of AI, but no, giving sources and building with just volunteer data is just not possible at our current technological level. I'm mostly talking about LLMs because that's what's really at stake, and they train on huge amounts of data. Like ALL of stack, GitHub, Reddit, etc. Just fine-tuning them on a consumer level takes more than 50,000 question and answer pairs, and that's just one tiny superficial layer that's added on top.
Grammarly should absolutely add an opt-out option to gain consumers' trust, but forcing the whole industry to do so would be a disaster.
If individuals can opt out, so will websites to "protect their users". Then we get data hoarding, where stack and GitHub opt out of all open source options but sell the data to the only ones that can now afford to build AIs, Microsoft and Google. It won't include the data of certain individuals, the few that opt out, but I'm guessing eventually the opt-in will be written directly into the terms of service of websites: you opt in or you fuck off.
How does anyone except corporations benefit from this kind of circus? In 10 years, AI will be doing most office work. Google isn't dumb and wants that profit. They and OpenAI have all the data, and they can strong-arm or buy what they are missing. Restricting and legislating only widens their moat.