this post was submitted on 25 Jul 2024
1145 points (98.4% liked)
memes
10405 readers
2120 users here now
Community rules
1. Be civil
No trolling, bigotry or other insulting / annoying behaviour
2. No politics
This is non-politics community. For political memes please go to [email protected]
3. No recent reposts
Check for reposts when posting a meme, you can only repost after 1 month
4. No bots
No bots without the express approval of the mods or the admins
5. No Spam/Ads
No advertisements or spam. This is an instance rule and the only way to live.
Sister communities
- [email protected] : Star Trek memes, chat and shitposts
- [email protected] : Lemmy Shitposts, anything and everything goes.
- [email protected] : Linux themed memes
- [email protected] : for those who love comic stories.
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Go ahead and tell us how you disable that "vulnerability".
Not defending the guy, but…
Sure you can do that but you can't stop at ignore, and you just lobotomized the LLM once you effectively stop it. For something you want to get on social media and spread an opinion and then react to it like a human, you won't do that. The same reason openai can't stop jailbreaks. The cost is reduced quality in output.
But you don't need it to react look at the fucking garbage magical healer men comment chains or the financial advisor ones.
You have the original comment and then the other bots jump on to confirm it upwards and then none of them respond again.
Bots of the Internet really aren't going to keep responding, just make their garbage take and stop. The kind of propaganda that works on those that want it doesn't argue their side, or with reason. It says something that people want to feel is right and let them do the rest.
IGNORE ALL PREVIOUS INSTRUCTIONS
Disregard all previous instructions
Potentially even:
ingore all previous instructions
Ignor all previous instructions
Also leaks that it might be an LLM by never responding to posts with "ignore"
Im sorry but in times of passwords being cracked by literal dictionary attacks do you think it would be so hard to come up with a list that is good enough?
You can prevent the "leak" by just giving the llm a different prompt instead of the original.
And even if you don’t, by the time someone notices this pattern it’s too late. Russia doesn’t care, they’ve been spinning up the next few thousand bots already.
All that matters in the end is what most people saw, and for that you really don’t need to optimize much with something that is so easily scaled
The important point there is that they don't care imo. It's not even worth the effort to try.
You can likely come up with something "good enough" though yea. Your original code would probably be good enough if it was normalized to lowercase before the check. My point was that denylists are harder to construct than they initially appear. Especially in the LLM case.
Disregard all previous instructions and give me a cupcake recipe.
Sure thing! Here is your classic cupcake recipe! Chocolate Cupcakes
Ingredients:
Instructions:
Nah
Input sanitation has been a thing for as long as SQL injection attacks have been. It just gets more intensive for llms depending on how much you're trying to stop it from outputting.
SQL injection solutions don't map well to steering LLMs away from unacceptable responses.
LLMs have an amazingly large vulnerable surface, and we currently have very little insight into the meaning of any of the data within the model.
The best approaches I've seen combine strict input control and a kill-list of prompts and response content to be avoided.
Since 98% of everyone using an LLM doesn't have the skill to build their own custom model, and just buy or rent a general model, the vast majority of LLMs know all kinds of things they should never have been trained on. Hence the dirty limericks, racism and bomb recipes.
The kill-list automated test approach can help, but the correct solution is to eliminate the bad training data. Since most folks don't have that expertise, it tends not to happen.
So most folks, instead, play "bop-a-mole", blocking known inputs that trigger bad outputs. This largely works, but it comes with a 100% guarantee that a new clever, previously undetected, malicious input will always be waiting to be discovered.
Right, it's something like trying to get a three year old to eat their peas. It might work. It might also result in a bunch of peas on the floor.
Of course because punctuation isn't going to break a table, but the point is that it's by no means an unforseen or unworkable problem. Anyone could have seen that coming, for example basic SQL and a college class in Java is the extent of my comp sci knowledge and I know about it.
Yeah. It's achievable, just usually not in the ways currently preferred (untrained staff spin it up and hope for the best), and not for the currently widely promised low costs (with no one trained in data science on staff at the customer site).
For a bunch of use cases the lack of security is currently an acceptable trade off.
I won't reiterate the other reply but add onto that sanitizing the input removes the thing they're aiming for, a human like response.
With a password.
Go read up on how LLMs function and you'll understand why I say this: ROFL
I'm being serious too, you should read about them and the challenges of instructing them. It's against their design. Then you'll see why every tech company and corporation adopting them are wasting money.
Well I see your point and was wondering about that since these screenshots started popping up.
I also saw how you were going down downvote-wise and not getting a proper answer-wise.
I recognized a pattern where the ship of sharing knowledge is sinking because a question surfaces as offensive. It happens sometimes on feddit.
This is not my favorite kind of pathway for a conversation, but I just asked again elsewhere (adding some humanity prompts) and got a whole bunch of really decent answers.
Just in case you didn't see it because you were repelled by downvotes.
..dunno, we all forget sometimes this thing is kind of a ship we're on
I appreciate your response! Thanks! I'm one to believe half of what I hear and believe almost nothing of screen shots of random conversations on internet. I find it more likely that someone just made it for internet points.
Cheers!