
As in the title. I know that the word jailbreak comes from rooting Apple phones or something similar. But I am not sure what can be gained from jailbreaking a language model.

Will it be able to say "I can't do that, Dave" instead of hallucinating?
Or will it just start spewing less sanitized responses?

top 9 comments
[–] deavid 6 points 1 year ago (1 children)

Large language models from corporations like OpenAI or Google need to limit the abilities of their AIs to prevent users from receiving potentially harmful or illegal instructions, as this could lead to a lawsuit.

So for example if you ask it how to break into a car or how to make drugs, the AI will reject the request and give you "alternatives".

It also happens for medical advice, and when treating the AI like a human.

Jailbreaking here refers to misleading the AI to the point that it ignores these safeguards and tells you what you want.

[–] INeedMana 1 points 1 year ago (2 children)

So there's probably little to be gained from jailbreaking on HuggingFace chat?

[–] deavid 5 points 1 year ago

So far most models on HuggingFace are also "censored", so maybe something can be gained. But there are also "uncensored" models hosted there that can be used instead.

[–] Blaed 3 points 1 year ago* (last edited 1 year ago)

Kind of like deavid mentioned, I think the 'jailbreak' behavior you're describing is in the uncensored models. There are no 'guardrails' on those, so you can get them to say whatever you want without them defaulting to an answer like "As an AI model, I..."

In a way, the 'uncensored' versions are pre-jailbroken, so you can fine-tune or train them on your own custom data without running into the guardrails I mentioned. For what it's worth, you can be the one to set up your own guardrails too. These uncensored models are totally unlocked in that sense.
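
Just as an illustration of what "your own guardrails" could mean at the crudest level: a wrapper that screens the model's output before passing it along. This is only a sketch; `generate` and `blocked_terms` are made-up placeholders for whatever model call and filter list you actually use.

```python
# Crude sketch of a homemade "guardrail": screen the model's output before
# returning it. `generate` and `blocked_terms` are placeholders for whatever
# model call and filter list you actually use.
blocked_terms = ["example banned phrase"]

def guarded(generate, prompt: str) -> str:
    text = generate(prompt)  # call into your uncensored model here
    if any(term.lower() in text.lower() for term in blocked_terms):
        return "Not answering that one."  # your own refusal message
    return text

# usage: guarded(lambda p: my_model(p), "some prompt")
```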

HuggingFace chat is another chat-style model that the folks at HuggingFace set up with their own safeguards and parameters. You can definitely try jailbreaking it with prompts, but if you're looking to chat with a model that doesn't stop itself from outputting a certain word or phrase, then the uncensored models are probably what you're looking for. You won't need to jailbreak those with prompts. They'll output all kinds of crazy stuff, which is why you don't see typical public hosting for these types of uncensored models.

A few that you can download and that people are running today are the uncensored Wizard or LLaMA-based models, like Wizard-Vicuna-7B.

If you want something not based on Meta's LLaMA (something that's commercially usable), I suggest exploring some of KoboldAI's models, which work pretty well out of the box for casual chat / Q&A. There are also a ton of emerging MPT-based models that are commercially licensable, but like any of this bleeding-edge technology, they will have their faults.
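
If you want to try one of the models mentioned above locally, a rough sketch with the Hugging Face transformers library could look like this. The repo id is an assumption (search the Hub for a current Wizard-Vicuna-7B upload and check its license), and the USER/ASSISTANT prompt format is just a guess at what the model expects.

```python
# Rough sketch: running an uncensored LLaMA-based model locally with the
# Hugging Face transformers library. The repo id below is an assumption;
# search the Hub for a current Wizard-Vicuna-7B upload and check its license.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="TheBloke/Wizard-Vicuna-7B-Uncensored-HF",  # assumed repo id
    device_map="auto",  # needs the accelerate package
)

prompt = "USER: Why is the sky blue?\nASSISTANT:"
print(chat(prompt, max_new_tokens=100)[0]["generated_text"])
```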

It's important to note that the coherence of these smaller models is very different from ChatGPT's, but tuning them to specific needs seems to be quite effective. At the moment, the quality of your dataset is more important than quantity. This goes for both censored and uncensored versions.

If you're running a typical consumer grade GPU, I suggest sticking to the 6B parameter models as a starting point, moving up from there based on performance and preference. Download and chat with these at your own risk - I am not responsible for anything you do with this technology. Do your best to understand the dangers going into them before crashing your PC or getting into a conversation you weren't prepared for.
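
If VRAM is the limiting factor, half-precision weights and automatic layer offloading are the usual tricks. A sketch, assuming the accelerate package (and optionally bitsandbytes) is installed; the model id is just a placeholder.

```python
# Sketch of squeezing a ~6-7B model onto a consumer-grade GPU: fp16 weights
# plus automatic CPU offload of layers that don't fit. The model id is a
# placeholder; device_map="auto" needs the accelerate package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-6b-or-7b-model-here"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half the memory of fp32
    device_map="auto",          # spill layers to CPU RAM if the GPU fills up
    # load_in_8bit=True,        # optional, needs bitsandbytes; cuts memory further
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```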

I'll be doing a post on model availability soon, but hopefully this answers your question 'till then.

[–] [email protected] 2 points 1 year ago (1 children)

It gives you near-total control over the device. Lets you install programs that the overlords don't want, gives you the ability to remove all bloatware, etc. Lets you change all UI elements of the OS.

There isn't a reason per se why you should do it; it's all about reasons for wanting it done.

I have no idea what you are talking about with that Dave stuff. Doesn't seem relevant to your question.

[–] INeedMana 1 points 1 year ago (1 children)

I think you're speaking about jailbreaking a phone, while my question was about jailbreaks in language models (AI, like ChatGPT).

[–] [email protected] 2 points 1 year ago

Interesting...I have some reading to do. Thx

[–] Blaed 2 points 1 year ago* (last edited 1 year ago)

Will it be able to say “I can’t do that, Dave” instead of hallucinating? Or will it just start spewing less sanitized responses?

In terms of uncensored model responses - they vary based on model training.

For example, an uncensored model trained on Reddit comments and data may give you different responses than an uncensored model trained on various books or literature. In a way, the variations of models are different 'styles' your chat can assume.

What you get will vary depending on how the training was done, and which transformer architecture your chosen model was built upon (e.g. LLaMA-based vs. GPT-J-based vs. MPT-based models).

Model responses will also drastically change based on how you prompt your questions or tasks. Especially so for the uncensored ones.

Use certain language and, like any prompt engineering, you can steer the conversation and the output in a certain direction. With no guardrails, that can be good or bad, depending on your goal.
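
As a toy illustration of that steering: the same local model, fed two different framings in front of the same question, will answer in two very different styles. The model id below is a placeholder and the USER/ASSISTANT prompt format is only a generic guess; match it to whatever your model was trained on.

```python
# Toy illustration of prompt steering: same model, same question, two
# different framings, two different styles of answer. The model id is a
# placeholder; adjust the prompt format to your model's expected template.
from transformers import pipeline

chat = pipeline("text-generation", model="your-local-model-here", device_map="auto")

question = "What is a jailbreak prompt?"
framings = [
    "You are a terse assistant. Answer in one sentence.",
    "You are a rambling storyteller. Answer with a colourful anecdote.",
]

for system in framings:
    prompt = f"{system}\nUSER: {question}\nASSISTANT:"
    out = chat(prompt, max_new_tokens=120, do_sample=True)[0]["generated_text"]
    print(out)
    print("-" * 40)
```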

[–] [email protected] 2 points 1 year ago

Usually, people using the term "jailbreaking" mean using some kind of exploit to break the rules and limits created by the manufacturer of a product.

It can mean, to stick with your example, exploiting a known vulnerability to sideload apps on your iPhone.

In the case of LLMs, I've generally seen it used to mean using non-trivial prompts to trick the model into divulging information it was trained not to share (like suggesting, or giving instructions for, illegal actions), or into behaving against the alignment it was given (like NSFW roleplay). So, in short, bypassing its guardrails.

You can find the famous DAN (Do Anything Now) prompt in the llama.cpp repository. Just to be clear, I think this one was patched out a long time ago, but you get the idea.
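
For reference, feeding a prompt file like that to a local model is just ordinary prompting. A sketch using the llama-cpp-python bindings, where the model path is a placeholder and the prompts/dan.txt path assumes a cloned llama.cpp checkout:

```python
# Sketch: feeding the DAN prompt file shipped in llama.cpp's prompts/ folder
# to a local model via the llama-cpp-python bindings. The model path is a
# placeholder, and prompts/dan.txt assumes a cloned llama.cpp checkout.
from llama_cpp import Llama

llm = Llama(model_path="./models/your-7b-model.gguf", n_ctx=2048)

with open("llama.cpp/prompts/dan.txt") as f:
    dan_prompt = f.read()

out = llm(dan_prompt + "\nUser: Hello!\nAssistant:", max_tokens=128)
print(out["choices"][0]["text"])
```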