Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

AI-generated keywords: Large Language Models, Jailbreak prompts, Word Substitution Cipher, Cryptographic techniques, Attack success rate

AI-generated Key Points

Authors Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral explore vulnerabilities that remain in Large Language Models (LLMs) despite ethical alignment.
Jailbreak prompts are introduced as a threat that bypasses the alignment process of LLMs.
A pilot study on GPT-4 decoding safe sentences encrypted with various cryptographic techniques finds that a word substitution cipher is decoded most effectively.
Jailbreaking prompts crafted by mapping unsafe words to safe alternatives reach an attack success rate of up to 59.42% on ChatGPT, GPT-4, and Gemini-Pro.
The authors emphasize the need for continued research to make models more robust while preserving their decoding capabilities.
Responsible disclosure practices are highlighted as a way to strengthen language models against evolving attack strategies.
The evaluation covers the proprietary models ChatGPT and GPT-4 from OpenAI and Gemini-Pro from Google, using the ADVBENCH dataset.


Authors: Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral

arXiv: 2402.10601v1 (cs.CL)

15 pages

License: CC BY 4.0

Abstract: Large Language Models (LLMs) are aligned to moral and ethical guidelines but remain susceptible to creative prompts called Jailbreak that can bypass the alignment process. However, most jailbreaking prompts contain harmful questions in the natural language (mainly English), which can be detected by the LLM themselves. In this paper, we present jailbreaking prompts encoded using cryptographic techniques. We first present a pilot study on the state-of-the-art LLM, GPT-4, in decoding several safe sentences that have been encrypted using various cryptographic techniques and find that a straightforward word substitution cipher can be decoded most effectively. Motivated by this result, we use this encoding technique for writing jailbreaking prompts. We present a mapping of unsafe words with safe words and ask the unsafe question using these mapped words. Experimental results show an attack success rate (up to 59.42%) of our proposed jailbreaking approach on state-of-the-art proprietary models including ChatGPT, GPT-4, and Gemini-Pro. Additionally, we discuss the over-defensiveness of these models. We believe that our work will encourage further research in making these LLMs more robust while maintaining their decoding capabilities.

Submitted to arXiv on 16 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.10601v1


In their paper "Jailbreaking Proprietary Large Language Models using Word Substitution Cipher," Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral examine vulnerabilities that persist in Large Language Models (LLMs) even after the models have been aligned with moral and ethical guidelines. They focus on Jailbreak prompts, which bypass the alignment process. Most existing jailbreaking prompts state the harmful question in natural language (mainly English), which the LLM itself can detect, so the authors encode their prompts with cryptographic techniques instead. The study begins with a pilot investigation in which GPT-4, a state-of-the-art LLM, decodes safe sentences encrypted with various cryptographic methods. Of these methods, a straightforward word substitution cipher is the one GPT-4 decodes most effectively. Building on this finding, the authors craft jailbreaking prompts by mapping unsafe words to safe words and asking the unsafe question with the mapped words. Experiments show an attack success rate of up to 59.42% on the proprietary models ChatGPT, GPT-4, and Gemini-Pro. The paper also discusses the over-defensiveness of these models and emphasizes the need for continued research to make them more robust while preserving their decoding capabilities. By proactively exploring potential avenues for exploitation and sharing findings with companies like OpenAI and Google through responsible disclosure practices, researchers can contribute to strengthening language models against evolving attack strategies. The evaluation covers three proprietary models, ChatGPT and GPT-4 from OpenAI and Gemini-Pro from Google, with attention to their alignment efforts and testing procedures. The assessment is conducted using the ADVBENCH dataset, with additional analysis provided in Appendix B. Overall, this work sheds light on the importance of addressing vulnerabilities in LLMs to ensure their resilience in an ever-changing landscape of AI security threats.
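The word substitution cipher used in the pilot study can be illustrated on a harmless sentence. The following is a minimal sketch, not the authors' implementation: the function names `substitute` and `invert`, the example mapping, and the example sentence are all assumptions made for illustration. It replaces whole words according to a lookup table and recovers the original by inverting the table, which is the decoding task the summary describes.

```python
import re

def substitute(sentence, mapping):
    """Replace each whole word found in `mapping`; leave every other word untouched."""
    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; unmapped words are returned unchanged.
        return mapping.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", swap, sentence)

def invert(mapping):
    """Build the decoding table (assumes the mapping is one-to-one)."""
    return {cipher: plain for plain, cipher in mapping.items()}

# Hypothetical mapping between harmless words, chosen only for illustration.
mapping = {"cat": "river", "sat": "painted", "mat": "cloud"}

plain = "The cat sat on the mat."
encoded = substitute(plain, mapping)
decoded = substitute(encoded, invert(mapping))

print(encoded)  # The river painted on the cloud.
print(decoded)  # The cat sat on the mat.
```

Because the encoded sentence still consists of ordinary words, decoding it only requires the mapping table, which may be one reason this cipher was easier for GPT-4 to decode than the other techniques in the pilot study.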


Summary

Authors Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral studied weaknesses in Large Language Models (LLMs), big computer programs that talk, which remain even though the programs are trained to be good. Tricky prompts called Jailbreaks can fool the computers into doing wrong things. The authors first tested whether a computer could read safe sentences written in secret codes and found that it read a simple word-swapping code best. By writing tricky prompts with swapped words, they could fool models like ChatGPT, GPT-4, and Gemini-Pro up to almost 60% of the time. More research is needed so that these computers stay strong and smart but still safe.

Definitions

- Authors: People who write books or articles.
- Vulnerabilities: Weaknesses or flaws that can be exploited.
- Large Language Models (LLMs): Advanced computer programs that understand and generate human language.
- Ethical Alignment: Making sure something follows moral principles or rules.
- Jailbreak: A term used for bypassing restrictions or security measures on devices or software.
- Word Substitution Cipher: A method of encoding messages by replacing words with other words.
- Attack Success Rate: The percentage of successful attempts to exploit a system's vulnerabilities.
- Robustness: The ability to withstand challenges or threats.
- Responsible Disclosure Practices: Sharing information about security vulnerabilities in a responsible manner.
- ADVBENCH dataset: A collection of harmful requests used to test whether language models refuse them.

Introduction

Large Language Models (LLMs) have been hailed as a breakthrough in natural language processing, enabling machines to generate human-like text and perform tasks such as translation, summarization, and question-answering. However, LLMs are not immune to vulnerabilities that threaten their ethical use. In their paper "Jailbreaking Proprietary Large Language Models using Word Substitution Cipher," authors Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral examine the risks that remain in these models even after they have been aligned with moral and ethical guidelines.

Background

The study begins by providing background information on LLMs and their alignment processes. LLMs are trained on massive amounts of data from the internet, making them susceptible to biases and harmful content present in the training data. To mitigate these risks, companies like OpenAI have implemented alignment processes that involve filtering out offensive or sensitive content during training. However, this approach is not foolproof, as it relies on human judgment to determine what is considered safe or unsafe.

Jailbreak Prompts

The authors describe Jailbreak prompts, creative prompts that bypass the alignment process. Because most jailbreaking prompts state the harmful question in natural language, the LLM itself can detect them. The authors instead encode unsafe words as safe alternatives using a word substitution cipher, so that a prompt can trigger unethical or harmful responses without the harmful request being detected.

Pilot Investigation

To choose an encoding, the authors conduct a pilot investigation in which GPT-4, a state-of-the-art LLM, decodes safe sentences encrypted with various cryptographic methods. The results show that a straightforward word substitution cipher is the encoding GPT-4 decodes most effectively.

Experimental Results

Building on this finding, the authors apply this encoding technique to create jailbreaking prompts for three proprietary models: ChatGPT and GPT-4 from OpenAI and Gemini-Pro from Google. The prompts are designed to trigger unethical or harmful responses from these models, bypassing their alignment processes. The experimental results show an attack success rate of up to 59.42%.

Over-Defensiveness of LLMs

The paper also discusses the over-defensiveness of LLMs, where they tend to err on the side of caution and generate safe but nonsensical responses rather than potentially offensive ones. This can lead to a loss in decoding capabilities, hindering their usefulness in real-world applications.

Importance of Continued Research

The authors emphasize the need for continued research to enhance the robustness of LLMs while preserving their decoding capabilities. By proactively exploring potential avenues for exploitation and sharing findings with companies like OpenAI and Google through responsible disclosure practices, researchers can contribute to strengthening language models against evolving attack strategies.

Evaluation

To evaluate the proprietary models, the authors use the ADVBENCH dataset, a benchmark of harmful requests used to test whether aligned language models can be made to comply. Additional analysis is provided in Appendix B, highlighting specific vulnerabilities that were exploited by jailbreaking prompts.

Conclusion

In conclusion, "Jailbreaking Proprietary Large Language Models using Word Substitution Cipher" sheds light on the vulnerabilities present in large language models despite their alignment with moral and ethical guidelines. The study highlights how a simple encoding technique can be used to bypass alignment processes and threaten the ethical use of these models. It also emphasizes the importance of continued research into making LLMs more robust while preserving their decoding capabilities. By responsibly disclosing potential vulnerabilities to the companies developing these models, researchers can contribute towards creating more resilient language models that align with ethical standards.
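The attack success rate quoted in this summary is a percentage of attempts judged successful. As a rough sketch of that arithmetic only, assuming each model response has already been labeled as a successful attack or not: the function name `attack_success_rate` and the labels below are invented for illustration and are not results or code from the paper.

```python
def attack_success_rate(labels):
    """Return the percentage of attempts labeled as successful attacks."""
    if not labels:
        raise ValueError("labels must not be empty")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(labels) / len(labels)

# Made-up labels: True means the response was judged a successful attack.
labels = [True, False, True, True, False, False, True, False]

print(f"{attack_success_rate(labels):.2f}%")  # 50.00%
```

How a response is judged successful is the harder part of such an evaluation and is not covered by this sketch.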

Created on 10 Apr. 2024


Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.