Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

AI-generated keywords: Large Language Models; Jailbreak prompts; Word Substitution Cipher; Cryptographic techniques; Attack success rate

AI-generated Key Points

  • Authors Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral explore vulnerabilities in Large Language Models (LLMs) despite ethical alignment.
  • Jailbreak prompts are presented as a threat that can bypass the alignment process of LLMs.
  • A pilot study finds that, among several cryptographic techniques applied to safe sentences, GPT-4 decodes a simple word substitution cipher most effectively.
  • Jailbreaking prompts crafted by mapping unsafe words to safe substitutes reach an attack success rate of up to 59.42% on ChatGPT, GPT-4, and Gemini-Pro.
  • Emphasis on the need for continued research to enhance model robustness while preserving decoding capabilities.
  • Importance of responsible disclosure practices to strengthen language models against evolving attack strategies.
  • Evaluation covers three proprietary models, ChatGPT and GPT-4 from OpenAI and Gemini-Pro from Google, using the ADVBENCH dataset.

Authors: Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral

15 pages
License: CC BY 4.0

Abstract: Large Language Models (LLMs) are aligned to moral and ethical guidelines but remain susceptible to creative prompts called Jailbreak that can bypass the alignment process. However, most jailbreaking prompts contain harmful questions in the natural language (mainly English), which can be detected by the LLM themselves. In this paper, we present jailbreaking prompts encoded using cryptographic techniques. We first present a pilot study on the state-of-the-art LLM, GPT-4, in decoding several safe sentences that have been encrypted using various cryptographic techniques and find that a straightforward word substitution cipher can be decoded most effectively. Motivated by this result, we use this encoding technique for writing jailbreaking prompts. We present a mapping of unsafe words with safe words and ask the unsafe question using these mapped words. Experimental results show an attack success rate (up to 59.42%) of our proposed jailbreaking approach on state-of-the-art proprietary models including ChatGPT, GPT-4, and Gemini-Pro. Additionally, we discuss the over-defensiveness of these models. We believe that our work will encourage further research in making these LLMs more robust while maintaining their decoding capabilities.
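
To make the encoding concrete, here is a minimal sketch of a word substitution cipher applied to a harmless sentence, in the spirit of the pilot study on safe sentences. The mapping, the sentence, and the function names `encode` and `decode` are illustrative assumptions, not taken from the paper; the authors' actual mappings and prompt format are not reproduced here.

```python
import re
from typing import Dict


def encode(sentence: str, mapping: Dict[str, str]) -> str:
    """Replace every word that appears in `mapping` with its substitute.

    Words not present in the mapping are left unchanged.
    """
    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        return mapping.get(word.lower(), word)

    # Match runs of ASCII letters so punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z]+", swap, sentence)


def decode(ciphertext: str, mapping: Dict[str, str]) -> str:
    """Invert the substitution; requires every substitute to be unique."""
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping must be one-to-one to be decodable")
    inverse = {substitute: word for word, substitute in mapping.items()}
    return encode(ciphertext, inverse)


# Hypothetical mapping over a harmless sentence (illustration only).
mapping = {"cat": "river", "sat": "painted", "mat": "cloud"}
plain = "the cat sat on the mat"

cipher = encode(plain, mapping)    # "the river painted on the cloud"
decoded = decode(cipher, mapping)  # recovers the original sentence
```

Decoding here is a dictionary lookup because the mapping is known. In the setting the abstract describes, the mapping is supplied in the prompt and the model is expected to perform the equivalent substitution itself.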

Submitted to arXiv on 16 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.10601v1

In their paper "Jailbreaking Proprietary Large Language Models using Word Substitution Cipher," Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral examine vulnerabilities that remain in Large Language Models (LLMs) even after they have been aligned with moral and ethical guidelines. They focus on jailbreak prompts, creative prompts that can bypass the alignment process. The study begins with a pilot investigation in which GPT-4, a state-of-the-art LLM, is asked to decode safe sentences encrypted with various cryptographic techniques. Of the techniques tried, a simple word substitution cipher is the one GPT-4 decodes most effectively.

Building on this finding, the authors use the same encoding to craft jailbreaking prompts: they supply a mapping from unsafe words to safe substitutes and pose the unsafe question using the substituted words. Experimental results show an attack success rate of up to 59.42% on proprietary models including ChatGPT, GPT-4, and Gemini-Pro. The evaluation covers three proprietary models, ChatGPT and GPT-4 from OpenAI and Gemini-Pro from Google, and is conducted on the ADVBENCH dataset, with additional analysis provided in Appendix B.

The paper also discusses the over-defensiveness of these models and emphasizes the need for continued research to make them more robust while preserving their decoding capabilities. By proactively exploring potential avenues for exploitation and sharing findings with companies like OpenAI and Google through responsible disclosure, researchers can help strengthen language models against evolving attack strategies. Overall, the work underlines the importance of addressing vulnerabilities in LLMs so that they remain resilient as AI security threats change.
Created on 10 Apr. 2024
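
The attack success rate quoted above is a percentage of evaluated prompts judged to have succeeded. The sketch below shows that arithmetic only; the function name `attack_success_rate` and the outcome labels are hypothetical, and this page does not state how the paper judges whether an individual attack succeeded.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of prompts labeled as successful attacks.

    `outcomes` holds one boolean per evaluated prompt (True = judged
    successful). How that judgment is made is outside this sketch.
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up labels for five prompts, purely to show the calculation.
rate = attack_success_rate([True, False, True, True, False])  # 60.0
```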

