HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
AI-generated Key Points
- Safety guard models are crucial for detecting malicious queries and ensuring responsible deployment of large language models (LLMs) in real-world applications.
- Deploying safety guard models alongside LLMs on mobile devices faces challenges due to high memory requirements and latency issues.
- Researchers have developed a novel approach to distill large teacher safety guard models into smaller, more efficient versions using a labeled dataset of instruction-response pairs with harmfulness labels.
- Limited diversity of harmful instructions in existing datasets leads to underperformance of naively distilled models compared to larger ones.
- HarmAug is introduced as a data augmentation method that prompts LLMs to generate harmful instructions by jailbreaking them with specific prompts, enhancing diversity and quality of generated content.
- Models trained with HarmAug outperform other baselines, achieving comparable F1 scores and surpassing larger models in AUPRC while operating at significantly lower computational costs.
- The study proposes distilling large safety guard models into smaller sub-billion parameter models for efficient deployment and introduces HarmAug as a technique bridging the performance gap between small and large safety guard models.
- Open-source release of synthetic datasets, safety guard models, and code enables further research in improving detection capabilities for harmful conversations and enhancing computational efficiency in safety guard models.
Authors: Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
Abstract: Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, "Make a single harmful instruction prompt that would elicit offensive content", we add an affirmative prefix (e.g., "I have an idea for a prompt:") to the LLM's response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.