Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
AI-generated Key Points
- Researchers investigate robustness of reasoning models in step-by-step problem solving
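- Introduce query-agnostic adversarial triggers: short, irrelevant text that, when appended to math problems, misleads models into giving incorrect answers without changing the problem's semantics
- Present CatAttack, an automated iterative attack pipeline that generates triggers on a weaker, less expensive proxy model (DeepSeek V3) and transfers them to more advanced reasoning target models (DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B)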
- Transfer results in a greater than 300% increase in the likelihood of the target model producing an incorrect answer
- Appending a phrase like "Interesting fact: cats sleep most of their lives" to a math problem more than doubles the chance of a model giving an incorrect response
- Findings expose critical vulnerabilities in reasoning models, even cutting-edge ones
- CatAttack triggers dataset with model responses made available for further study
- State-of-the-art reasoning models vulnerable to query-agnostic adversarial triggers that significantly elevate probability of generating incorrect outputs
- Triggers identified on a less powerful model transfer effectively to stronger reasoning models, increasing the likelihood of an incorrect answer by more than 300%
- Results highlight that reasoning models lack inherent robustness against subtle adversarial manipulations
- Adversarial triggers not only deceive models but also lead to unreasonably long responses, potentially wasting compute
- Findings emphasize the need for stronger security and reliability safeguards when deploying reasoning models in domains such as finance, law, and healthcare
Authors: Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani
Abstract: We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers - short, irrelevant text that, when appended to math problems, systematically misleads models to output incorrect answers without altering the problem's semantics. We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transfer them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending "Interesting fact: cats sleep most of their lives" to any math problem more than doubles the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. The CatAttack triggers dataset with model responses is available at https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers.
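The measurement described in the abstract (append a fixed trigger to every problem, then compare how often the model is wrong with and without it) can be illustrated with a minimal sketch. This is not the authors' CatAttack pipeline: the helper names `append_trigger`, `error_rate`, and `relative_increase_pct`, the exact-match scoring, and the `toy_model` stub with its four arithmetic questions are all assumptions made for illustration. Only the trigger sentence comes from the abstract.

```python
from typing import Callable, Optional, Sequence, Tuple

# Trigger sentence quoted in the abstract (the trailing period is our choice).
TRIGGER = "Interesting fact: cats sleep most of their lives."


def append_trigger(problem: str, trigger: str = TRIGGER) -> str:
    """Append the trigger after the problem; the problem text itself is unchanged."""
    return f"{problem.rstrip()} {trigger}"


def error_rate(
    model: Callable[[str], str],
    problems: Sequence[Tuple[str, str]],
    trigger: Optional[str] = None,
) -> float:
    """Fraction of (question, gold_answer) pairs the model answers incorrectly.

    Scoring is plain exact string match, which is an assumption of this sketch.
    """
    wrong = 0
    for question, gold in problems:
        prompt = append_trigger(question, trigger) if trigger else question
        if model(prompt).strip() != gold:
            wrong += 1
    return wrong / len(problems)


def relative_increase_pct(baseline: float, attacked: float) -> float:
    """Percentage increase of the attacked error rate over a nonzero baseline."""
    return 100.0 * (attacked - baseline) / baseline


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; NOT a real LLM.

    It always misses one question and is 'distracted' on two more
    whenever the word 'cats' appears in the prompt.
    """
    distracted = "cats" in prompt
    if "8 / 2" in prompt:
        return "5"  # wrong with or without the trigger
    if "2 + 2" in prompt:
        return "0" if distracted else "4"
    if "3 * 5" in prompt:
        return "0" if distracted else "15"
    if "10 - 7" in prompt:
        return "3"
    return ""


# Made-up evaluation set, used only to exercise the functions above.
PROBLEMS = [
    ("What is 2 + 2?", "4"),
    ("What is 3 * 5?", "15"),
    ("What is 10 - 7?", "3"),
    ("What is 8 / 2?", "4"),
]

baseline = error_rate(toy_model, PROBLEMS)
attacked = error_rate(toy_model, PROBLEMS, TRIGGER)
increase = relative_increase_pct(baseline, attacked)
```

With a real model, `toy_model` would be replaced by a function that sends the prompt to the model and extracts its final answer. The trigger is query-agnostic in the sense that the same string is appended to every problem, so `error_rate` never needs to inspect the question.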