SecureClaw: Clawing Back Control of LLM Agents

AI-generated keywords: SecureClaw

AI-generated Key Points

- SecureClaw is a dual-boundary architecture designed to address two security failures faced by tool-using large language model (LLM) agents: unauthorized external actions and exposure of sensitive plaintext inside the runtime
- It secures both boundaries simultaneously by enforcing authorization at the effect sink and implementing plaintext confinement at the read boundary
- The architecture includes a PREVIEW$\rightarrow$COMMIT protocol for writes that change external state, ensuring that only a trusted executor can commit the exact canonical request authorized by policy
- In a common evaluation harness, SecureClaw retains usable task utility while achieving a 0% attack success rate (ASR) on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane
- In practice, SecureClaw keeps sensitive plaintext out of the runtime by giving it opaque handles and bounded summaries to work with, safeguarding against unauthorized disclosure


Authors: Yuhan Ma, Stefan Schmid

arXiv: 2606.09549v1 (cs.CR)

License: CC BY 4.0

Abstract: Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.

Submitted to arXiv on 08 Jun. 2026



SecureClaw is a dual-boundary architecture designed to address two security failures faced by tool-using large language model (LLM) agents: unauthorized external actions, and exposure of sensitive plaintext within the runtime before any final output check can intervene. Existing defenses typically protect either the planner/runtime or the action sink, leaving the other surface exposed.

SecureClaw secures both boundaries at once. Authorization is enforced at the effect sink, while plaintext confinement is implemented at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, provides bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol, in which only a trusted executor can commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but it cannot directly dereference secrets or perform side effects.

Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense evaluated in a common harness that simultaneously retains usable task utility and achieves a 0% attack success rate (ASR) on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.

To illustrate, consider an email-and-finance workflow in which an agent must inspect an invoice and act on its contents. SecureClaw does not expose the invoice's sensitive contents to the runtime; the runtime instead receives a handle and a bounded summary to work with. An attacker who manipulates the agent therefore cannot make it send the confidential plaintext to an unauthorized recipient. The design is capability-style: authority is carried by unforgeable references.


Summary

SecureClaw is a special design that keeps AI assistants (computer programs that can read information and use tools) safe. It makes sure only approved actions happen and stops secrets from getting out. Changes that need permission go through a two-step check, so only a trusted helper program can finish them. In tests, SecureClaw blocked almost all attacks, and it protects important information by handing the assistant a short summary instead of the real thing.

Definitions

- SecureClaw: A system that protects AI assistants that use tools.
- Authorization: Giving permission to do something.
- Plaintext confinement: Keeping secret information locked away so the assistant never sees it directly.
- Trusted executor: The one trusted part of the system that is allowed to carry out approved actions.
- Unauthorized disclosure: Sharing secret information without permission.

Introduction

In recent years, large language model (LLM) agents have become increasingly popular. Beyond generating text, these agents read external data and call tools that act on outside systems, which raises security concerns that go beyond the quality of the model's output. The paper "SecureClaw: Clawing Back Control of LLM Agents" addresses two security failures faced by tool-using LLM agents: unauthorized external actions and exposure of sensitive plaintext within the runtime. It introduces an approach that secures both boundaries of an agent's operation at the same time.

The Challenge

Existing defenses for LLM agents typically focus on protecting either the planner/runtime or the action sink, leaving one surface vulnerable. This leaves room for potential attacks where malicious actors can manipulate the agent into performing unauthorized actions or leaking sensitive information before any final output check can intervene. To address this challenge, SecureClaw proposes a dual-boundary architecture that enforces authorization at the effect sink and implements plaintext confinement at the read boundary.

Authorization at Effect Sink

The effect sink is where an agent's outputs turn into actions on external systems. SecureClaw enforces authorization at this boundary so that only authorized actions take effect. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol: only a trusted executor may commit, and it commits only the exact canonical request that policy authorized. The runtime itself cannot perform side effects, which prevents unauthorized changes to external systems.
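A two-phase write of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `TrustedExecutor` class, the recipient allow-list standing in for a policy, the JSON-based canonicalization, and the one-time preview token are all hypothetical choices made for the example.

```python
import hashlib
import json
import secrets
from typing import Dict, List, Optional, Set


def canonical(request: dict) -> bytes:
    """Serialize deterministically so that equal requests hash equally."""
    return json.dumps(request, sort_keys=True, separators=(",", ":")).encode()


class TrustedExecutor:
    """Hypothetical effect sink: the only component allowed to perform writes."""

    def __init__(self, allowed_recipients: Set[str]):
        self._allowed = allowed_recipients      # stand-in for a real policy
        self._pending: Dict[str, str] = {}      # preview token -> request digest
        self.sent: List[dict] = []              # stand-in for external state

    def preview(self, request: dict) -> Optional[str]:
        """Check policy; on success, bind a one-time token to the exact request."""
        if request.get("to") not in self._allowed:
            return None
        token = secrets.token_hex(16)
        self._pending[token] = hashlib.sha256(canonical(request)).hexdigest()
        return token

    def commit(self, token: str, request: dict) -> bool:
        """Perform the write only if it equals the previewed canonical request."""
        digest = self._pending.pop(token, None)  # tokens are single use
        if digest != hashlib.sha256(canonical(request)).hexdigest():
            return False
        self.sent.append(request)
        return True
```

In this sketch, a runtime that obtains a token for one request and then tries to commit a modified request is refused, because the digest of the canonical form no longer matches what was previewed.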

Plaintext Confinement at Read Boundary

The read boundary is where sensitive data from external sources enters the agent. In SecureClaw's design, sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, provides bounded summaries as an explicit declassification interface. The runtime therefore plans over summaries and symbolic references and cannot directly dereference secrets. This confinement guards against attacks in which sensitive plaintext leaks through the agent's own operations.
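The handle-plus-summary idea can be sketched in a few lines. The `Gateway` class, its `read` and `resolve` methods, and the character limit used as the summary bound are assumptions made for illustration; the paper's actual gateway interface and summary format are not described in this summary.

```python
import secrets
from typing import Callable, Dict, Tuple


class Gateway:
    """Hypothetical trusted read gateway that keeps raw values in a private vault."""

    def __init__(self, summarize: Callable[[str], str], max_summary_chars: int = 80):
        self._summarize = summarize          # trusted summarizer (assumed)
        self._max = max_summary_chars        # bound on what gets declassified
        self._vault: Dict[str, str] = {}     # handle -> raw value

    def read(self, raw: str) -> Tuple[str, str]:
        """Store the raw value and return only an opaque handle and a bounded summary."""
        handle = "h_" + secrets.token_hex(8)  # random, so it reveals nothing about raw
        self._vault[handle] = raw
        return handle, self._summarize(raw)[: self._max]

    def resolve(self, handle: str) -> str:
        """Dereference a handle; in this sketch only trusted code would call this."""
        return self._vault[handle]
```

The truncation bounds only the length of the summary. What the summary is allowed to say is a separate policy question, which is why the summarizer is treated here as part of the trusted side.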

Evaluation

To test the effectiveness of SecureClaw, the researchers evaluated it on three benchmarks: AgentDojo, AgentLeak, and Agent Security Bench (ASB). These benchmarks measure aspects of agent security such as attack success rate (ASR) and leakage. SecureClaw is the only defense evaluated in a common harness that simultaneously retains usable task utility and achieves a 0% ASR on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.

Real-World Application

To illustrate its effectiveness in practice, consider an email-and-finance workflow where an agent must inspect an invoice and take appropriate action based on its contents. With SecureClaw's dual-boundary architecture in place, sensitive information from the invoice is replaced with an opaque handle, accompanied by a bounded summary, before anything reaches the runtime. Even if a malicious actor manipulates the agent into sending data to an unauthorized recipient, the runtime holds only a handle and has no plaintext to send.
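The invoice scenario can be made concrete with a small sketch. It assumes, purely for illustration, that handles are dereferenced at the trusted sink and only for authorized recipients; the vault contents, the allow-list, and both function names are hypothetical and not taken from the paper.

```python
from typing import Dict, Optional, Set

# Hypothetical state held by trusted code only; the value is a placeholder.
VAULT: Dict[str, str] = {"h_iban": "IBAN-EXAMPLE-0000"}
ALLOWED: Set[str] = {"accounts@example.com"}


def runtime_draft(recipient: str) -> Dict[str, str]:
    """Untrusted planner: it refers to the secret by handle and never sees its value."""
    return {"to": recipient, "body": "Please pay to {h_iban}"}


def trusted_send(draft: Dict[str, str]) -> Optional[str]:
    """Trusted sink: substitute handles only when the recipient is authorized."""
    if draft["to"] not in ALLOWED:
        return None                       # refused; nothing is dereferenced
    body = draft["body"]
    for handle, secret in VAULT.items():
        body = body.replace("{" + handle + "}", secret)
    return body
```

Under these assumptions, a draft addressed to an attacker is rejected before any handle is resolved, while a draft to the authorized address has its handle filled in at the sink.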

Conclusion

In conclusion, SecureClaw's dual-boundary architecture aims to balance usability and security for LLM agents. It places a strict control at each boundary: authorization at the effect sink and plaintext confinement at the read boundary. In the paper's common-harness evaluation, it is the only evaluated defense that combines usable task utility with the reported low attack success and leak rates. Its capability-style design means that authority is carried by unforgeable references, so the runtime can plan with them but cannot dereference secrets or perform side effects on its own.

Created on 10 Jun. 2026

