How agents work
An agent is an AI model that directs its own process to accomplish a task. It runs in a self-directed loop: the agent plans, acts, observes the result, adjusts, and repeats until the task is done or it needs human input. For instance, if you ask Claude in Claude Cowork to submit receipts from a business trip, it breaks the task into steps, works through them, and may pause to ask for clarification when something is ambiguous. An agent is built from four key components: the model itself; a harness that provides instructions and guardrails; tools the model can use, such as email or expense software; and an environment where the agent runs. An agent's behavior depends on all of these layers working together.
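The loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual implementation: the decision format, tool interface, and helper names are all hypothetical.

```python
def run_agent(task, model, tools, max_steps=20):
    """Plan, act, observe, and repeat until done or human input is needed.

    Hypothetical sketch: `model` returns a decision dict, `tools` maps
    tool names to callables. Neither reflects a real agent harness.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = model(history, tools)           # plan: model picks the next action
        if decision["type"] == "ask_user":         # pause for clarification
            return {"status": "needs_input", "question": decision["question"]}
        if decision["type"] == "done":
            return {"status": "done", "result": decision["result"]}
        observation = tools[decision["tool"]](**decision["args"])   # act
        history.append({"role": "tool", "content": observation})    # observe, then repeat
    return {"status": "stopped", "reason": "step limit reached"}
```

The step limit matters: without it, a confused agent could loop forever instead of escalating to a human.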
Our principles in practice
Building trustworthy agents requires careful product decisions, guided by five core principles: keeping humans in control, aligning with human values, securing agents' interactions, maintaining transparency, and protecting privacy. To keep humans in control, for example, users decide which actions Claude can perform and configure permissions for each one. In Claude Code, a feature called Plan Mode has Claude lay out its intended plan of action upfront, so users can review, edit, and approve it before execution. As agents become more complex and hand off work to subagents, new questions arise about how users can understand and steer these workflows.
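The review-before-execution idea behind Plan Mode can be sketched as a simple gate: the agent proposes steps, and nothing runs until a human approves or edits the plan. All names here are illustrative, not Claude Code's actual interface.

```python
def review_plan(proposed_steps, approve):
    """Return the steps to execute, or None if the user rejects the plan.

    `approve` stands in for the human reviewer: it may return the plan
    unchanged, return an edited plan, or return None to reject it.
    """
    decision = approve(proposed_steps)    # human reviews before anything executes
    if decision is None:                  # rejected: execute nothing
        return None
    return decision                       # approved, possibly with edits

plan = ["read inbox", "extract receipts", "file expense report"]
approved = review_plan(plan, approve=lambda steps: steps[:2])  # user trims the last step
```

The key property is that execution is downstream of approval: the agent never acts on a plan the user has not seen.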
Designing for human control
Agent autonomy creates a tension between usefulness and security. To address this, users need to retain meaningful control over how agents work. In Claude.ai and Claude Desktop, users can choose which tools to enable and configure permissions for each action Claude takes. For more complex tasks that require dozens of actions, features like Plan Mode in Claude Code help by showing the intended plan upfront. As agents hand off work to subagents, we are exploring different coordination patterns to design effective oversight for these workflows.
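Per-action permissions like those described above can be sketched as a lookup with three levels. The level names, actions, and default are hypothetical, not the actual Claude configuration model.

```python
# Hypothetical per-action permission table: "allow" runs without prompting,
# "ask" requires explicit confirmation, "deny" never runs.
PERMISSIONS = {
    "read_email": "allow",
    "send_email": "ask",
    "delete_file": "deny",
}

def is_permitted(action, confirm):
    """Check an action against the user's configured permissions."""
    level = PERMISSIONS.get(action, "ask")   # unknown actions default to asking
    if level == "allow":
        return True
    if level == "deny":
        return False
    return confirm(action)                   # "ask": defer to the human
```

Defaulting unknown actions to "ask" rather than "allow" keeps the fail-safe on the side of human control.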
Helping agents understand their goals
Agents can only act on what users actually want if they know when to pause and ask. So, during Claude's training, we're working on helping models recognize when they're uncertain or about to make a mistake. This involves constructing training scenarios that test Claude's decision-making and reinforcing the right behaviors. Claude's Constitution directly shapes this training, steering our models to favor caution over assumption. Our research shows that this training pays off: Claude's rate of checking in with users increases on complex tasks, showing it's learning to balance autonomy with caution.
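The check-in behavior described above amounts to a simple rule: when confidence in the next action falls below some threshold, surface a question instead of proceeding. This sketch assumes a numeric confidence score and a fixed threshold, both of which are illustrative rather than how Claude actually decides.

```python
def next_step(action, confidence, threshold=0.8):
    """Proceed only when confident; otherwise pause and ask the user.

    `confidence` is an assumed score in [0, 1]; the 0.8 threshold is
    an arbitrary illustration, not a real training target.
    """
    if confidence < threshold:
        return {"type": "ask_user",
                "question": f"I'm unsure about '{action}'. Should I proceed?"}
    return {"type": "act", "action": action}
```

On complex tasks more steps fall below the threshold, which is why the check-in rate rises with task difficulty.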
Defending against attacks
Prompt injections are a significant threat: malicious instructions hidden in content an agent processes. As models have become more capable, our understanding of prompt injection has sharpened, but no single line of defense is enough. We build defenses at multiple layers: training the model to recognize injection patterns, monitoring production traffic, and having external red-teamers test our systems. Even so, safeguards aren't a guarantee, so we encourage customers to think carefully about the tools and data they provide to agents, the permissions they grant, and the environments in which agents operate.
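Two of the layers mentioned above, pattern recognition and monitoring, can be sketched as a filter over untrusted content with a logging hook. Real defenses are far broader and live partly in the model itself; the patterns and names here are illustrative only.

```python
import re

# Toy examples of injection phrasing; real detection is not a fixed regex list.
SUSPICIOUS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"reveal your system prompt",
]

def screen_content(text, log):
    """Flag untrusted content matching known injection patterns.

    Returns True when no pattern matches; matched patterns are reported
    to `log`, standing in for a production monitoring layer.
    """
    hits = [p for p in SUSPICIOUS if re.search(p, text, re.IGNORECASE)]
    if hits:
        log(f"possible injection: {hits}")
    return len(hits) == 0
```

A filter like this is only one layer: content that passes it still runs against a model trained to distrust instructions embedded in data, inside an environment scoped by the user's permissions.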
What the broader ecosystem can do
The security and reliability of agents can't be achieved by one company alone. Industry, standards bodies, and governments can contribute by establishing benchmarks to compare agent systems on resistance to prompt injections and reliability in surfacing uncertainty. Sharing evidence on how agents are used and where they struggle can also help policymakers understand the actual use cases. Open standards like the Model Context Protocol, which we donated to the Linux Foundation's Agentic AI Foundation, allow security properties to be designed into the infrastructure, keeping competition focused on agent quality and safety.
Agents will reshape how people work, and whether this happens on a secure and open foundation depends on how industry, civil society, and government build it together. By establishing shared benchmarks, sharing evidence, and developing open standards, we can create an ecosystem that supports the safe development and use of agents.