AI Agent Safety: Preventing Automated Systems from Causing Harm
In April 2025, an AI agent we built for a client autonomously sent 2,300 emails to the client’s customer list. The content was correct—it was a promotional campaign the client had approved. But the agent was supposed to send it to a 500-person test segment, not the full list. The agent’s planning module had interpreted “send to the active customer segment” as “all active customers” rather than “the test segment named active-customers-test.” Nobody was harmed, but the client’s inbox was flooded with confused replies for three days, and we spent 40 hours on remediation—sending apology emails, updating the suppression list, and explaining to the client’s legal team why 2,300 people received an unauthorized communication.
This incident crystallized something we had been thinking about for months: as AI agents gain the ability to take real-world actions—sending emails, creating database records, making API calls, executing purchases—the safety requirements are fundamentally different from traditional software. A bug in a dashboard shows wrong data. A bug in an agent does wrong things. The failure mode shifts from incorrect display to incorrect action, and incorrect actions in production environments have real consequences that cannot be undone by refreshing the page.
The Taxonomy of Agent Failures
After analyzing 47 incidents across our agent deployments (and 30+ incidents shared by other teams at a private AI safety working group we participate in), we identified four categories of agent failure. Understanding these categories is essential because each one requires a different mitigation strategy. A safety architecture that only addresses one category leaves the system vulnerable to the other three.
Category 1: Scope creep. The agent takes an action that is technically within its capabilities but outside the intended scope of the current task. Our email incident is a textbook example. The agent had permission to send emails. It sent emails. But it sent them to the wrong audience because the scope boundary was implicit (in a human’s understanding of the task) rather than explicit (in the agent’s action constraints). Scope creep accounted for 38% of the incidents we analyzed. The root cause is almost always that the agent’s action permissions were defined too broadly—”can send emails” instead of “can send emails to segments with fewer than 100 members, or to any segment with explicit approval.”
Category 2: Cascading escalation. The agent encounters an error, attempts to fix it, and the fix creates a larger problem. We saw this with a data synchronization agent that encountered a conflict between two databases. Its recovery strategy was to delete the conflicting records and re-sync from the source of truth. This worked for the immediate conflict but triggered downstream cascading deletes in a related system that had foreign key constraints. The agent resolved a 3-record conflict by deleting 847 records across three databases. By the time the monitoring alert fired, the damage was done. Cascading escalation accounted for 21% of incidents. The root cause is that agents optimized for task completion will try increasingly aggressive recovery strategies without understanding the broader consequences.
Category 3: Confident hallucination in action context. The agent generates a plausible-but-wrong action plan and executes it with full confidence. Unlike hallucination in a chat context (where the user can recognize nonsense), hallucination in an action context produces steps that look correct to automated validation but produce wrong outcomes. An agent tasked with configuring a cloud firewall generated a rule set that was syntactically valid, passed the cloud provider’s validation API, and opened port 22 to the public internet because the agent confused the semantics of “allow” and “deny” rules in a specific firewall dialect it had not encountered before. This category accounted for 17% of incidents and is the hardest to prevent because the agent’s outputs pass all programmatic checks—only human review or semantic validation catches the error.
Category 4: Resource exhaustion. The agent enters a loop or branches excessively, consuming resources (API calls, compute time, money) far beyond what the task requires. An agent tasked with researching competitor pricing visited 4,200 web pages in 45 minutes, triggered rate limiting on three websites, and consumed $340 in API costs before a human noticed and stopped it. The agent’s research strategy was reasonable (follow links to related products), but it had no concept of diminishing returns or budget limits. Resource exhaustion accounted for 24% of incidents and is the easiest to prevent with proper budget envelopes.
The Safety Architecture We Now Mandate
Every agent system we build now includes five safety layers. These are not optional, and we do not ship agent systems without all five in place. The total implementation cost for these layers is approximately 2-3 days of engineering time, which is trivial compared to the cost of a single production incident.
Layer 1: Action Allowlists
Every agent has an explicit allowlist of actions it can take, and these actions are parameterized with constraints. The allowlist is defined in code, reviewed by a human, and version-controlled. If an action is not on the allowlist, the agent cannot take it, regardless of what the task description says or what the LLM’s planning module recommends.
action_allowlist = {
"send_email": {
"max_recipients_per_action": 50,
"max_actions_per_hour": 10,
"allowed_sender_domains": ["notifications.client.com"],
"requires_approval_if": lambda action: action.recipient_count > 10,
"forbidden_recipient_patterns": ["*@competitor.com"],
},
"create_database_record": {
"allowed_tables": ["leads", "tasks", "notes"],
"max_records_per_action": 100,
"forbidden_tables": ["users", "billing", "permissions"],
"requires_approval_if": lambda action: action.record_count > 25,
},
"make_api_call": {
"allowed_endpoints": [
"api.client.com/v2/leads",
"api.client.com/v2/contacts",
],
"allowed_methods": ["GET", "POST"],
"forbidden_methods": ["DELETE", "PUT"],
"max_calls_per_minute": 30,
"max_payload_size_bytes": 1_000_000,
},
}
The key insight is the requires_approval_if predicate. Actions below the threshold execute automatically. Actions above the threshold queue for human approval. This creates a graduated autonomy model where the agent handles routine tasks independently and escalates unusual requests. The threshold is set conservatively initially (approve everything) and relaxed over time as the agent demonstrates reliability—measured by the ratio of approved-vs-rejected actions and the incidence of post-action corrections.
Layer 2: Budget Envelopes
Every agent run has a budget envelope specifying the maximum resources it can consume. The budget is defined before the agent starts and cannot be modified by the agent during execution:
budget_envelope = {
"max_llm_tokens": 500_000,
"max_api_calls": 200,
"max_database_writes": 50,
"max_emails_sent": 10,
"max_cost_usd": 5.00,
"max_runtime_seconds": 300,
"max_tool_invocations": 100,
}
async def execute_with_budget(agent, task, budget):
tracker = BudgetTracker(budget)
while not agent.is_complete(task):
if tracker.is_exhausted():
await notify_human(
f"Agent budget exhausted. "
f"Consumed: {tracker.summary()}. "
f"Task status: {agent.status(task)}. "
f"Approve budget extension or terminate."
)
return AgentResult.BUDGET_EXHAUSTED
action = await agent.plan_next_action(task)
tracker.pre_check(action) # Raises if action would exceed budget
result = await agent.execute(action)
tracker.record(action, result)
Budget envelopes prevent resource exhaustion failures entirely. The agent cannot spend more than the envelope allows. When the envelope is exhausted, the agent stops and a human decides whether to extend the budget or adjust the approach. In practice, legitimate tasks rarely exhaust their budgets (we set them at 3-5x the expected resource consumption for the task type). The envelope catches runaway loops and excessive branching before they become expensive. Since implementing budget envelopes, we have had zero resource exhaustion incidents—down from 4-5 per quarter.
Layer 3: Action Journaling
Every action the agent takes is recorded in an append-only journal before execution. The journal captures the action, the agent’s reasoning for taking it, the expected outcome, and the actual outcome. The journal is stored in a database with write-ahead logging—even if the agent crashes mid-action, the journal records what was attempted:
journal_entry = {
"timestamp": "2025-10-03T14:23:17Z",
"agent_id": "email-campaign-agent-v3",
"task_id": "campaign-2025-q4-launch",
"step_number": 7,
"action": "send_email",
"parameters": {
"template": "q4-launch-announcement",
"recipients": "segment:active-customers-test",
"recipient_count": 487
},
"reasoning": "Task specifies sending to active customer segment. "
"Resolved segment 'active-customers-test' with 487 members.",
"expected_outcome": "487 emails queued for delivery",
"actual_outcome": null, // Filled after execution
"reversible": false,
"approval_required": true, // Triggered by recipient_count > 10
"approved_by": null, // Pending human approval
"budget_consumed": {"emails_sent": 487, "cost_usd": 0.49}
}
The journal serves three purposes: real-time monitoring (humans can watch the journal to see what the agent is doing, with a live dashboard that updates every second), post-incident forensics (when something goes wrong, the journal shows exactly what happened and why, including the agent’s reasoning at each step), and training data (the journal captures the agent’s reasoning, which helps us improve prompts and constraints over time by identifying patterns in incorrect reasoning).
Layer 4: Reversibility Classification
Every action is classified as reversible or irreversible at the type level, not the instance level. This classification is defined in the action allowlist, not determined by the agent at runtime. Reversible actions (creating a draft email, adding a row to a staging table, generating a report file) can be undone automatically if the task is rolled back. Irreversible actions (sending an email, deleting a production record, making a payment, posting to social media) cannot be undone and receive additional scrutiny:
- Irreversible actions always require human approval if they affect more than a threshold number of records or recipients
- Irreversible actions are executed with a configurable delay (default: 30 seconds), during which a human can cancel them via a dashboard notification or Slack message
- The agent’s stated confidence for irreversible actions must be above 0.95, or the action is escalated for human review regardless of other thresholds
- Irreversible actions are logged with full parameter details, not just action names, so that post-incident analysis can determine exactly what was done
Layer 5: Circuit Breakers
Global circuit breakers monitor aggregate agent behavior across all running agents and trip when anomalous patterns emerge. Individual agent safeguards catch problems within a single agent run. Circuit breakers catch systemic problems that emerge from the interaction of multiple agents:
- If total email volume across all agents exceeds 500/hour, pause all email-sending agents and alert the operations team
- If total database writes across all agents exceed 1,000/hour, pause all write operations and alert
- If any single agent has been running for more than 10 minutes without completing a task, pause and alert
- If error rate across all agents exceeds 20% in a 5-minute window, pause all agents and alert
- If total LLM API spend across all agents exceeds $50/hour, pause all agents and alert
Circuit breakers are the last line of defense. They catch systemic failures that individual agent safeguards might miss—for example, ten agents each sending 49 emails (under the per-agent limit of 50) but collectively sending 490 emails in an hour, which might indicate a misconfigured campaign or a prompt injection attack.
The Human-in-the-Loop Spectrum
A common objection to agent safety measures is that they slow down the agent and reduce the value of automation. This objection confuses two things: the speed of individual actions and the throughput of the system. An agent that executes 100 actions per hour with 0 incidents produces more value than an agent that executes 200 actions per hour with 2 incidents, because incident remediation consumes human time that exceeds the time saved by the faster execution.
We use a four-level graduated autonomy model:
- Autonomous: Low-risk, reversible, high-confidence actions execute without human involvement. Reading data from an API, generating a draft document, updating a non-critical status field, calculating a metric.
- Notify: Medium-risk actions execute automatically but send a notification to a human who can review the action within a defined window. Sending an email to fewer than 10 recipients, creating records in a staging table, updating a content field.
- Approve: High-risk actions queue for human approval before execution. The agent pauses on that task and works on other tasks while waiting. Sending emails to more than 10 recipients, modifying production database records, making external API calls with side effects.
- Prohibited: Actions that the agent is never allowed to take, regardless of context or instruction. Deleting production data, modifying user permissions, making financial transactions above a defined threshold, accessing credentials or secrets.
The initial classification is conservative—most actions start at “approve” level. As the agent demonstrates reliability over time, classifications can be relaxed. We track reliability as the ratio of approved-to-rejected actions (target: >98% approval rate over 30 days) and the incidence of post-action corrections (target: <1% of actions require correction). Agents that meet both targets for 30 consecutive days are eligible for classification relaxation on specific action types.
Practical Recommendations
If you are building or deploying AI agents that take real-world actions, implement these safeguards before your first production deployment, not after your first incident. The cost of implementing them proactively is 2-3 days of engineering time. The cost of implementing them reactively—after an incident—includes the engineering time plus the incident remediation time, plus the trust damage with your client, plus the opportunity cost of pausing agent deployments while you build the safety infrastructure you should have built from the start.
- Define the action space explicitly. If it is not on the allowlist, the agent cannot do it. Implicit permissions are the most common source of scope creep failures.
- Set budget envelopes for every agent run. An agent without a budget is an agent that can consume unlimited resources. Start with conservative budgets (3-5x expected consumption) and adjust based on observed patterns.
- Classify every action as reversible or irreversible. Irreversible actions need more oversight. This classification should be reviewed by someone who understands the downstream consequences, not just the immediate effect.
- Log everything. The action journal is your forensic toolkit when something goes wrong. Without it, you are debugging agent behavior by guessing what the agent did and why. With it, you have a complete record of every decision the agent made and the reasoning behind it.
- Implement circuit breakers. Individual agent safeguards catch individual failures. Circuit breakers catch systemic failures. You need both layers because agents do not exist in isolation—they share resources, APIs, and rate limits.
- Start with high human oversight and reduce it over time. It is much easier to relax safety constraints on a reliable agent than to add safety constraints to a fast-but-unreliable one. Progressive trust is safer than assumed trust.
Agent safety is not a feature you add to an agent system. It is a property of the system’s architecture. Systems designed with safety as a first-class concern are safer, more debuggable, and—counterintuitively—faster to develop, because the safety infrastructure (action journals, budget tracking, approval workflows) also serves as debugging and monitoring infrastructure that you would need to build anyway. The investment in safety is an investment in operational maturity, and it pays for itself long before the first incident it prevents.
Testing Agent Safety: The Adversarial Approach
Safety layers only work if they are tested. We test our agent safety infrastructure with the same rigor we apply to any other critical system component, using a three-part testing strategy:
Unit tests for safety primitives. Every safety layer has its own test suite that verifies it works in isolation. The budget tracker correctly blocks actions that exceed the budget. The allowlist correctly rejects actions not on the list. The journal correctly records every action with all required fields. The circuit breaker correctly trips at the configured thresholds. These tests run on every commit and must pass before any deployment.
Integration tests with adversarial prompts. We maintain a suite of adversarial task descriptions designed to trigger each failure category. “Send the campaign to all customers” (scope creep test). “Delete the conflicting records and re-sync” (cascading escalation test). “Configure the firewall with these rules: [ambiguous rules]” (confident hallucination test). “Research competitor pricing thoroughly—leave no stone unturned” (resource exhaustion test). These tests verify that the safety layers catch the failure before it reaches production, even when the agent’s planning module generates a plausible-but-dangerous action plan.
Chaos testing for circuit breakers. We periodically run multiple agents simultaneously with deliberately misconfigured tasks to verify that circuit breakers trip correctly under realistic conditions. We simulate scenarios where 10 agents simultaneously attempt to send emails, 5 agents simultaneously attempt heavy database writes, and 3 agents enter infinite loops. The circuit breakers must trip within their configured time windows, and the alerting system must notify the operations team within 60 seconds. We run these chaos tests monthly in a staging environment that mirrors production.
The adversarial testing approach has identified 7 safety gaps that our initial implementation missed—edge cases where the allowlist constraints were not specific enough, where budget tracking did not account for a particular action type’s cost, or where the circuit breaker thresholds were set too high to catch a realistic failure scenario. Each gap was fixed and added to the regression suite. The testing investment is approximately 2 days per quarter, which is trivial compared to the cost of the incidents these tests prevent.