Building AI Agents That Actually Work

The AI agent hype cycle is in full swing. Every week a new framework promises autonomous agents that can browse the web, write code, manage your calendar, and negotiate your rent. The demos are impressive. The production reality is different. Most agent systems are fragile, expensive, unpredictable, and fail in ways that are difficult to diagnose.

Article Overview

01 What an Agent Actually Is
02 Tool Design Is the Bottleneck
03 The Reliability Problem
04 Safety and Authorization
05 Observability and Debugging
06 When Not to Use Agents

HARBOR SOFTWARE · Engineering Insights

We have built agent systems at Harbor Software that actually ship to customers. They are not as flashy as the demos, but they work reliably at scale. Here is what we have learned about building agents that survive contact with real users and real data.

What an Agent Actually Is

Strip away the marketing and an AI agent is a loop: observe the environment, decide what action to take, execute the action, observe the result, repeat until the task is complete or a termination condition is met. The LLM is the decision-making component. Everything else is conventional software engineering.

async function agentLoop(task, tools, maxSteps = 10) {
  const messages = [{ role: 'system', content: SYSTEM_PROMPT }];
  messages.push({ role: 'user', content: task });

  for (let step = 0; step < maxSteps; step++) {
    const response = await llm.chat({
      messages,
      tools: tools.map(t => t.schema),
      tool_choice: 'auto'
    });

    messages.push(response.message);

    if (response.finish_reason === 'stop') {
      return response.message.content; // Agent decided it's done
    }

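    // Branch on the presence of tool calls rather than on finish_reason:
    // the reported value differs between providers ('tool_calls', 'tool_use', ...).
    if (response.message.tool_calls?.length) {
      for (const toolCall of response.message.tool_calls) {
        const tool = tools.find(t => t.name === toolCall.function.name);
        let result;
        try {
          if (!tool) throw new Error(`Unknown tool: ${toolCall.function.name}`);
          result = await tool.execute(JSON.parse(toolCall.function.arguments));
        } catch (err) {
          // Report failures (unknown tool, malformed JSON, tool error) back to
          // the model instead of crashing the loop, so it can self-correct
          result = { error: err.message };
        }
        messages.push({
          role: 'tool',
          tool_call_id: toolCall.id,
          content: JSON.stringify(result)
        });
      }
    }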
  }

  return 'Max steps reached without completion';
}

That is the entire core loop. You could write it in 40 lines of code. Everything else in agent engineering is about making this loop reliable, efficient, and safe. The frameworks add thousands of lines of abstraction on top of these 40 lines, and most of that abstraction creates more problems than it solves.

Tool Design Is the Bottleneck

The quality of an agent system is determined almost entirely by the quality of its tools. A brilliant LLM with poorly designed tools will fail. A mediocre LLM with well-designed tools will succeed more often than you expect. We have spent more time designing tools than designing prompts, and that allocation of effort has consistently paid off.

Atomic and Composable

Each tool should do exactly one thing. Do not create a tool called manage_database that handles queries, inserts, updates, and schema changes. Instead, create query_database, insert_record, and update_record. The LLM is better at selecting from a menu of specific operations than at figuring out which sub-operation to invoke within a complex tool. We learned this the hard way: a combined tool had a 62% success rate, which jumped to 89% after we split it into focused, single-purpose tools.
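The split described above might look like the following sketch. It assumes an in-memory Map in place of a real database; the `defineTool` helper, the `users` store, and the `focusedTools` array are hypothetical names introduced for illustration, and only the three tool names come from the text.

```javascript
// Hypothetical in-memory table standing in for a real database.
const users = new Map();

// Illustrative helper: wraps a handler in the { name, schema, execute } shape
// used by the other tool examples in this article.
function defineTool(name, description, properties, required, execute) {
  return {
    name,
    schema: {
      type: 'function',
      function: { name, description, parameters: { type: 'object', properties, required } }
    },
    execute
  };
}

const idParam = { id: { type: 'string', description: 'User id' } };
const recordParams = { ...idParam, email: { type: 'string', description: 'Email address' } };

// One tool per operation, each with an error message that points to the right sibling.
const focusedTools = [
  defineTool('query_database', 'Read one user by id. Read-only.', idParam, ['id'],
    async ({ id }) => users.get(id) ?? { error: `No user with id '${id}'.` }),
  defineTool('insert_record', 'Create a new user. Fails if the id already exists.', recordParams, ['id', 'email'],
    async ({ id, email }) => {
      if (users.has(id)) return { error: `User '${id}' already exists. Use update_record to change it.` };
      users.set(id, { id, email });
      return { inserted: id };
    }),
  defineTool('update_record', 'Change an existing user. Fails if the id is unknown.', recordParams, ['id', 'email'],
    async ({ id, email }) => {
      if (!users.has(id)) return { error: `No user with id '${id}'. Use insert_record to create it.` };
      users.set(id, { id, email });
      return { updated: id };
    })
];
```

Because each tool has a single job, its description and its failure messages can be specific, which is harder to achieve inside one multiplexed manage_database tool.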

Rich Error Messages

When a tool fails, the error message is the agent’s only signal about what went wrong. Generic errors like “operation failed” give the LLM nothing to work with. Specific errors enable self-correction:

// Bad: The LLM has no idea what to do next
{ "error": "Database error" }

// Good: The LLM can correct its approach
{
  "error": "Column 'user_name' does not exist in table 'users'. Available columns: id, username, email, created_at. Did you mean 'username'?"
}

// Even better: Include actionable suggestions
{
  "error": "Column 'user_name' does not exist in table 'users'.",
  "available_columns": ["id", "username", "email", "created_at"],
  "suggestion": "Use 'username' instead of 'user_name'",
  "docs_url": "https://internal.docs/schema/users"
}

In our production agents, tool error messages include: what went wrong, why it went wrong, and what the agent could try instead. This single practice reduced our agent failure rate by 40%. The model is remarkably good at self-correcting when given actionable error information.
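One way such an error could be produced is sketched below. This is an illustration under assumptions, not our production code: the `unknownColumnError` helper is a hypothetical name, the suggestion is chosen by plain Levenshtein edit distance, and the distance threshold of 2 is arbitrary.

```javascript
// Levenshtein edit distance, computed with a single rolling row.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value of the cell up and to the left
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        above + 1,                                  // deletion
        row[j - 1] + 1,                             // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution or match
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Hypothetical helper: builds a structured error the agent can act on.
function unknownColumnError(table, column, availableColumns) {
  const ranked = availableColumns
    .map(name => ({ name, distance: editDistance(column, name) }))
    .sort((x, y) => x.distance - y.distance);
  const result = {
    error: `Column '${column}' does not exist in table '${table}'.`,
    available_columns: availableColumns
  };
  // Only suggest a name that is plausibly a typo (threshold is an assumption).
  if (ranked.length > 0 && ranked[0].distance <= 2) {
    result.suggestion = `Use '${ranked[0].name}' instead of '${column}'`;
  }
  return result;
}

console.log(unknownColumnError('users', 'user_name', ['id', 'username', 'email', 'created_at']));
```

The suggestion is omitted when nothing is close, so the agent is not steered toward an unrelated column.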

Bounded Output Size

Tools that return unbounded data will blow up your context window. A search_documents tool that returns 50 full documents is useless because it floods the context with irrelevant information and pushes earlier conversation turns out of the effective context window. Limit tool outputs to what the agent actually needs:

const searchDocuments = {
  name: 'search_documents',
  schema: {
    type: 'function',
    function: {
      name: 'search_documents',
      description: 'Search for documents. Returns up to `limit` results (default 5), each with a title and a snippet.',
      parameters: {
        type: 'object',
        properties: {
          query: { type: 'string', description: 'Search query' },
          limit: { type: 'number', description: 'Max results (1-10, default 5)' }
        },
        required: ['query']
      }
    }
  },
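  execute: async ({ query, limit = 5 }) => {
    // Clamp to the documented 1-10 range; fall back to 5 if limit is not a usable number
    const boundedLimit = Math.max(1, Math.min(Math.floor(Number(limit)) || 5, 10));
    const results = await searchIndex.query(query, boundedLimit);
    return results.map(r => ({
      id: r.id,
      title: r.title,
      // Append an ellipsis only when the content was actually truncated
      snippet: r.content.length > 200 ? r.content.substring(0, 200) + '...' : r.content,
      relevance_score: r.score
    }));
  }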
};

If the agent needs the full content of a specific document, provide a separate get_document tool that retrieves one document by ID. This two-step pattern (search for candidates, then retrieve specific items) is far more token-efficient than returning full documents from every search.
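A companion get_document tool could be sketched as follows. The `documentStore` Map and the `MAX_CONTENT_CHARS` cap are assumptions made for the example; a real implementation would read from the same index that search_documents queries.

```javascript
// Hypothetical in-memory store standing in for a real document index.
const documentStore = new Map([
  ['doc-1', { id: 'doc-1', title: 'Refund policy', content: 'Refunds are issued within 14 days. '.repeat(400) }]
]);

// Assumed cap so that even a single document cannot flood the context window.
const MAX_CONTENT_CHARS = 4000;

const getDocument = {
  name: 'get_document',
  schema: {
    type: 'function',
    function: {
      name: 'get_document',
      description: 'Fetch one document by id. Use ids returned by search_documents. Content is capped at 4000 characters.',
      parameters: {
        type: 'object',
        properties: { id: { type: 'string', description: 'Document id from search_documents' } },
        required: ['id']
      }
    }
  },
  execute: async ({ id }) => {
    const doc = documentStore.get(id);
    if (!doc) {
      return { error: `No document with id '${id}'. Call search_documents first to get valid ids.` };
    }
    const truncated = doc.content.length > MAX_CONTENT_CHARS;
    return {
      id: doc.id,
      title: doc.title,
      content: truncated ? doc.content.slice(0, MAX_CONTENT_CHARS) : doc.content,
      truncated,                       // tells the agent that more content exists
      total_chars: doc.content.length
    };
  }
};
```

Returning `truncated` and `total_chars` lets the agent know when it has seen only part of a document, rather than silently reasoning over a cut-off text.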

Descriptive Tool Schemas

The tool description is a prompt to the LLM. Write it as carefully as you would write a system prompt. Include: what the tool does, when to use it, what the parameters mean, what the output looks like, and common pitfalls. A well-documented tool schema eliminates entire categories of agent errors.

{
  name: 'create_calendar_event',
  description: 'Creates a new calendar event. Use this when the user wants to schedule a meeting or block time. The start_time and end_time must be ISO 8601 format (e.g., 2023-08-18T14:00:00Z). Duration must be between 15 minutes and 8 hours. Returns the event ID and a confirmation URL.',
  parameters: {
    type: 'object',
    properties: {
      title: { type: 'string', description: 'Event title, 1-100 characters' },
      start_time: { type: 'string', description: 'ISO 8601 datetime, must be in the future' },
      end_time: { type: 'string', description: 'ISO 8601 datetime, must be after start_time' },
      attendees: { type: 'array', items: { type: 'string' }, description: 'Email addresses of attendees, max 20' }
    },
    required: ['title', 'start_time', 'end_time']
  }
}
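The constraints stated in that description can also be enforced in code, so that a violation comes back as a specific error rather than a generic failure. The sketch below is one possible validator; the function name `validateCalendarEvent` and the error wording are assumptions, and the date parsing is deliberately simple.

```javascript
const MIN_DURATION_MS = 15 * 60 * 1000;      // 15 minutes
const MAX_DURATION_MS = 8 * 60 * 60 * 1000;  // 8 hours

// Hypothetical validator for create_calendar_event arguments.
// `now` is injectable so the check is deterministic in tests.
function validateCalendarEvent(args, now = new Date()) {
  const { title, start_time, end_time, attendees = [] } = args;
  const errors = [];

  if (typeof title !== 'string' || title.length < 1 || title.length > 100) {
    errors.push('title must be a string of 1-100 characters');
  }

  // Note: Date parsing is lenient and accepts some non-ISO strings; a strict
  // ISO 8601 check would need a dedicated parser or regular expression.
  const start = new Date(start_time);
  const end = new Date(end_time);
  if (Number.isNaN(start.getTime()) || Number.isNaN(end.getTime())) {
    errors.push('start_time and end_time must be ISO 8601 datetimes, e.g. 2023-08-18T14:00:00Z');
  } else {
    if (start <= now) errors.push('start_time must be in the future');
    if (end <= start) {
      errors.push('end_time must be after start_time');
    } else {
      const durationMs = end - start;
      if (durationMs < MIN_DURATION_MS || durationMs > MAX_DURATION_MS) {
        const minutes = Math.round(durationMs / 60000);
        errors.push(`duration is ${minutes} minutes; it must be between 15 minutes and 8 hours`);
      }
    }
  }

  if (!Array.isArray(attendees) || attendees.length > 20) {
    errors.push('attendees must be an array of at most 20 email addresses');
  }

  return errors.length === 0 ? { valid: true } : { valid: false, errors };
}
```

Running the validator before the real API call means every rule in the description has a matching, actionable error message when the model gets it wrong.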

The Reliability Problem

Here is the uncomfortable truth about agent reliability: if each step in your agent loop has a 95% success rate, steps fail independently of one another, and your task requires 5 steps, your end-to-end success rate is 0.95^5 ≈ 77%. At 10 steps, it drops to 60%. At 20 steps, it is 36%. This is not a theoretical concern. It is the defining constraint of agent architecture.
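The arithmetic is easy to reproduce. The sketch below assumes that every step succeeds with the same probability and that failures are independent, which is a simplification; the function names are illustrative.

```javascript
// End-to-end success probability for a chain of independent steps.
function endToEndSuccess(perStepSuccess, steps) {
  return perStepSuccess ** steps;
}

// Largest step count that still meets a target end-to-end success rate.
// Assumes 0 < perStepSuccess < 1 and 0 < target < 1.
function maxStepsForTarget(perStepSuccess, target) {
  return Math.floor(Math.log(target) / Math.log(perStepSuccess));
}

for (const steps of [5, 10, 20]) {
  const percent = Math.round(endToEndSuccess(0.95, steps) * 100);
  console.log(`${steps} steps -> ${percent}%`); // 77%, 60%, 36%
}

// With 95% per-step reliability, a 90% end-to-end target leaves room for 2 steps.
console.log(maxStepsForTarget(0.95, 0.9));
```

Read in the other direction, the same formula gives a step budget: at 95% per step, a task that must succeed 90% of the time cannot afford more than two steps.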

This exponential reliability decay is the fundamental challenge of agent systems. Every approach to building better agents is ultimately an approach to managing this decay. Here are the strategies that work in practice:

  • Minimize the number of steps. If you can accomplish a task in 3 tool calls instead of 7, do it. More capable tools (that still follow the atomic principle) reduce step count. A tool that returns structured data saves the agent from needing to parse unstructured output in a subsequent step.
  • Add validation after critical steps. After the agent writes data, have it read the data back and verify it matches expectations. This adds a step but catches errors before they compound into multi-step failures.
  • Implement checkpointing. Save the agent’s state after each successful step. If a later step fails, you can resume from the last checkpoint instead of starting over. This is especially important for long-running tasks where restarting from scratch wastes significant time and tokens.
  • Use structured planning. Before executing, have the agent output a plan. Review the plan (programmatically or with a human) before allowing execution. This catches obviously wrong approaches before they waste tokens and time.

// Planning step before execution
const planningPrompt = `
Task: ${task}

Before taking any actions, create a step-by-step plan.
For each step, specify:
1. The tool you will use
2. Why this step is necessary
3. What you expect the result to be
4. How you will verify success

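Output your plan as a JSON object of the form
{"steps": [{"tool": "...", "reason": "...", "expected": "...", "verify": "..."}]}.
`;

const plan = await llm.chat({
  messages: [{ role: 'user', content: planningPrompt }],
  response_format: { type: 'json_object' }
});
const parsedPlan = JSON.parse(plan.message.content);

// Validate plan before execution
if (!Array.isArray(parsedPlan.steps)) {
  return { error: 'Plan is malformed: expected a "steps" array', plan: parsedPlan };
}
if (parsedPlan.steps.length > MAX_STEPS) {
  return { error: 'Plan too complex, requires human review', plan: parsedPlan };
}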
if (parsedPlan.steps.some(step => !allowedTools.includes(step.tool))) {
  return { error: 'Plan uses unauthorized tools', plan: parsedPlan };
}

// Execute the validated plan
for (const step of parsedPlan.steps) {
  const result = await executeStep(step);
  if (!result.success) {
    // Replan from the point of failure instead of aborting
    return await replanFromFailure(parsedPlan, step, result.error);
  }
}
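The checkpointing strategy from the list above can be sketched in a similar way. This version assumes an in-memory Map as the checkpoint store and an `executeStep` callback that returns `{ success, output, error }`; both are illustrative stand-ins, and a real system would persist checkpoints durably.

```javascript
// In-memory checkpoint store (assumption: replace with a database in practice).
const checkpoints = new Map();

async function runWithCheckpoints(runId, steps, executeStep) {
  // Resume from the saved state if this run was attempted before.
  const saved = checkpoints.get(runId);
  const state = saved
    ? { nextStep: saved.nextStep, results: [...saved.results] }
    : { nextStep: 0, results: [] };

  for (let i = state.nextStep; i < steps.length; i++) {
    const result = await executeStep(steps[i], state.results);
    if (!result.success) {
      // Leave the last checkpoint in place so a retry skips completed steps.
      return { done: false, failedAt: i, error: result.error };
    }
    state.results.push(result.output);
    state.nextStep = i + 1;
    checkpoints.set(runId, { nextStep: state.nextStep, results: [...state.results] });
  }

  checkpoints.delete(runId); // Run finished; the checkpoint is no longer needed.
  return { done: true, results: state.results };
}
```

Calling `runWithCheckpoints` again with the same `runId` after a failure re-executes only the failed step and those after it. That is safe only if completed steps do not need to be repeated, so steps with side effects should be idempotent or recorded as done.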

Safety and Authorization

An agent that can take actions in the real world needs guardrails. The LLM does not understand consequences the way humans do. It will happily delete a production database if you give it a tool that can do that and a prompt that suggests it should. Safety is not a nice-to-have; it is a prerequisite for shipping agent systems to real users.

Our safety model has three layers:

  1. Tool-level permissions. Each tool declares whether it is read-only or has side effects. Tools with side effects require explicit authorization. The permission model is defined at the tool level, not in the prompt, because prompts can be bypassed through clever phrasing but code permissions cannot.
  2. Action-level approval. For high-impact operations (deleting data, sending emails, making payments), the agent must request human approval before executing. The loop pauses and waits for confirmation through your application’s notification system.
  3. Budget limits. Every agent execution has a token budget, a time budget, and a cost budget. If any budget is exceeded, the agent stops and reports what it has accomplished so far. This prevents runaway executions where the agent enters an infinite retry loop.

const toolPermissions = {
  search_documents: { sideEffects: false, requiresApproval: false, costTier: 'low' },
  send_email: { sideEffects: true, requiresApproval: true, costTier: 'medium' },
  update_record: { sideEffects: true, requiresApproval: false, costTier: 'low' },
  delete_record: { sideEffects: true, requiresApproval: true, costTier: 'high' },
  execute_payment: { sideEffects: true, requiresApproval: true, costTier: 'critical' },
};

async function executeWithSafety(toolCall, budgets, context) {
  const perms = toolPermissions[toolCall.function.name];
  if (!perms) {
    return { error: `Unknown tool: ${toolCall.function.name}` };
  }

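  // Check budgets: tokens, time, and cost
  if (budgets.tokensUsed > budgets.tokenLimit) {
    return { error: 'Token budget exceeded', partial: context.completedSteps };
  }
  if (Date.now() - context.startTime > budgets.timeLimit) {
    return { error: 'Time budget exceeded', partial: context.completedSteps };
  }
  // costUsed / costLimit mirror the token fields (field names assumed here)
  if (budgets.costUsed > budgets.costLimit) {
    return { error: 'Cost budget exceeded', partial: context.completedSteps };
  }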

  // Request approval for high-impact actions
  if (perms.requiresApproval) {
    const approved = await requestHumanApproval({
      action: toolCall.function.name,
      arguments: toolCall.function.arguments,
      context: context.taskDescription,
      timeout: 300000 // 5 minute approval window
    });
    if (!approved) return { error: 'Action rejected by human reviewer' };
  }

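  // Execute with timeout. Clear the timer afterwards so it cannot linger, and
  // return failures as error objects, consistent with the checks above.
  let timer;
  try {
    return await Promise.race([
      tools[toolCall.function.name].execute(JSON.parse(toolCall.function.arguments)),
      new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('Tool execution timeout')), 30000);
      })
    ]);
  } catch (err) {
    return { error: err.message };
  } finally {
    clearTimeout(timer);
  }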
}
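The budget check above reads counters that something else must keep up to date. A minimal sketch of that bookkeeping follows; the `usage` shape (`promptTokens`, `completionTokens`), the per-1K pricing object, and the `costUsed` and `costLimit` fields are assumptions for the example rather than any specific provider's API.

```javascript
function createBudgets({ tokenLimit, timeLimit, costLimit }) {
  return { tokenLimit, timeLimit, costLimit, tokensUsed: 0, costUsed: 0 };
}

// Call after every LLM response to accumulate tokens and estimated cost.
function recordUsage(budgets, usage, pricing) {
  budgets.tokensUsed += usage.promptTokens + usage.completionTokens;
  budgets.costUsed +=
    (usage.promptTokens / 1000) * pricing.inputPer1k +
    (usage.completionTokens / 1000) * pricing.outputPer1k;
}

// Returns the name of the first exceeded budget, or null if all are within limits.
// `now` is injectable so the time check can be tested deterministically.
function exceededBudget(budgets, startTime, now = Date.now()) {
  if (budgets.tokensUsed > budgets.tokenLimit) return 'token';
  if (now - startTime > budgets.timeLimit) return 'time';
  if (budgets.costUsed > budgets.costLimit) return 'cost';
  return null;
}
```

Checking `exceededBudget` at the top of each loop iteration, before the next LLM call, stops a runaway agent one step earlier than checking only inside tool execution.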

Observability and Debugging

Agent systems are notoriously difficult to debug. The execution path is non-deterministic. The same input can produce different tool call sequences on different runs. Traditional debugging approaches (set a breakpoint, inspect state) do not work well when the control flow is decided by a language model.

What does work is comprehensive tracing. Every agent execution produces a trace that records:

  • The full message history at each step, including what was sent to the LLM and what came back
  • The tool selected and the arguments provided, including which tools were available
  • The tool’s response (both success and error cases), with timing information
  • Token usage at each step and cumulative total
  • Latency of each LLM call and each tool execution separately
  • The model’s reasoning (if using chain-of-thought prompting)
  • Any validation checks that were run and their results

We store these traces in a structured format and build dashboards on top. When an agent fails, the trace shows exactly where it went wrong: did it select the wrong tool? Did it provide incorrect arguments? Did the tool return an unexpected result? Did the model misinterpret the tool’s response?
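A trace recorder covering the fields listed above could be as small as the following sketch. The field names and the three functions are illustrative assumptions, not the schema we use in production.

```javascript
function createTrace(taskId, taskType) {
  return { taskId, taskType, startedAt: Date.now(), totalTokens: 0, steps: [], outcome: null };
}

// Record one iteration of the agent loop.
function recordStep(trace, step) {
  trace.totalTokens += step.tokens;
  trace.steps.push({
    index: trace.steps.length,
    // Deep-copy the messages so later turns do not rewrite this step's history.
    messages: JSON.parse(JSON.stringify(step.messages)),
    availableTools: step.availableTools,
    toolName: step.toolName,
    toolArguments: step.toolArguments,
    toolResult: step.toolResult,
    tokens: step.tokens,
    cumulativeTokens: trace.totalTokens,
    llmLatencyMs: step.llmLatencyMs,
    toolLatencyMs: step.toolLatencyMs
  });
}

// Close the trace and serialize it for storage.
function finishTrace(trace, outcome) {
  trace.outcome = outcome; // e.g. 'success', 'max_steps', 'budget_exceeded'
  trace.durationMs = Date.now() - trace.startedAt;
  return JSON.stringify(trace);
}
```

Snapshotting the message history at each step costs storage, but it is what makes it possible to see exactly what the model saw when it picked a wrong tool.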

The tooling for this is still immature in the ecosystem. LangSmith and Braintrust offer trace visualization, but we ended up building our own lightweight tracing because the existing tools added too much overhead for our latency requirements. A simple approach: write each trace as a JSON document to a time-series database and build a React frontend that renders the trace as a collapsible tree. Each node in the tree is a step in the agent loop, expandable to show the full message context, tool arguments, and responses.

We also maintain aggregate dashboards that show: success rate by task type, average step count by task type, most common failure points, cost per successful task completion, and latency distributions. These aggregate metrics tell you where to invest optimization effort. If 60% of failures happen at a specific tool, improve that tool before tuning prompts.
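These aggregates are straightforward to compute once traces are stored. The sketch below assumes each stored trace has been reduced to a summary record with `taskType`, `success`, `stepCount`, and `cost` fields; that record shape is hypothetical.

```javascript
// Group trace summaries by task type and compute per-type metrics.
function summarizeByTaskType(traces) {
  const totals = {};
  for (const t of traces) {
    if (!totals[t.taskType]) {
      totals[t.taskType] = { runs: 0, successes: 0, totalSteps: 0, totalCost: 0 };
    }
    const s = totals[t.taskType];
    s.runs += 1;
    if (t.success) s.successes += 1;
    s.totalSteps += t.stepCount;
    s.totalCost += t.cost;
  }

  const summary = {};
  for (const [taskType, s] of Object.entries(totals)) {
    summary[taskType] = {
      successRate: s.successes / s.runs,
      avgSteps: s.totalSteps / s.runs,
      // Total spend (including failed runs) divided by successful runs:
      // failures still cost money, so they raise the price of each success.
      costPerSuccess: s.successes > 0 ? s.totalCost / s.successes : null
    };
  }
  return summary;
}
```

Dividing total spend by successful runs, rather than by all runs, is a deliberate choice here: it makes an unreliable task type look as expensive as it really is.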

When Not to Use Agents

The hardest lesson we have learned is knowing when an agent is the wrong architecture. Agents are expensive, slow, and unreliable compared to deterministic code. If you can solve a problem with conventional software, you should. The agent architecture is a tool of last resort, not a default choice.

Use agents when:

  • The task requires genuine reasoning over unpredictable inputs that cannot be captured in rules
  • The action sequence cannot be predetermined because it depends on intermediate results
  • The cost of agent execution is justified by the value of automation (manual alternative costs more)
  • Partial success is acceptable (the agent may not complete 100% of tasks perfectly)
  • The task has natural human-in-the-loop checkpoints where errors can be caught

Do not use agents when:

  • The task can be expressed as a decision tree, state machine, or rule engine
  • Speed and cost are critical constraints (agents add seconds and dollars per execution)
  • 100% reliability is required (medical, financial, safety-critical systems)
  • The action sequence is predictable and can be hardcoded as a workflow
  • A simple LLM call with structured output would suffice (not everything needs a loop)

Most of the “agent” features we ship are actually hybrid systems. The agent handles the ambiguous reasoning parts where human judgment would otherwise be needed. Conventional code handles the predictable parts: validation, formatting, API calls with known parameters, data transformations. The boundaries between them are carefully defined, and we default to deterministic code whenever possible.
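A hybrid boundary of this kind can be sketched as a router that tries deterministic rules first and hands off to the agent only when none match. The two rules below (an order lookup and a password reset) and the returned shapes are invented for illustration.

```javascript
// Deterministic rules for requests whose handling is fully predictable.
const rules = [
  {
    name: 'order_status',
    pattern: /^where is order #?(\d+)\??$/i,
    handle: match => ({ action: 'lookup_order', orderId: match[1] })
  },
  {
    name: 'reset_password',
    pattern: /^reset (my )?password$/i,
    handle: () => ({ action: 'send_reset_link' })
  }
];

// Route a request: conventional code when a rule matches, the agent otherwise.
function route(request) {
  const text = request.trim();
  for (const rule of rules) {
    const match = text.match(rule.pattern);
    if (match) {
      return { handledBy: 'code', rule: rule.name, result: rule.handle(match) };
    }
  }
  // Ambiguous or open-ended input falls through to the agent loop.
  return { handledBy: 'agent', task: text };
}

console.log(route('Where is order #1234?'));
console.log(route('My package arrived damaged and I was charged twice'));
```

The ordering encodes the default: the agent is reached only after every deterministic rule has declined the request.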

Conclusion

Building AI agents that work in production comes down to rigorous engineering, not clever prompts or sophisticated frameworks. Design tools carefully with rich errors and bounded outputs. Manage the reliability decay through planning, validation, and checkpointing. Implement safety layers that cannot be bypassed through prompt manipulation. Build observability from day one so you can diagnose failures in a non-deterministic system.

The current generation of agent frameworks (LangChain, CrewAI, AutoGPT) is useful for prototyping but often adds complexity without adding reliability. For production systems, a well-engineered custom loop with carefully designed tools will outperform any framework.

One final lesson from our production experience: start simple. Your first agent should have 3-5 tools, a maximum of 5 steps, and a single well-defined task. Resist the urge to build a general-purpose agent that can do everything. General-purpose agents fail at everything because the decision space is too large for reliable tool selection. Narrow agents with focused tool sets succeed because the LLM can reason about a small, well-defined set of options.

As you gain confidence and collect production data, expand the agent’s capabilities incrementally. Add one tool at a time. Measure the impact on success rate and failure modes. Build a regression test suite from real production traces. This incremental approach is slower than building a grand unified agent, but it produces systems that actually work in production and that your team can maintain with confidence.
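A regression check built from recorded traces might start as simply as comparing tool-call sequences. The sketch below assumes each trace has a `steps` array whose entries carry a `tool` name; because agent runs are non-deterministic, exact sequence matching is probably too strict for many tasks and may need to be relaxed to comparing final outcomes.

```javascript
// Compare the tool sequence of a replayed run against a recorded production trace.
function compareToolSequence(recorded, replayed) {
  const expected = recorded.steps.map(s => s.tool);
  const actual = replayed.steps.map(s => s.tool);

  // Index of the first position where the sequences disagree, or -1 if the
  // recorded sequence is a prefix of the replayed one.
  const firstDiff = expected.findIndex((tool, i) => tool !== actual[i]);
  if (firstDiff === -1 && expected.length === actual.length) {
    return { pass: true };
  }
  return {
    pass: false,
    divergedAtStep: firstDiff === -1 ? expected.length : firstDiff,
    expected,
    actual
  };
}
```

Reporting the step at which a replay diverges, rather than a bare pass or fail, points directly at the tool or prompt change that altered the agent's behavior.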

The agent is just a loop. Everything around that loop (the tools, the safety model, the observability, the evaluation infrastructure) is where the real engineering lives. Invest there, and invest early.
