Designing APIs for AI Agents: New Patterns for a New Consumer
AI agents are a new category of API consumer. They are not humans clicking through a UI, and they are not scripts following a predetermined sequence of API calls. They are semi-autonomous programs that decide at runtime which endpoints to call, in what order, and with what parameters — based on a natural language goal and whatever context they have accumulated. This changes API design in ways that most teams have not yet grappled with. At Harbor Software, we have spent the past eight months adapting our APIs for agent consumption, and the patterns that work are surprisingly different from what works for human developers or traditional integrations.
Agents Read Your API Differently
When a human developer integrates with your API, they read your documentation, understand the data model, write integration code, test it, and deploy. The integration logic is static — it does the same thing every time. An AI agent, by contrast, receives your API schema (typically as an OpenAPI spec or a function-calling tool definition) and decides at runtime how to use it based on the user’s request.
This means your API’s discoverability and self-documentation are not nice-to-haves — they are functional requirements. An agent that cannot understand what an endpoint does from its name, description, and parameter definitions will either skip it or use it incorrectly. We audited our API through the lens of “can GPT-4 figure out what this does from the OpenAPI spec alone” and found that 40% of our endpoints had ambiguous descriptions that led to incorrect agent behavior.
The audit process was revealing. We took our OpenAPI spec, fed it to GPT-4 as tool definitions, and gave it 50 natural language tasks that should map to specific API calls. For each task, we recorded whether the agent selected the correct endpoint, used the correct parameters, and produced a correct result. The 40% failure rate broke down as follows: 18% of failures were caused by ambiguous operation names (getItems could mean products, projects, or list items), 12% were caused by missing parameter descriptions (the agent guessed parameter formats incorrectly), and 10% were caused by missing response descriptions (the agent could not determine whether the operation succeeded).
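The per-task scoring in an audit like this can be sketched as a small classifier that compares the tool call the agent actually made against the call the task should map to. This is an illustrative sketch, not our actual harness; ToolCall and classify are hypothetical names:

```typescript
// Hypothetical scoring helper for the audit: classify what an agent did
// with a task against the tool call it was expected to make.
type ToolCall = { name: string; args: Record<string, unknown> };

function classify(
  expected: ToolCall,
  actual: ToolCall | null
): 'correct' | 'wrong-endpoint' | 'wrong-parameters' | 'no-call' {
  if (!actual) return 'no-call'; // agent skipped the endpoint entirely
  if (actual.name !== expected.name) return 'wrong-endpoint';
  // Compare expected parameters by value; extra parameters are tolerated
  const paramsMatch = Object.entries(expected.args).every(
    ([key, value]) => JSON.stringify(actual.args[key]) === JSON.stringify(value)
  );
  return paramsMatch ? 'correct' : 'wrong-parameters';
}
```

Tallying these labels over the 50 tasks is what produced the failure breakdown above.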
// Before — ambiguous, works for humans who read the docs
{
  "operationId": "getItems",
  "summary": "Get items",
  "parameters": [
    { "name": "type", "in": "query", "schema": { "type": "string" } },
    { "name": "status", "in": "query", "schema": { "type": "string" } }
  ]
}

// After — explicit, works for agents that have no external context
{
  "operationId": "listProjectModels",
  "summary": "List all ML models belonging to a specific project. Returns model metadata including name, framework, version, deployment status, and creation date. Supports filtering by deployment status and model framework.",
  "parameters": [
    {
      "name": "status",
      "in": "query",
      "description": "Filter by deployment status. Use 'deployed' for models currently serving traffic, 'draft' for models not yet deployed, 'archived' for decommissioned models.",
      "schema": {
        "type": "string",
        "enum": ["deployed", "draft", "archived", "all"],
        "default": "all"
      }
    },
    {
      "name": "framework",
      "in": "query",
      "description": "Filter by ML framework. Common values: pytorch, tensorflow, sklearn, onnx.",
      "schema": { "type": "string" }
    }
  ]
}
The differences matter: explicit operation IDs that encode the resource and action, descriptions that explain what the endpoint returns (not just what it does), enum values with descriptions for each option, and default values so agents do not need to guess. We measured the impact: agent task completion rate on our API went from 61% to 84% after this rewrite, with zero changes to the underlying implementation. The remaining 16% of failures are mostly cases where the task requires multi-step reasoning that current agent frameworks handle poorly, not API design issues.
One additional pattern that helped: adding x-agent-hint extension fields to our OpenAPI spec. These are free-form strings that provide usage context that would not belong in a standard description but help agents make better decisions:
{
  "operationId": "deployModel",
  "x-agent-hint": "Call listProjectModels first to get the model ID. Deployment takes 30-120 seconds. Use getDeploymentStatus to poll for completion. Do not call this if the model status is already 'deployed'."
}
Idempotency Is Non-Negotiable
Agents retry. They retry because they are uncertain about whether a previous call succeeded. They retry because their context window rolled over and they lost track of what they already did. They retry because the orchestration framework (LangChain, CrewAI, AutoGen) has built-in retry logic that fires on ambiguous responses. If your API is not idempotent for write operations, agents will create duplicate resources, send duplicate payments, or trigger duplicate notifications.
We saw this in production within the first week of enabling agent access to our API. An agent was tasked with creating a project and deploying a model. It created the project successfully, but the model deployment timed out (the deployment was still running; the agent’s HTTP timeout was too short). The agent interpreted the timeout as a failure, retried the entire sequence, and created a duplicate project. The user ended up with two identical projects, each with a deployed model. This happened 23 times before we implemented idempotency.
We require an Idempotency-Key header on all POST endpoints. The agent (or its orchestration framework) generates a unique key for each logical operation, and our API deduplicates requests with the same key within a 24-hour window:
// Server-side idempotency implementation
import { Redis } from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);
const IDEMPOTENCY_TTL = 86400; // 24 hours, in seconds

async function handleIdempotentRequest(
  req: Request,
  handler: () => Promise<Response>
): Promise<Response> {
  const idempotencyKey = req.headers.get('Idempotency-Key');
  if (!idempotencyKey) {
    return new Response(
      JSON.stringify({
        error: {
          code: 'MISSING_IDEMPOTENCY_KEY',
          message: 'Idempotency-Key header is required for POST requests',
          retryable: false,
          suggestion: 'Generate a UUID v4 and pass it as the Idempotency-Key header'
        }
      }),
      { status: 400, headers: { 'Content-Type': 'application/json' } }
    );
  }

  const cacheKey = `idempotency:${req.url}:${idempotencyKey}`;

  // Replay the cached response if this key has already been processed
  const cached = await redis.get(cacheKey);
  if (cached) {
    const { statusCode, body, headers } = JSON.parse(cached);
    return new Response(body, {
      status: statusCode,
      headers: { ...headers, 'X-Idempotent-Replayed': 'true' }
    });
  }

  // Execute the actual handler
  const response = await handler();
  const responseBody = await response.text();

  // Cache the response. Skip 5xx errors so a transient server failure
  // is not replayed forever when the agent legitimately retries.
  if (response.status < 500) {
    await redis.setex(cacheKey, IDEMPOTENCY_TTL, JSON.stringify({
      statusCode: response.status,
      body: responseBody,
      headers: Object.fromEntries(response.headers.entries())
    }));
  }

  return new Response(responseBody, {
    status: response.status,
    headers: response.headers
  });
}
The X-Idempotent-Replayed: true header tells the agent that this response is a replay of a previous identical request, not a new execution. Well-designed agent frameworks use this signal to avoid double-counting results. Since implementing idempotency, duplicate resource creation incidents from agents dropped from 23 per week to zero.
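On the client side, the key must be stable across retries of the same logical operation, not regenerated on every HTTP attempt, or deduplication never triggers. A minimal sketch of how an agent framework might manage this (idempotencyKeyFor and the task/operation naming are illustrative, not part of our API):

```typescript
import { randomUUID } from 'node:crypto';

// One idempotency key per logical operation, reused across retries.
// Keyed by (task, operation) so a retry of the same step replays the
// cached server response instead of creating a duplicate resource.
const keyCache = new Map<string, string>();

function idempotencyKeyFor(taskId: string, operation: string): string {
  const logicalId = `${taskId}:${operation}`;
  let key = keyCache.get(logicalId);
  if (!key) {
    key = randomUUID(); // UUID v4, as the error suggestion above recommends
    keyCache.set(logicalId, key);
  }
  return key;
}
```

The cache needs to outlive individual HTTP attempts but not the agent run itself, which is why a per-run in-memory map is usually enough.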
Structured Errors That Agents Can Act On
Human developers read error messages and interpret them. Agents need machine-readable error structures that include enough information to decide what to do next: retry, try a different approach, or give up and ask the user for help.
The standard HTTP error response — a status code and a text message — is insufficient for agents. A 400 error with the message “Invalid request” tells an agent nothing about what to fix. A 429 error without a Retry-After header tells an agent to retry but not when. A 500 error does not tell an agent whether the operation partially completed.
Our error responses follow a structure that gives agents three pieces of information: what went wrong (machine-readable code), why it went wrong (human-readable message for logging and display), and what to do about it (actionable suggestions):
// Error response structure for agent consumption
{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. You have made 102 requests in the current 60-second window. The limit is 100 requests per minute.",
    "retryable": true,
    "retryAfterSeconds": 23,
    "suggestion": "Wait 23 seconds before retrying. Consider batching multiple operations into a single request using the /batch endpoint.",
    "documentation": "https://docs.harborsoftware.com/rate-limits"
  }
}

// Validation error with field-level detail and fix suggestions
{
  "error": {
    "code": "VALIDATION_FAILED",
    "message": "Request body contains invalid fields.",
    "retryable": false,
    "fields": [
      {
        "path": "config.maxTokens",
        "code": "OUT_OF_RANGE",
        "message": "maxTokens must be between 1 and 4096. Received: 10000.",
        "suggestion": "Set maxTokens to 4096 for the maximum allowed value.",
        "allowedRange": { "min": 1, "max": 4096 }
      },
      {
        "path": "config.model",
        "code": "INVALID_ENUM",
        "message": "Unknown model 'gpt5'. Must be one of: gpt-4o, gpt-4o-mini, claude-3-5-sonnet.",
        "suggestion": "Use 'gpt-4o' for the closest match to your request.",
        "allowedValues": ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]
      }
    ]
  }
}
The retryable boolean is the most impactful field. Before we added it, agents would retry non-retryable errors (like validation failures) up to their retry limit, wasting tokens and time. After adding it, pointless retry loops dropped by 89%. The suggestion field is consumed by agents that include error context in their next LLM call — it gives the model a concrete action to take instead of guessing. The allowedValues and allowedRange fields let agents self-correct without needing to make an additional API call to discover valid values.
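An agent-side handler built against this error structure reduces to a three-way decision. The following is a hypothetical sketch, not part of any framework; ApiError mirrors the response shape above, and Decision and decide are illustrative names:

```typescript
// Mirror of the error structure shown above
type ApiError = {
  code: string;
  message: string;
  retryable: boolean;
  retryAfterSeconds?: number;
  suggestion?: string;
};

type Decision =
  | { action: 'retry'; afterSeconds: number }
  | { action: 'self-correct'; hint: string }
  | { action: 'escalate'; reason: string };

function decide(error: ApiError): Decision {
  if (error.retryable) {
    // Retryable: wait the server-suggested interval, or a conservative default
    return { action: 'retry', afterSeconds: error.retryAfterSeconds ?? 5 };
  }
  if (error.suggestion) {
    // Not retryable but fixable: feed the suggestion into the next LLM call
    return { action: 'self-correct', hint: error.suggestion };
  }
  // Nothing actionable: surface the failure to the user
  return { action: 'escalate', reason: error.message };
}
```

The point of the structure is that this decision needs no LLM call at all; the framework can branch on retryable and suggestion directly and spend tokens only on the self-correct path.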
Batch Endpoints Reduce Token Costs
Every API call an agent makes costs tokens — tokens for the tool call definition, tokens for the response, and tokens for the agent’s reasoning about the response. An agent that needs to create 10 resources will make 10 API calls, consuming roughly 10x the tokens of a single batch call. At scale, this matters. We had a customer whose agent workflow created project configurations with 15 individual API calls. At $0.01 per 1K tokens with GPT-4 Turbo, each workflow execution cost $0.23 just in tool-calling tokens. After we added a batch endpoint, the same workflow cost $0.04.
// Batch endpoint — process multiple operations in one request
POST /api/v1/batch
Content-Type: application/json
Idempotency-Key: batch-abc123
{
  "operations": [
    {
      "id": "op1",
      "method": "POST",
      "path": "/api/v1/models",
      "body": { "name": "sentiment-v3", "framework": "pytorch" }
    },
    {
      "id": "op2",
      "method": "POST",
      "path": "/api/v1/models",
      "body": { "name": "ner-v2", "framework": "spacy" }
    },
    {
      "id": "op3",
      "method": "PUT",
      "path": "/api/v1/projects/123/config",
      "body": { "defaultModel": "sentiment-v3" },
      "dependsOn": ["op1"]
    }
  ]
}

// Response — individual results for each operation
{
  "results": [
    { "id": "op1", "status": 201, "body": { "id": "model_abc", "name": "sentiment-v3" } },
    { "id": "op2", "status": 201, "body": { "id": "model_def", "name": "ner-v2" } },
    { "id": "op3", "status": 200, "body": { "defaultModel": "sentiment-v3" } }
  ]
}
The dependsOn field is critical for agent workflows. It lets the agent express operation ordering within a batch without making sequential HTTP requests. Operation op3 will only execute after op1 succeeds, because it references a resource created by op1. If op1 fails, op3 is skipped and returned with a DEPENDENCY_FAILED status. This dependency model maps naturally to how agents think about multi-step tasks: “create these things, then configure that thing to use them.”
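Server-side, the dependency handling can be sketched as a pass over the operations that tracks which ones succeeded, assuming operations are listed in dependency order as in the example above. This is an illustrative sketch (runBatch and its types are hypothetical, and exec stands in for the real request dispatcher):

```typescript
type Op = { id: string; dependsOn?: string[] };
type OpResult = { id: string; status: number | 'DEPENDENCY_FAILED' };

// exec runs a single operation and returns its HTTP status code.
function runBatch(ops: Op[], exec: (op: Op) => number): OpResult[] {
  const succeeded = new Set<string>();
  const results: OpResult[] = [];
  for (const op of ops) {
    // Skip any operation whose dependency did not succeed
    const blocked = (op.dependsOn ?? []).some(dep => !succeeded.has(dep));
    if (blocked) {
      results.push({ id: op.id, status: 'DEPENDENCY_FAILED' });
      continue;
    }
    const status = exec(op);
    if (status >= 200 && status < 300) succeeded.add(op.id);
    results.push({ id: op.id, status });
  }
  return results;
}
```

A fuller implementation would topologically sort the operations and run independent ones concurrently, but the skip-on-failure semantics are the part agents depend on.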
Long-Running Operations Need Polling, Not Webhooks
Agents are ephemeral. They run in a function call, complete their task, and terminate. They cannot receive webhooks because they do not have a persistent HTTP endpoint. For long-running operations (model training, data exports, bulk operations), agents need a polling pattern with clear status transitions:
// Start a long-running operation
POST /api/v1/models/abc/deploy
// Response: operation handle with polling instructions
{
  "operationId": "op_xyz789",
  "status": "pending",
  "estimatedDurationSeconds": 120,
  "pollUrl": "/api/v1/operations/op_xyz789",
  "pollIntervalSeconds": 10
}

// Agent polls at suggested interval
GET /api/v1/operations/op_xyz789

// In-progress response
{
  "operationId": "op_xyz789",
  "status": "running",
  "progress": 0.65,
  "message": "Deploying model to 3 regions. 2 of 3 complete.",
  "startedAt": "2025-05-02T10:30:00Z",
  "estimatedCompletionAt": "2025-05-02T10:32:30Z"
}

// Completed response
{
  "operationId": "op_xyz789",
  "status": "completed",
  "result": {
    "modelId": "model_abc",
    "deploymentUrl": "https://inference.harborsoftware.com/models/abc",
    "regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  },
  "completedAt": "2025-05-02T10:32:15Z"
}
The pollIntervalSeconds field prevents agents from hammering the status endpoint. Without it, agents default to polling every 1-2 seconds, which wastes both API resources and agent tokens. The progress float (0.0 to 1.0) and estimatedCompletionAt timestamp give agents enough information to decide whether to wait or move on to other tasks while the operation completes.
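A client-side polling loop that honors pollIntervalSeconds and gives up at a deadline might look like the following sketch (pollUntilDone is an illustrative name; a real agent framework would also surface progress back to the model):

```typescript
type OperationStatus = {
  status: 'pending' | 'running' | 'completed' | 'failed';
  pollIntervalSeconds?: number;
  progress?: number;
  result?: unknown;
};

// poll fetches the operation's current state; deadlineMs bounds total wait.
async function pollUntilDone(
  poll: () => Promise<OperationStatus>,
  deadlineMs: number
): Promise<OperationStatus> {
  const start = Date.now();
  while (true) {
    const op = await poll();
    // Terminal states end the loop immediately
    if (op.status === 'completed' || op.status === 'failed') return op;
    if (Date.now() - start > deadlineMs) {
      throw new Error('Operation did not finish before the deadline');
    }
    // Respect the server-suggested interval instead of hammering the endpoint
    const waitMs = (op.pollIntervalSeconds ?? 10) * 1000;
    await new Promise(resolve => setTimeout(resolve, waitMs));
  }
}
```

The deadline matters because agents run inside bounded function calls; an operation that outlives the agent should be handed back to the user as a pollUrl rather than awaited forever.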
We also provide a cancel endpoint for long-running operations. Agents sometimes start operations that turn out to be unnecessary (the user clarifies their intent mid-conversation, or the agent realizes it started the wrong operation). Without a cancel endpoint, the operation runs to completion and consumes resources needlessly. Our cancel endpoint accepts the operation ID and returns immediately with the cancellation status:
POST /api/v1/operations/op_xyz789/cancel
{
  "operationId": "op_xyz789",
  "status": "cancelling",
  "message": "Cancellation requested. The operation will stop within 30 seconds."
}
Rate Limiting for Agents: Token-Based, Not Request-Based
Traditional API rate limiting counts requests per time window: “100 requests per minute.” This makes sense when requests have roughly uniform cost. For AI agent consumers, request cost varies wildly. An agent might make 50 simple GET requests in a minute (total cost: negligible) or 3 POST requests that each trigger expensive backend operations (total cost: significant). Request-based rate limiting either blocks cheap requests unnecessarily or allows expensive requests through unchecked.
We switched to a token-based rate limiting model where each API operation has a “cost” measured in tokens (not LLM tokens — rate limit tokens), and each consumer gets a token budget per time window. Simple read operations cost 1 token, writes cost 5 tokens, and expensive operations (model deployment, bulk export) cost 50 tokens. A consumer with a budget of 1,000 tokens per minute can make 1,000 reads, 200 writes, or 20 deployments per minute. This more accurately reflects the actual resource cost of different operations.
// Rate limit response headers for token-based limiting
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000 // Total token budget per minute
X-RateLimit-Remaining: 847 // Tokens remaining in current window
X-RateLimit-Reset: 1714636800 // Unix timestamp when budget resets
X-RateLimit-Cost: 5 // Tokens consumed by this request
// When exceeded:
HTTP/1.1 429 Too Many Requests
{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Token budget exceeded. 0 of 1000 tokens remaining.",
    "retryable": true,
    "retryAfterSeconds": 34,
    "costOfAttemptedRequest": 5,
    "suggestion": "Wait 34 seconds for budget reset, or reduce request frequency. GET operations cost 1 token; consider batching write operations."
  }
}
The X-RateLimit-Cost header tells agents how much each operation costs, which lets them budget their remaining tokens intelligently. A well-designed agent will check its remaining budget before making expensive operations and defer them if the budget is low. Without this header, agents have no way to predict whether their next request will succeed or trigger a rate limit error.
We also expose a GET /api/v1/rate-limit endpoint that returns the current rate limit state without consuming any tokens. Agents can check their budget at the start of a task sequence and plan their operations accordingly. This prevents the common pattern where an agent makes 10 successful requests, hits the rate limit on request 11 (which happens to be the most important request in the sequence), and has to report a partial failure to the user.
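The server side of cost-based limiting reduces to a fixed-window counter keyed by consumer, where each request consumes its operation's cost rather than a flat 1. A minimal in-memory sketch (CostBasedLimiter is an illustrative name; a production version needs shared state across API servers, e.g. in Redis like the idempotency cache):

```typescript
type Window = { start: number; used: number };

class CostBasedLimiter {
  private windows = new Map<string, Window>();

  constructor(private budget: number, private windowMs: number) {}

  // Returns the remaining budget after consuming `cost` tokens,
  // or null if the request would exceed the budget (i.e. respond 429).
  consume(consumerId: string, cost: number, now = Date.now()): number | null {
    let w = this.windows.get(consumerId);
    // Start a fresh window if none exists or the current one has expired
    if (!w || now - w.start >= this.windowMs) {
      w = { start: now, used: 0 };
      this.windows.set(consumerId, w);
    }
    if (w.used + cost > this.budget) return null;
    w.used += cost;
    return this.budget - w.used;
  }
}
```

The return value maps directly onto the headers above: the remaining budget becomes X-RateLimit-Remaining, the per-operation cost becomes X-RateLimit-Cost, and a null result triggers the 429 body with retryAfterSeconds computed from the window start.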
Versioning for Agents Is Stricter Than for Humans
When you change a REST API that humans consume, you can add a migration guide, send a deprecation email, and give developers 6 months to update. When agents consume your API, the “developer” is an LLM that was trained on a specific version of your schema. If you change field names, remove endpoints, or alter response shapes, every agent using the old schema breaks instantly with no migration path other than updating the tool definitions in every agent that uses your API.
Our versioning policy for agent-consumed APIs is stricter than our human API policy:
- Never remove fields from responses. Add new fields freely, but never remove existing ones. Agents that parse specific fields from responses will break if those fields disappear.
- Never change field types. If count was an integer, it stays an integer forever. Do not change it to a string, even if “it is technically backward compatible because JSON parsers coerce types.” Agent tool-calling frameworks validate types strictly.
- Deprecate with parallel endpoints. When you need to make breaking changes, create a new endpoint version (/v2/models) and keep the old one running for at least 12 months. Agents built with the old schema continue to work; new agents use the new schema.
- Include a schema version in the OpenAPI spec. This lets agent frameworks detect when the schema they were built with differs from the current live schema and prompt the user to update.
- Never change enum values. If a status field accepts “active”, “inactive”, and “pending”, adding “paused” is fine. Renaming “inactive” to “disabled” will break every agent that uses the old value. Add the new value as an alias and keep the old one working.
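The schema-version check can be sketched as a semver-style comparison an agent framework might run at startup, assuming the spec's info.version follows major.minor.patch (schemaDrift is a hypothetical name):

```typescript
// Compare the spec version the tool definitions were generated from
// against the live spec's version. A major-version difference means the
// tool definitions may no longer match the API and should be regenerated.
function schemaDrift(
  builtWith: string,
  live: string
): 'none' | 'minor' | 'breaking' {
  if (builtWith === live) return 'none';
  const [builtMajor] = builtWith.split('.');
  const [liveMajor] = live.split('.');
  return builtMajor === liveMajor ? 'minor' : 'breaking';
}
```

On 'breaking', the framework should stop and prompt the user to refresh the tool definitions rather than let the agent call endpoints that may have changed shape.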
AI agents are the fastest-growing category of API consumer, and they reward APIs that are self-documenting, idempotent, structured, and stable. The patterns that make APIs good for agents — clear descriptions, machine-readable errors, batch operations, polling for async work — also make APIs better for human developers. Designing for the agent consumer raises the bar for everyone.