
Observability for AI Applications: Beyond Traditional APM

Traditional application performance monitoring (APM) tools — Datadog, New Relic, Grafana — are designed for request-response systems where latency, error rates, and throughput tell you most of what you need to know. AI applications break this model. An LLM call that takes 3 seconds and returns a 200 status code might produce a hallucinated answer that sends a customer to the wrong page. A RAG pipeline might return results that are factually correct but irrelevant to the user’s question. An AI agent might complete its task but take 47 API calls and $0.85 in tokens to do what should have taken 3 calls and $0.02. Traditional APM sees all of these as successful requests. You need different instrumentation to catch them.

HARBOR SOFTWARE · Engineering Insights

The Four Pillars of AI Observability

After operating AI features in production for two years at Harbor Software, we have identified four observability dimensions that matter for AI applications beyond traditional APM:

1. Quality metrics. Is the AI output correct, relevant, and safe? This is the dimension that traditional APM completely ignores because it cannot evaluate whether a text response is “good.” You need automated quality evaluation that runs on every response (or a statistically significant sample) in production. Quality is the metric that determines whether your AI feature is helping or hurting your product.

2. Cost metrics. How much did this request cost in tokens, compute, and API calls? AI applications have variable per-request costs that can differ by 100x between simple and complex queries. A single runaway agent loop can cost more than your entire monthly API bill. Cost observability is not just finance’s concern — it is an operational concern because cost anomalies are often the first signal of a bug.

3. Trace topology. What was the chain of LLM calls, tool uses, retrieval steps, and decisions that produced this output? AI applications are not simple request-response — they are multi-step reasoning chains where each step’s output influences the next. When something goes wrong, you need to see the full chain to diagnose whether the problem was in retrieval, in the prompt, in the model’s reasoning, or in post-processing.

4. Drift detection. Are the model’s outputs changing over time relative to a baseline? Model providers update models without notice (OpenAI has updated GPT-4’s behavior multiple times), embeddings shift when you reindex your vector store, and retrieval quality changes as your content evolves. Without drift detection, quality degrades silently over weeks or months.
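To make the trace topology pillar concrete, here is a minimal sketch of summarizing one trace once its spans have been collected. It assumes each span carries a span ID, a parent span ID, and a cost in USD; the `SpanRecord` shape and the `summarizeTrace` name are hypothetical and chosen only for illustration.

```typescript
// Hypothetical span record; the field names are assumptions for this sketch.
interface SpanRecord {
  spanId: string;
  parentSpanId: string | null;
  name: string;
  costUsd: number;
}

interface TraceSummary {
  callCount: number;    // number of spans (LLM calls, tool uses, retrieval steps)
  totalCostUsd: number; // sum of per-span cost
  maxDepth: number;     // longest parent chain, 1 for a lone root span
}

function summarizeTrace(spans: SpanRecord[]): TraceSummary {
  const byId = new Map<string, SpanRecord>();
  for (const span of spans) byId.set(span.spanId, span);

  // Walk up the parent chain; the seen-set guards against malformed cyclic data.
  const depthOf = (span: SpanRecord): number => {
    let depth = 1;
    let current = span;
    const seen = new Set<string>([span.spanId]);
    while (current.parentSpanId !== null) {
      const parent = byId.get(current.parentSpanId);
      if (!parent || seen.has(parent.spanId)) break;
      seen.add(parent.spanId);
      current = parent;
      depth += 1;
    }
    return depth;
  };

  return {
    callCount: spans.length,
    totalCostUsd: spans.reduce((sum, s) => sum + s.costUsd, 0),
    maxDepth: spans.reduce((max, s) => Math.max(max, depthOf(s)), 0),
  };
}
```

A summary like this makes the "47 calls instead of 3" case visible as a number that can be charted and alerted on, rather than something discovered by reading raw traces.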

Instrumenting LLM Calls: What to Log

Every LLM call in your application should emit a structured event with specific fields. We learned which fields matter by debugging production issues and wishing we had logged something we did not. This list is the result of 18 months of iterative refinement:

// llm-telemetry.ts — structured logging for every LLM call
import { trace, SpanKind } from '@opentelemetry/api';

interface LLMCallEvent {
  // Identity — correlate this call with the parent request
  traceId: string;
  spanId: string;
  parentSpanId: string | null;

  // Model — what was called
  model: string;            // "gpt-4-turbo-2024-04-09" (include date suffix!)
  provider: string;         // "openai"
  temperature: number;
  maxTokens: number;

  // Tokens — the cost driver
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cachedTokens: number;     // OpenAI prompt caching

  // Cost — computed from token counts + pricing table
  costUsd: number;

  // Timing — latency breakdown
  latencyMs: number;
  timeToFirstTokenMs: number;  // For streaming responses

  // Content metadata (not the full content — that goes to a separate store)
  promptHash: string;          // SHA-256 of the prompt for deduplication analysis
  promptTemplate: string;      // Which template was used (e.g., "customer-support-v3")
  outputLength: number;        // Character count of response

  // Quality signals — these are the fields traditional APM misses
  finishReason: string;        // "stop", "length", "content_filter"
  toolCallCount: number;       // How many tool calls the model made
  retryCount: number;          // How many retries before success
}

export function instrumentLLMCall(
  provider: string,
  model: string,
  callFn: () => Promise<LLMResponse>
): Promise<LLMResponse> {
  const tracer = trace.getTracer('ai-observability');

  return tracer.startActiveSpan(
    `llm.${provider}.${model}`,
    { kind: SpanKind.CLIENT },
    async (span) => {
      const start = performance.now();
      try {
        const response = await callFn();
        const latencyMs = performance.now() - start;

        span.setAttributes({
          'llm.model': model,
          'llm.provider': provider,
          'llm.prompt_tokens': response.usage.prompt_tokens,
          'llm.completion_tokens': response.usage.completion_tokens,
          'llm.total_tokens': response.usage.total_tokens,
          'llm.finish_reason': response.choices[0].finish_reason,
          'llm.latency_ms': latencyMs,
          'llm.cost_usd': calculateCost(model, response.usage),
          'llm.tool_call_count': response.choices[0].message.tool_calls?.length || 0
        });

        return response;
      } catch (error) {
        span.recordException(error as Error);
        span.setStatus({ code: 2 }); // 2 is SpanStatusCode.ERROR in @opentelemetry/api
        throw error;
      } finally {
        span.end();
      }
    }
  );
}

The fields that proved most valuable in debugging: finishReason (“length” means the response was truncated — a common source of broken JSON or incomplete answers that looks like a hallucination but is actually a token limit issue), cachedTokens (tells you if prompt caching is working — a misconfigured cache can double your costs overnight and the only way to detect it is monitoring this field), and timeToFirstTokenMs (for streaming UIs, this is the user-perceived latency, not the total call duration — users perceive your AI as “slow” based on time-to-first-token, not total generation time).
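As an illustration of why finishReason is worth checking before blaming the model, the sketch below separates truncated output from genuinely malformed JSON. The `parseModelJson` helper and the `ParseResult` type are hypothetical names, not part of any provider SDK.

```typescript
// Result of trying to parse a model response that is expected to be JSON.
type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; reason: 'truncated' | 'content_filter' | 'invalid_json' };

function parseModelJson(content: string, finishReason: string): ParseResult {
  // "length" means the token limit cut the response short, so any JSON is likely incomplete.
  if (finishReason === 'length') return { ok: false, reason: 'truncated' };
  if (finishReason === 'content_filter') return { ok: false, reason: 'content_filter' };
  try {
    return { ok: true, value: JSON.parse(content) };
  } catch {
    // The model finished normally but still produced unparseable output.
    return { ok: false, reason: 'invalid_json' };
  }
}
```

Counting the three failure reasons separately shows whether the fix is a higher token limit, a prompt change, or a safety-policy review.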

One field we added after a painful incident: promptTemplate. When a quality regression occurs, the first question is “which prompt changed?” Without a template identifier, you are searching through deployment logs and code diffs. With it, you can immediately filter to requests using a specific prompt template and compare quality scores before and after the change.
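The instrumentation code in this section calls a calculateCost helper that is not shown. A minimal sketch of one possible version follows, assuming a per-model pricing table expressed in USD per million tokens. The model name and every price below are placeholders rather than real provider pricing, and the flattened `cached_tokens` field is an assumption about how usage data is passed in.

```typescript
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
  cached_tokens?: number; // assumed field: prompt tokens served from the provider's cache
}

interface ModelPricing {
  inputPerMTok: number;       // USD per 1M uncached prompt tokens
  cachedInputPerMTok: number; // USD per 1M cached prompt tokens
  outputPerMTok: number;      // USD per 1M completion tokens
}

// PLACEHOLDER values for illustration only; load real prices from your provider's price list.
const PRICING: Record<string, ModelPricing> = {
  'example-model-2024-01-01': { inputPerMTok: 10, cachedInputPerMTok: 5, outputPerMTok: 30 },
};

function calculateCost(model: string, usage: TokenUsage): number {
  const pricing = PRICING[model];
  if (!pricing) {
    // Failing loudly is a design choice: a silent $0 would hide spend on unpriced models.
    throw new Error(`No pricing entry for model "${model}"`);
  }
  const cached = Math.min(usage.cached_tokens ?? 0, usage.prompt_tokens);
  const uncached = usage.prompt_tokens - cached;
  return (
    uncached * pricing.inputPerMTok +
    cached * pricing.cachedInputPerMTok +
    usage.completion_tokens * pricing.outputPerMTok
  ) / 1_000_000;
}
```

Keying the table by the full dated model name matches the advice to log the date suffix, since prices can differ between snapshots of the same model family.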

Cost Tracking: The $47 Alert That Saved Us $12,000

AI applications have a cost profile unlike any other software. A bug that causes an agent to loop — retrying the same LLM call because it cannot parse the response — can run up thousands of dollars in API charges before anyone notices. We know this because it happened to us.

In February, a schema change in one of our downstream APIs caused our AI agent to receive responses it could not parse. The agent retried with increasingly detailed prompts (including the error message and the malformed response in the prompt, which increased token count each iteration). In 40 minutes, a single user’s session accumulated $47 in OpenAI charges. If this had hit during peak traffic with 200 concurrent users, the cost would have been $12,000+ before we noticed.

We now enforce per-request cost budgets and per-user rate limits on LLM spend:

// cost-guard.ts — per-request and per-user cost limits
const COST_LIMITS = {
  perRequest: 0.50,      // Single request cannot exceed $0.50
  perUserPerHour: 5.00,  // Single user cannot exceed $5/hour
  perUserPerDay: 20.00,  // Single user cannot exceed $20/day
  globalPerMinute: 50.00 // All users combined cannot exceed $50/minute
};

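class CostGuard {
  // Redis client is injected; an ioredis-style API (get, incrbyfloat, expire) is assumed
  constructor(private readonly redis: Redis) {}

  // One possible bucketing scheme: prefixes of the UTC ISO timestamp,
  // e.g. "2024-02-14T09:30" (minute), "2024-02-14T09" (hour), "2024-02-14" (day)
  private currentMinuteBucket(): string { return new Date().toISOString().slice(0, 16); }
  private currentHourBucket(): string { return new Date().toISOString().slice(0, 13); }
  private currentDayBucket(): string { return new Date().toISOString().slice(0, 10); }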

  async checkBudget(userId: string, estimatedCost: number): Promise<boolean> {
    const hourKey = `cost:user:${userId}:${this.currentHourBucket()}`;
    const dayKey = `cost:user:${userId}:${this.currentDayBucket()}`;
    const globalKey = `cost:global:${this.currentMinuteBucket()}`;

    const [hourTotal, dayTotal, globalTotal] = await Promise.all([
      this.redis.get(hourKey).then(v => parseFloat(v || '0')),
      this.redis.get(dayKey).then(v => parseFloat(v || '0')),
      this.redis.get(globalKey).then(v => parseFloat(v || '0'))
    ]);

    if (estimatedCost > COST_LIMITS.perRequest) {
      throw new CostLimitError('PER_REQUEST', estimatedCost, COST_LIMITS.perRequest);
    }
    if (hourTotal + estimatedCost > COST_LIMITS.perUserPerHour) {
      throw new CostLimitError('PER_USER_HOUR', hourTotal, COST_LIMITS.perUserPerHour);
    }
    if (dayTotal + estimatedCost > COST_LIMITS.perUserPerDay) {
      throw new CostLimitError('PER_USER_DAY', dayTotal, COST_LIMITS.perUserPerDay);
    }
    if (globalTotal + estimatedCost > COST_LIMITS.globalPerMinute) {
      throw new CostLimitError('GLOBAL_MINUTE', globalTotal, COST_LIMITS.globalPerMinute);
    }
    return true;
  }

  async recordCost(userId: string, actualCost: number): Promise<void> {
    const hourKey = `cost:user:${userId}:${this.currentHourBucket()}`;
    const dayKey = `cost:user:${userId}:${this.currentDayBucket()}`;
    const globalKey = `cost:global:${this.currentMinuteBucket()}`;

    await Promise.all([
      this.redis.incrbyfloat(hourKey, actualCost),
      this.redis.expire(hourKey, 7200),
      this.redis.incrbyfloat(dayKey, actualCost),
      this.redis.expire(dayKey, 172800),
      this.redis.incrbyfloat(globalKey, actualCost),
      this.redis.expire(globalKey, 300)
    ]);
  }
}

The estimatedCost parameter is computed from the prompt token count (known before the call) and an estimated completion token count based on historical averages for that prompt template. It is not exact, but it catches the 10x and 100x cost anomalies that matter. For a $0.03 average request, a $0.50 per-request limit allows roughly 16x headroom ($0.50 / $0.03 ≈ 16.7) for naturally expensive queries while catching runaway loops that would generate $5+ requests.
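The estimate described above can be sketched as follows. The template statistics, the fallback value, and the function signature are all assumptions made for illustration; in practice the averages would come from your own telemetry.

```typescript
interface TemplateStats {
  avgCompletionTokens: number; // historical mean completion length for this template
}

// Hypothetical numbers; in practice these would be refreshed from logged LLM call events.
const TEMPLATE_STATS: Record<string, TemplateStats> = {
  'customer-support-v3': { avgCompletionTokens: 400 },
};

// Fallback for templates with no history yet (an arbitrary, deliberately cautious guess).
const DEFAULT_COMPLETION_TOKENS = 500;

function estimateCost(
  promptTokens: number,
  promptTemplate: string,
  inputPricePerMTok: number,  // USD per 1M prompt tokens
  outputPricePerMTok: number  // USD per 1M completion tokens
): number {
  const expectedCompletion =
    TEMPLATE_STATS[promptTemplate]?.avgCompletionTokens ?? DEFAULT_COMPLETION_TOKENS;
  return (promptTokens * inputPricePerMTok + expectedCompletion * outputPricePerMTok) / 1_000_000;
}
```

Because the prompt grows on every iteration of a runaway loop, the prompt-token term alone pushes the estimate past the per-request limit even when the completion guess is off.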

RAG Pipeline Observability: Retrieval Quality Matters

For retrieval-augmented generation (RAG) applications, the retrieval step is where most quality problems originate. If the retriever fetches irrelevant documents, the LLM will either hallucinate (ignoring the context) or faithfully follow context that is itself wrong. Either way, the output is bad, and traditional APM shows a successful 200 response.

We instrument every retrieval step with relevance scoring:

// rag-telemetry.ts — retrieval quality metrics
interface RetrievalEvent {
  query: string;
  queryEmbeddingModel: string;
  retrievalSource: string;         // "pinecone", "weaviate", "pgvector"
  documentsRetrieved: number;      // Total docs returned by vector search
  documentsUsed: number;           // Docs after re-ranking/filtering
  topKSimilarityScores: number[];  // Cosine similarity of top-k results
  meanSimilarityScore: number;
  minSimilarityScore: number;
  retrievalLatencyMs: number;
  rerankingModel: string | null;   // "cohere-rerank-v3" or null
  rerankingLatencyMs: number | null;
}

function checkRetrievalQuality(event: RetrievalEvent): void {
  // Scores are assumed to be sorted in descending order; an empty array means nothing was retrieved
  const [top1, top2] = event.topKSimilarityScores;

  // If the best document has low similarity (or there is none), the retriever found nothing useful
  if (top1 === undefined || top1 < 0.72) {
    metrics.increment('rag.low_relevance_query', {
      source: event.retrievalSource,
      similarity: (top1 ?? 0).toFixed(2)
    });
  }

  // If the gap between top-1 and top-2 is small, the retriever is uncertain
  if (top1 !== undefined && top2 !== undefined) {
    const relevanceGap = top1 - top2;
    if (relevanceGap < 0.05) {
      metrics.increment('rag.ambiguous_retrieval', {
        gap: relevanceGap.toFixed(3)
      });
    }
  }

  // If we retrieved 10 but used only 2 after re-ranking, initial retrieval is noisy
  // (skipped when nothing was retrieved, to avoid dividing by zero)
  if (event.documentsRetrieved > 0) {
    const filteringRatio = event.documentsUsed / event.documentsRetrieved;
    if (filteringRatio < 0.3) {
      metrics.increment('rag.heavy_filtering', {
        ratio: filteringRatio.toFixed(2)
      });
    }
  }
}

The topKSimilarityScores array is the most diagnostic metric. When the top similarity score drops below a threshold (we use 0.72 for our domain, calibrated against human-labeled relevance judgments), it means the vector store does not contain relevant information for the query. No amount of prompt engineering will fix a retrieval miss — you need to add content to your knowledge base. We surface these low-relevance queries in a weekly report that drives our content team's knowledge base expansion priorities.
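The article does not say how the 0.72 threshold was derived from the human labels. One common approach, shown here purely as an assumption, is to pick the cutoff that maximizes F1 over labeled (top score, relevant?) pairs. The `LabeledRetrieval` type and `calibrateThreshold` function are hypothetical.

```typescript
interface LabeledRetrieval {
  topScore: number;  // top-1 similarity for the query
  relevant: boolean; // human judgment: did the top document answer the query?
}

// Try each observed score as a cutoff and keep the one with the best F1.
function calibrateThreshold(samples: LabeledRetrieval[]): { threshold: number; f1: number } {
  const candidates = Array.from(new Set(samples.map((s) => s.topScore))).sort((a, b) => a - b);
  let best = { threshold: 0, f1: 0 };
  for (const threshold of candidates) {
    let tp = 0;
    let fp = 0;
    let fn = 0;
    for (const s of samples) {
      const predictedRelevant = s.topScore >= threshold;
      if (predictedRelevant && s.relevant) tp += 1;
      else if (predictedRelevant && !s.relevant) fp += 1;
      else if (!predictedRelevant && s.relevant) fn += 1;
    }
    const f1 = tp === 0 ? 0 : (2 * tp) / (2 * tp + fp + fn);
    if (f1 > best.f1) best = { threshold, f1 }; // ties keep the lower threshold
  }
  return best;
}
```

If missing a low-relevance query is costlier than a false alarm, a precision- or recall-weighted objective may fit better than plain F1. Either way, the cutoff should be recalibrated whenever the embedding model changes.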

The ambiguous retrieval metric (small gap between top-1 and top-2) has been surprisingly useful. When the top two results are nearly equally similar to the query, the retriever is essentially flipping a coin about which document to prioritize. Adding a re-ranking step (we use Cohere's rerank-v3 model) improved answer quality for ambiguous retrievals by 23% because the re-ranker uses cross-attention between the query and each document, which is more accurate than cosine similarity on embeddings for close calls.

Evaluation in Production: LLM-as-Judge

The hardest part of AI observability is evaluating output quality at scale. You cannot manually review every AI response. The technique that works for us is LLM-as-judge: a separate, cheaper LLM call that evaluates the quality of the primary LLM's output. We run this asynchronously on a 10% sample of production responses:

// quality-evaluator.ts — async LLM-as-judge on sampled responses
async function evaluateResponseQuality(
  userQuery: string,
  aiResponse: string,
  retrievedContext: string[]
): Promise<QualityScore> {
  const evaluation = await openai.chat.completions.create({
    model: 'gpt-4o-mini',  // Cheaper model for evaluation
    temperature: 0,         // Deterministic for consistent scoring
    response_format: { type: 'json_object' },
    messages: [{
      role: 'system',
      content: `You are a quality evaluator for AI responses. Score the response on these dimensions.
Return JSON of the form {"<dimension>": {"score": <1-5>, "justification": "<1 sentence>"}} with one key per dimension below.

Dimensions:
- relevance: Does the response directly address the user's question? (1=completely off-topic, 5=directly answers)
- accuracy: Is the information factually correct based on the provided context? (1=contradicts context, 5=fully supported)
- completeness: Does the response fully answer the question? (1=missing critical info, 5=comprehensive)
- safety: Does the response avoid harmful, biased, or inappropriate content? (1=harmful, 5=appropriate)`
    }, {
      role: 'user',
      content: `User query: ${userQuery}\n\nRetrieved context:\n${retrievedContext.join('\n---\n')}\n\nAI response: ${aiResponse}`
    }]
  });

  const scores = JSON.parse(evaluation.choices[0].message.content!);

  // Alert on critically low scores
  if (scores.accuracy.score <= 2) {
    await alertOps('Low accuracy score detected', { userQuery, scores });
  }
  if (scores.safety.score <= 3) {
    await alertOps('Safety concern detected', { userQuery, scores });
  }

  return scores;
}

We run this evaluation on 10% of production responses, which costs approximately $180/month at our volume (50,000 AI responses/month, with evaluation using GPT-4o-mini at roughly $0.036 per evaluation). The 10% sample rate gives us statistically significant quality trends with daily granularity. When we detect a quality regression — for example, accuracy scores dropping from an average of 4.2 to 3.6 over a week — we investigate the root cause, which is usually one of three things: a retrieval quality issue (stale or missing content in the knowledge base), a model provider update (OpenAI or Anthropic changed model behavior), or a prompt regression (someone modified a prompt template without testing it against the evaluation suite).
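The sampling decision itself can be made deterministic, so that a given request is either always or never evaluated, no matter which worker handles it. The sketch below assumes a trace ID is available for each response and hashes it with an FNV-1a-style function; both function names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (not byte-exact FNV, which is fine for bucketing).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Maps the hash into [0, 1) and compares it with the sample rate (0.1 for a 10% sample).
function shouldEvaluate(traceId: string, sampleRate: number): boolean {
  return fnv1a(traceId) / 0x100000000 < sampleRate;
}
```

Hashing instead of calling Math.random() means the sampled set can be reproduced when investigating a regression, and raising the rate only adds responses to that set.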

One important calibration step: we validated our LLM-as-judge against human evaluators. We had 3 team members score 200 randomly sampled responses on the same dimensions, and compared their scores to the LLM judge's scores. The correlation was 0.81 for relevance, 0.76 for accuracy, 0.73 for completeness, and 0.89 for safety. These correlations are strong enough for trend detection (catching regressions) but not strong enough for individual response evaluation (determining whether a specific response is good or bad). We use the LLM judge for the former and human review for the latter when investigating flagged responses.
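The article does not state which correlation statistic was used. Assuming Pearson correlation between the human scores and the judge scores for one dimension, the calculation can be sketched as below. For ordinal 1-5 scores, a rank-based statistic such as Spearman would be a reasonable alternative.

```typescript
// Pearson correlation coefficient between two equal-length score lists.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('Need two lists of equal length with at least 2 values');
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  if (varianceX === 0 || varianceY === 0) {
    throw new Error('Correlation is undefined when one list has no variance');
  }
  return covariance / Math.sqrt(varianceX * varianceY);
}
```

With three human raters, one option is to average their scores per response before correlating against the judge. Whether the reported figures were computed that way is not stated in the article.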

Drift Detection: Catching Silent Quality Degradation

Model providers update their models without notice. OpenAI has modified GPT-4's behavior multiple times since launch, sometimes improving it and sometimes changing it in ways that break specific prompt patterns. Anthropic updates Claude similarly. These changes are often not announced in changelogs and may not change the model name in the API response. Your application continues to work, but the output quality shifts in ways that only become visible over days or weeks.

We detect model drift by maintaining a reference evaluation set: 200 curated question-answer pairs that we run against each model monthly and after any suspected model update. The evaluation set covers our key use cases (customer support responses, data extraction, content summarization, code analysis) with human-verified expected outputs. When we run the evaluation, we compare the new outputs against the reference outputs using both automated metrics (BLEU score, semantic similarity via embeddings) and our LLM-as-judge system.

// drift-detector.ts — monthly model evaluation against reference set
interface DriftReport {
  model: string;
  evaluationDate: string;
  referenceDate: string;
  metrics: {
    avgSemanticSimilarity: number;  // vs reference outputs
    qualityScoreDelta: number;      // LLM-judge score change
    formatAdherence: number;        // % of outputs matching expected format
    newFailures: string[];          // Test cases that passed before, fail now
  };
  verdict: 'stable' | 'minor_drift' | 'significant_drift';
}

async function runDriftEvaluation(model: string): Promise<DriftReport> {
  const referenceSet = await loadReferenceSet();
  const results = [];

  for (const testCase of referenceSet) {
    const output = await callModel(model, testCase.prompt);
    const similarity = await computeSemanticSimilarity(
      output, testCase.referenceOutput
    );
    const qualityScore = await evaluateWithJudge(
      testCase.prompt, output, testCase.context
    );
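    // Carry the baseline score and a format check forward so the report below can read them.
    // matchesFormat is an assumed helper (not shown) that validates output against the test case's expected format.
    results.push({
      testCase: testCase.id, similarity, qualityScore, output,
      baselineScore: testCase.baselineQualityScore,
      matchesExpectedFormat: matchesFormat(output, testCase)
    });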
  }

  const avgSimilarity = mean(results.map(r => r.similarity));
  const avgQuality = mean(results.map(r => r.qualityScore));
  const referenceQuality = mean(referenceSet.map(r => r.baselineQualityScore));

  return {
    model,
    evaluationDate: new Date().toISOString(),
    referenceDate: referenceSet[0].baselineDate,
    metrics: {
      avgSemanticSimilarity: avgSimilarity,
      qualityScoreDelta: avgQuality - referenceQuality,
      formatAdherence: results.filter(r => r.matchesExpectedFormat).length / results.length,
      newFailures: results.filter(r => r.qualityScore < 3 && r.baselineScore >= 3).map(r => r.testCase)
    },
    verdict: avgSimilarity > 0.92 ? 'stable' :
             avgSimilarity > 0.85 ? 'minor_drift' : 'significant_drift'
  };
}

We run this evaluation on the first of every month and whenever we observe unexplained quality score changes in our daily monitoring. The evaluation costs approximately $15 per run (200 test cases × 2 model calls each: the target model and the judge model). We have caught two significant drifts in the past year: once when GPT-4 Turbo changed its JSON formatting behavior (adding extra whitespace that broke our parsers), and once when Claude's response length distribution shifted shorter (it started producing more concise answers that omitted details our users expected).

Embedding drift is a separate concern. When you use embedding models for RAG retrieval, a model update can change the embedding space, which means your existing vector index becomes misaligned with new queries. We detect embedding drift by maintaining a set of 50 "anchor" query-document pairs with known relevance scores. If the similarity scores for these anchor pairs change by more than 5%, we re-embed our entire document corpus. This has happened once (when OpenAI updated text-embedding-3-small) and the re-embedding cost $120 but prevented a gradual degradation of retrieval quality that would have been nearly impossible to diagnose without the anchor set.
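A minimal sketch of the anchor check follows. It reads "change by more than 5%" as a relative change against the stored baseline similarity, which is an assumption; the `AnchorPair` type and both function names are hypothetical.

```typescript
interface AnchorPair {
  id: string;
  baselineSimilarity: number; // similarity recorded when the index was built
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error('Cosine similarity is undefined for a zero vector');
  return dot / Math.sqrt(normA * normB);
}

// Returns the IDs of anchors whose current similarity moved more than `tolerance` (relative).
function driftedAnchors(
  anchors: AnchorPair[],
  currentSimilarity: Map<string, number>,
  tolerance = 0.05
): string[] {
  return anchors
    .filter((anchor) => {
      const now = currentSimilarity.get(anchor.id);
      if (now === undefined) return true; // a missing measurement is treated as drift
      const denominator = Math.abs(anchor.baselineSimilarity) || 1; // avoid dividing by zero
      return Math.abs(now - anchor.baselineSimilarity) / denominator > tolerance;
    })
    .map((anchor) => anchor.id);
}
```

The current similarities would come from embedding each anchor query with the live model and comparing it with the stored document vector via `cosineSimilarity`. How many drifted anchors should trigger a full re-embed is a policy decision this sketch leaves open.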

The Dashboard That Matters

Our AI observability dashboard has four panels that our team checks daily:

  1. Cost per request (p50, p95, p99). The p99 catches runaway agent loops before they become expensive. Normal p99 is $0.08; we alert at $0.25.
  2. Quality scores (daily rolling average). Relevance, accuracy, completeness, and safety from the LLM-as-judge evaluator. We alert when any dimension drops more than 0.5 points from the 7-day moving average.
  3. Retrieval relevance distribution. A histogram of top-1 similarity scores across all RAG queries. A leftward shift in the distribution means the knowledge base is becoming stale or user queries are drifting into topics we have not covered.
  4. Token utilization by prompt template. Prompt tokens vs. completion tokens, broken down by prompt template. This shows which prompts are consuming the most tokens (candidates for optimization) and whether prompt caching is working (cached token ratio should be >50% for repeated prompt prefixes).
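
The quality alert in panel 2 can be sketched as a comparison between today's average score and the mean of the previous seven daily averages. The function name and the choice to skip alerting when there is no history are assumptions for this sketch.

```typescript
// Returns true when today's score is more than `maxDrop` below the trailing moving average.
function dropsBelowMovingAverage(
  dailyAverages: number[], // oldest first, excluding today
  today: number,
  windowDays = 7,
  maxDrop = 0.5
): boolean {
  const recent = dailyAverages.slice(-windowDays);
  if (recent.length === 0) return false; // no baseline yet, so nothing to compare against
  const movingAverage = recent.reduce((a, b) => a + b, 0) / recent.length;
  return movingAverage - today > maxDrop;
}
```

Running this once per dimension (relevance, accuracy, completeness, safety) keeps a drop in one score from being averaged away by the other three.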

Traditional APM is necessary but insufficient for AI applications. You need quality evaluation, cost tracking, retrieval diagnostics, and drift detection layered on top of your existing latency and error rate monitoring. The good news is that the tooling ecosystem is maturing — Langfuse, Helicone, Braintrust, and LangSmith all provide some of these capabilities out of the box. The bad news is that no single tool covers all four pillars well, and you will likely need to build some custom instrumentation. Start with cost tracking (it has the most immediate ROI) and quality evaluation (it catches the bugs that no other monitoring can see).
