Monitoring and Alerting for AI-Powered Applications

Author: Sarah Chen

Published on: September 23, 2022

Traditional monitoring assumes deterministic systems. You send the same input, you get the same output, and any deviation is a bug worth investigating. AI-powered applications break this assumption completely. A language model might return different text for the same prompt on consecutive calls. A recommendation engine might serve different results depending on recent user behavior and exploration parameters. An image classifier might produce different confidence scores depending on model version, preprocessing pipeline, and even floating-point nondeterminism across hardware.


This nondeterminism means the standard checks are necessary but no longer sufficient. Health checks (“is it up?”), latency monitoring (“is it fast?”), and error rate tracking (“is it failing?”) all still apply, but you also need to monitor what the AI is actually doing: Is it producing reasonable outputs? Is it hallucinating? Is it drifting from expected behavior? Is it costing more than it should?

At Harbor Software, we have been integrating AI models into our products for the past year, primarily using OpenAI’s API for text generation and classification tasks. The monitoring challenges forced us to develop new patterns that go beyond the traditional three pillars of observability. Here is what we learned.

The Four Pillars of AI Application Monitoring

Standard application monitoring focuses on three pillars: metrics, logs, and traces. AI applications add a fourth: model behavior monitoring. This pillar captures what the model is actually doing with the inputs it receives and whether its outputs meet quality expectations over time.

Pillar 1: Infrastructure Metrics (Familiar but Different)

AI applications still need standard infrastructure monitoring: CPU, memory, disk, network. But the metrics that matter most are different from a typical CRUD application because the computational profile is different. A CRUD API spends most of its time waiting on database queries (IO-bound). An AI-powered API spends most of its time waiting on model inference (either IO-bound if calling an external API like OpenAI, or compute-bound if running models locally).

import * as promClient from 'prom-client';

// Prometheus metrics for AI inference
const aiMetrics = {
  // Latency histogram with AI-appropriate buckets
  // (note: much higher than typical API latency buckets)
  inferenceDuration: new promClient.Histogram({
    name: 'ai_inference_duration_seconds',
    help: 'Duration of AI inference calls in seconds',
    labelNames: ['model', 'task', 'provider'],
    buckets: [0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0]
  }),

  // Token usage for cost tracking (OpenAI-specific)
  tokensUsed: new promClient.Counter({
    name: 'ai_tokens_used_total',
    help: 'Total tokens consumed by AI inference',
    labelNames: ['model', 'type'] // type: prompt | completion
  }),

  // Provider reliability
  providerRequests: new promClient.Counter({
    name: 'ai_provider_requests_total',
    help: 'Total requests to AI providers',
    labelNames: ['provider', 'model', 'status'] // status: success | error | rate_limited | timeout
  }),

  // Queue depth for async inference
  queueDepth: new promClient.Gauge({
    name: 'ai_inference_queue_depth',
    help: 'Number of inference requests waiting in queue',
    labelNames: ['model', 'task']
  }),

  // Cost tracking in cents
  inferCostCents: new promClient.Counter({
    name: 'ai_inference_cost_cents_total',
    help: 'Estimated cost of AI inference in cents',
    labelNames: ['model', 'task']
  })
};

AI inference calls are expensive and slow compared to traditional backend operations. A database query takes 5-50ms. A Redis lookup takes 1-5ms. An OpenAI API call takes 500-10,000ms depending on the model, prompt length, and response length. This difference has cascading implications: your latency budgets need to account for seconds not milliseconds, your timeout configurations need to be generous (we use 30 seconds for GPT-4, 15 seconds for GPT-3.5), and your circuit breaker thresholds need to tolerate higher baseline latency without tripping unnecessarily.
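The timeout guidance above can be sketched as a small helper. This is a minimal sketch, not the implementation used at Harbor Software: `TimeoutError`, `timeoutFor`, `withTimeout`, and the default timeout for unlisted models are all assumptions made for illustration. Only the 30-second and 15-second values come from the text.

```typescript
// Hypothetical error type; its name matches the 'TimeoutError' check used in fallback handling.
class TimeoutError extends Error {
  constructor(message = 'Operation timed out') {
    super(message);
    this.name = 'TimeoutError';
  }
}

// Per-model timeouts from the text above. The default is an assumption.
const TIMEOUT_MS: Record<string, number> = {
  'gpt-4': 30_000,
  'gpt-3.5-turbo': 15_000,
};
const DEFAULT_TIMEOUT_MS = 15_000;

function timeoutFor(model: string): number {
  return TIMEOUT_MS[model] ?? DEFAULT_TIMEOUT_MS;
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever promise settles first wins; always clear the timer afterwards
  // so a pending timeout cannot keep the process alive.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying HTTP request, which may still complete and be billed. Cancelling the request itself would need the client library's own abort mechanism.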

Pillar 2: Logs (Structured, Comprehensive, and Privacy-Aware)

Every AI inference call should produce a structured log entry that captures the full context needed for debugging and quality analysis. This is more detailed than typical API logging because AI debugging often requires understanding the relationship between input, configuration, and output:

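import * as crypto from 'node:crypto';
import OpenAI from 'openai';

// Assumes the openai Node SDK (v4 or later), which reads OPENAI_API_KEY from the environment
const openai = new OpenAI();

async function generateSummary(document: Document): Promise<string> {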
  const startTime = Date.now();
  const requestId = crypto.randomUUID();
  const prompt = buildSummaryPrompt(document);

  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 500,
      temperature: 0.3,
    });

    const duration = Date.now() - startTime;
    const result = response.choices[0].message.content!;

    // Structured log with all relevant context
    logger.info({
      event: 'ai_inference_complete',
      request_id: requestId,
      model: 'gpt-3.5-turbo',
      task: 'document_summary',
      duration_ms: duration,
      prompt_tokens: response.usage!.prompt_tokens,
      completion_tokens: response.usage!.completion_tokens,
      total_tokens: response.usage!.total_tokens,
      estimated_cost_cents: calculateCost('gpt-3.5-turbo', response.usage!),
      temperature: 0.3,
      max_tokens: 500,
      prompt_char_length: prompt.length,
      response_char_length: result.length,
      response_word_count: result.split(/\s+/).length,
      finish_reason: response.choices[0].finish_reason,
      document_id: document.id,
      document_word_count: document.body.split(/\s+/).length,
      // Privacy: hash the prompt/response instead of logging full content
      prompt_hash: crypto.createHash('sha256').update(prompt).digest('hex').slice(0, 16),
      response_hash: crypto.createHash('sha256').update(result).digest('hex').slice(0, 16),
    });

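    // Record metrics. Counting successful calls gives the error-rate
    // and fallback-rate alerts a denominator to divide by.
    aiMetrics.providerRequests.inc(
      { provider: 'openai', model: 'gpt-3.5-turbo', status: 'success' }
    );
    aiMetrics.inferenceDuration.observe(
      { model: 'gpt-3.5-turbo', task: 'document_summary', provider: 'openai' },
      duration / 1000
    );
    aiMetrics.tokensUsed.inc(
      { model: 'gpt-3.5-turbo', type: 'prompt' },
      response.usage!.prompt_tokens
    );
    aiMetrics.tokensUsed.inc(
      { model: 'gpt-3.5-turbo', type: 'completion' },
      response.usage!.completion_tokens
    );
    // Feed the cost counter that the daily cost alert reads
    aiMetrics.inferCostCents.inc(
      { model: 'gpt-3.5-turbo', task: 'document_summary' },
      calculateCost('gpt-3.5-turbo', response.usage!)
    );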

    return result;
  } catch (error: any) {
    const duration = Date.now() - startTime;

    logger.error({
      event: 'ai_inference_error',
      request_id: requestId,
      model: 'gpt-3.5-turbo',
      task: 'document_summary',
      duration_ms: duration,
      error_type: error.constructor.name,
      error_message: error.message,
      error_status: error.status,
      error_code: error.code,
      document_id: document.id,
    });

    aiMetrics.providerRequests.inc(
      { provider: 'openai', model: 'gpt-3.5-turbo', status: classifyError(error) }
    );

    throw error;
  }
}

function classifyError(error: any): string {
  if (error.status === 429) return 'rate_limited';
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNABORTED') return 'timeout';
  return 'error';
}

A critical note on privacy: in production, logging full prompts and responses creates PII risks (user data in prompts), storage cost concerns (GPT-4 responses can be several KB each), and potential compliance issues. We log content hashes for deduplication analysis, and sample 1% of requests with full content (encrypted, stored in a separate retention-limited bucket) for quality reviews. In staging, we log everything because quality debugging requires the full context.
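The hash-plus-sampling approach described above might look like the following. This is a sketch under stated assumptions: `contentFingerprint` and `shouldLogFullContent` are hypothetical names, and deriving the sampling decision from the request ID (rather than from a random number) is one possible design, chosen here so that every service handling the same request makes the same decision.

```typescript
import { createHash } from 'node:crypto';

// Short fingerprint for deduplication analysis (same truncation as the logging code above).
function contentFingerprint(text: string): string {
  return createHash('sha256').update(text).digest('hex').slice(0, 16);
}

// Deterministic sampling: map the request ID onto [0, 1) and compare with the rate.
// The default of 0.01 corresponds to the 1% sample mentioned in the text.
function shouldLogFullContent(requestId: string, sampleRate = 0.01): boolean {
  const digest = createHash('sha256').update(requestId).digest();
  const bucket = digest.readUInt32BE(0) / 2 ** 32; // first 4 bytes as a fraction of 2^32
  return bucket < sampleRate;
}
```

A fingerprint is not anonymization: short or predictable inputs can be recovered by hashing guesses, so the hash is useful for spotting duplicates, not for protecting sensitive content.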

Pillar 3: Traces (End-to-End Request Flow)

AI calls are rarely standalone. A user request to “summarize this document” involves authentication, document retrieval, prompt construction, AI inference, response validation, post-processing, and database storage. Distributed tracing shows where time is spent and helps identify bottlenecks:

// Simplified trace structure for an AI-powered endpoint
// Total: 2,485ms
//
// [auth.verify]         |--10ms--|
// [db.fetch_document]              |---45ms---|
// [ai.build_prompt]                             |--8ms--|
// [ai.inference]                                         |----------2,300ms-----------|
// [ai.validate_response]                                                                |--5ms--|
// [ai.post_process]                                                                              |--12ms--|
// [db.save_summary]                                                                                         |--25ms--|
// [cache.invalidate]                                                                                                     |--3ms--|

This trace shows that AI inference accounts for roughly 93% of the total request time (2,300ms out of 2,485ms). The database queries and all other processing combined are a rounding error. This insight informs architectural decisions: making inference asynchronous (return immediately, process in background, notify when done) would reduce perceived latency from about 2.5 seconds to under 100ms for the user, at the cost of increased system complexity.
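The arithmetic behind that reading of the trace can be checked with a few lines. The span names and durations are copied from the trace above; the `Span` type and `percentOfRequest` helper are illustrative only.

```typescript
interface Span {
  name: string;
  durationMs: number;
}

// Durations copied from the trace above.
const spans: Span[] = [
  { name: 'auth.verify', durationMs: 10 },
  { name: 'db.fetch_document', durationMs: 45 },
  { name: 'ai.build_prompt', durationMs: 8 },
  { name: 'ai.inference', durationMs: 2300 },
  { name: 'ai.validate_response', durationMs: 5 },
  { name: 'ai.post_process', durationMs: 12 },
  { name: 'db.save_summary', durationMs: 25 },
  { name: 'cache.invalidate', durationMs: 3 },
];
const totalRequestMs = 2485;

function percentOfRequest(span: Span, totalMs: number): number {
  return (span.durationMs / totalMs) * 100;
}

// Find the span that dominates the request and how much time is left for everything else.
const slowest = spans.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
const slowestShare = percentOfRequest(slowest, totalRequestMs);
const everythingElseMs = totalRequestMs - slowest.durationMs;

console.log(`${slowest.name}: ${slowestShare.toFixed(1)}% of the request`); // ai.inference: 92.6% of the request
console.log(`everything else: ${everythingElseMs}ms`); // everything else: 185ms
```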

Pillar 4: Model Behavior Monitoring

This pillar is unique to AI applications. It tracks the characteristics of model outputs over time to detect quality degradation, behavioral drift, and anomalies that infrastructure metrics cannot capture:

const behaviorMetrics = {
  // Response characteristics
  responseLength: new promClient.Histogram({
    name: 'ai_response_length_chars',
    help: 'Character length of AI responses',
    labelNames: ['model', 'task'],
    buckets: [50, 100, 250, 500, 1000, 2000, 5000]
  }),

  responseWordCount: new promClient.Histogram({
    name: 'ai_response_word_count',
    help: 'Word count of AI responses',
    labelNames: ['model', 'task'],
    buckets: [10, 25, 50, 100, 200, 500, 1000]
  }),

  // Completion reasons (stop=normal, length=truncated, content_filter=blocked)
  finishReason: new promClient.Counter({
    name: 'ai_finish_reason_total',
    help: 'Distribution of completion finish reasons',
    labelNames: ['model', 'task', 'reason']
  }),

  // Content filter activations (model refused to generate)
  contentFiltered: new promClient.Counter({
    name: 'ai_content_filtered_total',
    help: 'Responses blocked by content safety filters',
    labelNames: ['model', 'task']
  }),

  // Fallback activations (AI response replaced with fallback)
  fallbackTriggered: new promClient.Counter({
    name: 'ai_fallback_triggered_total',
    help: 'AI responses replaced with fallback content',
    labelNames: ['model', 'task', 'reason']
  }),

  // User feedback signals
  userFeedback: new promClient.Counter({
    name: 'ai_user_feedback_total',
    help: 'User ratings of AI-generated content',
    labelNames: ['model', 'task', 'rating']
  }),

  // Validation failures (response did not match expected format)
  validationFailure: new promClient.Counter({
    name: 'ai_validation_failure_total',
    help: 'AI responses that failed format/content validation',
    labelNames: ['model', 'task', 'validation_rule']
  })
};
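The counters and histograms above still need values to record. One way to derive them, sketched here with hypothetical names (`CompletionChoice`, `BehaviorObservation`, `observeBehavior`), is a pure function that turns a completion choice into the numbers and labels the metrics expect. Keeping it free of metric-library calls makes it easy to unit test.

```typescript
// Minimal shape of the fields read here; mirrors the chat completion choice used earlier.
interface CompletionChoice {
  finish_reason: string | null;
  message: { content: string | null };
}

interface BehaviorObservation {
  charLength: number;
  wordCount: number;
  finishReason: string;
  contentFiltered: boolean;
}

// ''.split(/\s+/) yields [''], so an empty response needs an explicit zero.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

function observeBehavior(choice: CompletionChoice): BehaviorObservation {
  const content = choice.message.content ?? '';
  const finishReason = choice.finish_reason ?? 'unknown';
  return {
    charLength: content.length,
    wordCount: countWords(content),
    finishReason,
    contentFiltered: finishReason === 'content_filter',
  };
}

// Possible wiring into the metrics defined above (not executed here):
//   const obs = observeBehavior(response.choices[0]);
//   behaviorMetrics.responseLength.observe({ model, task }, obs.charLength);
//   behaviorMetrics.responseWordCount.observe({ model, task }, obs.wordCount);
//   behaviorMetrics.finishReason.inc({ model, task, reason: obs.finishReason });
//   if (obs.contentFiltered) behaviorMetrics.contentFiltered.inc({ model, task });
```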

Alerting Rules Specific to AI Systems

Standard alerting (error rate > 1%, latency P99 > 5s) still applies but is not sufficient. AI systems need alerts on cost, behavior drift, quality degradation, and provider-specific issues:

# prometheus/ai-alerts.yml
groups:
  - name: ai-operations
    rules:
      # Cost: token usage exceeding hourly budget
      - alert: AITokenBudgetExceeded
        expr: sum(increase(ai_tokens_used_total[1h])) > 500000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI token usage {{ $value | humanize }} exceeds 500K/hour budget"

      # Cost: daily spend trending high
      - alert: AIDailyCostHigh
        # The counter is in cents; divide by 100 so $value is in dollars for the summary
        expr: sum(increase(ai_inference_cost_cents_total[24h])) / 100 > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "AI daily cost ${{ $value | humanize }} exceeds $100 budget"

      # Latency: model responding slower than baseline
      - alert: AIInferenceLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(ai_inference_duration_seconds_bucket[10m])
          ) > 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AI inference P95 latency {{ $value | humanize }}s exceeds 15s threshold"

      # Provider: error rate spike
      - alert: AIProviderErrorRateHigh
        expr: |
          sum(rate(ai_provider_requests_total{status="error"}[5m]))
          / sum(rate(ai_provider_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "AI provider error rate {{ $value | humanizePercentage }} exceeds 5%"

      # Provider: rate limiting detected
      - alert: AIProviderRateLimited
        expr: increase(ai_provider_requests_total{status="rate_limited"}[5m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "AI provider rate limiting detected ({{ $value }} events in 5m)"

      # Behavior: response length drifting from baseline
      - alert: AIResponseLengthDrift
        expr: |
          abs(
            sum by (model, task) (rate(ai_response_word_count_sum[1h]))
            / sum by (model, task) (rate(ai_response_word_count_count[1h]))
            -
            sum by (model, task) (rate(ai_response_word_count_sum[7d]))
            / sum by (model, task) (rate(ai_response_word_count_count[7d]))
          ) / (
            sum by (model, task) (rate(ai_response_word_count_sum[7d]))
            / sum by (model, task) (rate(ai_response_word_count_count[7d]))
          ) > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "AI response length drifted >30% from 7-day average"

      # Quality: high fallback rate indicates systematic issues
      - alert: AIFallbackRateHigh
        expr: |
          sum(rate(ai_fallback_triggered_total[15m]))
          / sum(rate(ai_provider_requests_total{status="success"}[15m])) > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value | humanizePercentage }} of AI responses triggering fallbacks"

The behavior drift alert deserves special attention. It compares the average response word count over the last hour against the 7-day average. A drift greater than 30% indicates something changed: a model update by the provider, a prompt change in your code, or a shift in input data characteristics. This alert has caught two provider-side model updates that changed response characteristics before we noticed them through user feedback.
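The comparison behind the drift alert is a relative difference between two means. The sketch below reproduces that calculation outside Prometheus, which can be handy for backtesting a threshold against historical data. The function names are hypothetical, and returning `false` when there is no data is a design assumption rather than something the alert rule itself does.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// |recent - baseline| / baseline, the same ratio the alert expression computes.
function relativeDrift(recentMean: number, baselineMean: number): number {
  return Math.abs(recentMean - baselineMean) / baselineMean;
}

function hasDrifted(recent: number[], baseline: number[], threshold = 0.3): boolean {
  // With no samples on either side, or a zero baseline, there is nothing meaningful to compare.
  if (recent.length === 0 || baseline.length === 0) return false;
  const baselineMean = mean(baseline);
  if (baselineMean === 0) return false;
  return relativeDrift(mean(recent), baselineMean) > threshold;
}
```

For example, if summaries averaged 120 words over the past week and average 80 words in the last hour, the drift is 40 / 120, about 33%, which crosses the 30% threshold. A last-hour average of 100 words gives about 17% and does not.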

Cost Monitoring and Budget Controls

AI API costs can spike unexpectedly and dramatically. A bug that generates excessively long prompts, a retry loop that resubmits failed requests, a traffic spike, or even a change in user behavior can generate a $5,000 bill in a single night. We implement both monitoring and hard budget controls:

class AIBudgetGuard {
  private hourlyBudgetCents: number;
  private dailyBudgetCents: number;
  private redis: Redis;

  constructor(hourlyBudgetCents: number, dailyBudgetCents: number) {
    this.hourlyBudgetCents = hourlyBudgetCents;
    this.dailyBudgetCents = dailyBudgetCents;
    this.redis = new Redis(process.env.REDIS_URL!);
  }

  async checkBudget(): Promise<void> {
    const hourKey = `ai_cost:hourly:${new Date().toISOString().slice(0, 13)}`;
    const dayKey = `ai_cost:daily:${new Date().toISOString().slice(0, 10)}`;

    const [hourlySpend, dailySpend] = await Promise.all([
      this.redis.get(hourKey).then(v => parseInt(v || '0')),
      this.redis.get(dayKey).then(v => parseInt(v || '0')),
    ]);

    if (dailySpend >= this.dailyBudgetCents) {
      behaviorMetrics.fallbackTriggered.inc({ model: 'all', task: 'all', reason: 'daily_budget' });
      throw new BudgetExceededError(
        `Daily AI budget exceeded: $${(dailySpend / 100).toFixed(2)} of $${(this.dailyBudgetCents / 100).toFixed(2)}`
      );
    }

    if (hourlySpend >= this.hourlyBudgetCents) {
      behaviorMetrics.fallbackTriggered.inc({ model: 'all', task: 'all', reason: 'hourly_budget' });
      throw new BudgetExceededError(
        `Hourly AI budget exceeded: $${(hourlySpend / 100).toFixed(2)} of $${(this.hourlyBudgetCents / 100).toFixed(2)}`
      );
    }
  }

  async recordUsage(model: string, usage: { prompt_tokens: number; completion_tokens: number }): Promise<void> {
    const costCents = this.calculateCost(model, usage);
    const hourKey = `ai_cost:hourly:${new Date().toISOString().slice(0, 13)}`;
    const dayKey = `ai_cost:daily:${new Date().toISOString().slice(0, 10)}`;

    await Promise.all([
      this.redis.incrby(hourKey, costCents),
      this.redis.expire(hourKey, 7200),
      this.redis.incrby(dayKey, costCents),
      this.redis.expire(dayKey, 172800),
    ]);
  }

  private calculateCost(model: string, usage: { prompt_tokens: number; completion_tokens: number }): number {
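    // Pricing in cents per million tokens. List prices change over time,
    // so verify these against the provider's current price sheet.
    const pricing: Record<string, { prompt: number; completion: number }> = {
      'gpt-4': { prompt: 3000, completion: 6000 },       // $0.03 / $0.06 per 1K tokens
      'gpt-3.5-turbo': { prompt: 150, completion: 200 }, // $0.0015 / $0.002 per 1K tokens
    };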
    const rates = pricing[model];
    if (!rates) return 0;
    return Math.ceil(
      (usage.prompt_tokens * rates.prompt + usage.completion_tokens * rates.completion) / 1_000_000
    );
  }
}

const budgetGuard = new AIBudgetGuard(5000, 50000); // $50/hour, $500/day
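One detail in the cost calculation is worth working through: rounding each call up to a whole cent is conservative, but for small requests it can overstate spend by a wide margin. The sketch below shows the arithmetic with hypothetical rates and suggests one possible alternative, accumulating micro-cents (millionths of a cent). The `Rates` type, both function names, and the rate values are assumptions for illustration, not the author's implementation.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface Rates {
  promptCentsPerMillion: number;
  completionCentsPerMillion: number;
}

// tokens * (cents per million tokens) is exactly the cost in micro-cents,
// so with whole-number rates this stays an integer and suits Redis INCRBY.
function costMicroCents(usage: Usage, rates: Rates): number {
  return (
    usage.prompt_tokens * rates.promptCentsPerMillion +
    usage.completion_tokens * rates.completionCentsPerMillion
  );
}

function exactCostCents(usage: Usage, rates: Rates): number {
  return costMicroCents(usage, rates) / 1_000_000;
}

// Hypothetical rates and a small request, chosen only to make the arithmetic visible.
const exampleRates: Rates = { promptCentsPerMillion: 150, completionCentsPerMillion: 200 };
const exampleUsage: Usage = { prompt_tokens: 1000, completion_tokens: 500 };

const exact = exactCostCents(exampleUsage, exampleRates); // 0.25 cents
const roundedUp = Math.ceil(exact);                        // 1 cent, four times the exact cost
```

With these numbers, 1,000 such calls cost 250 cents exactly but would be recorded as 1,000 cents when each call is rounded up. If that margin matters, the guard could store micro-cents and compare against the budget multiplied by 1,000,000.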

Graceful Degradation

Every AI-powered feature must have a fallback path for when the AI is unavailable, too slow, too expensive, or producing low-quality results. Users should get a response regardless of AI availability:

async function generateProductDescription(product: Product): Promise<string> {
  try {
    await budgetGuard.checkBudget();

    const result = await withTimeout(
      callOpenAI(buildDescriptionPrompt(product)),
      10000 // 10-second timeout
    );

    // Validate: response should be a reasonable product description
    if (result.content.length < 50) {
      behaviorMetrics.fallbackTriggered.inc({ model: 'gpt-3.5-turbo', task: 'product_desc', reason: 'too_short' });
      return getFallbackDescription(product);
    }

    if (result.content.length > 5000) {
      behaviorMetrics.fallbackTriggered.inc({ model: 'gpt-3.5-turbo', task: 'product_desc', reason: 'too_long' });
      return getFallbackDescription(product);
    }

    return result.content;
  } catch (error: any) {
    const reason = error instanceof BudgetExceededError ? 'budget'
      : error.name === 'TimeoutError' ? 'timeout'
      : 'error';

    behaviorMetrics.fallbackTriggered.inc({ model: 'gpt-3.5-turbo', task: 'product_desc', reason });
    logger.warn({ event: 'ai_fallback_used', task: 'product_desc', reason, error: error.message });

    return getFallbackDescription(product);
  }
}

function getFallbackDescription(product: Product): string {
  // Fallback chain: existing description, then a minimal template
  if (product.existingDescription) return product.existingDescription;
  return `${product.name} - ${product.category}. Contact us for more details.`;
}

The fallback chain (AI response, then existing content, then template) ensures users always see something useful. The fallback rate metric is one of the most important indicators of AI system health. A rising fallback rate means AI quality or availability is degrading, and it quantifies exactly how much of your AI-powered functionality is actually working.
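When several features need the same pattern, the chain can be expressed as data instead of nested conditionals. The following is a minimal sketch; `Step` and `firstAvailable` are hypothetical names, and treating `null` as "this step has nothing to offer" is an assumption of this example.

```typescript
interface Step<T> {
  name: string;
  run: () => Promise<T | null> | T | null;
}

// Tries each step in order and returns the first non-null value,
// along with the name of the step that produced it.
async function firstAvailable<T>(steps: Step<T>[]): Promise<{ value: T; source: string }> {
  for (const step of steps) {
    try {
      const value = await step.run();
      if (value !== null) return { value, source: step.name };
    } catch {
      // A failing step is skipped so the next one gets a chance.
    }
  }
  throw new Error('No step in the fallback chain produced a value');
}
```

The returned `source` tells the caller which step answered, so anything other than the AI step can be counted as a fallback. The last step should be one that cannot fail, such as a static template, so the function never reaches the final `throw` in practice.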

Three Dashboards for Three Audiences

We maintain three Grafana dashboards:

Operations Dashboard (for on-call engineers): Error rate, latency percentiles, provider availability, queue depth, active alerts. Shows the last 6 hours. Designed for incident response: what is broken and how bad is it?
Cost Dashboard (for engineering leadership): Token usage by model and task type, hourly/daily/monthly spend, cost per request, cost per user, cost per feature, budget utilization percentage. Shows the last 30 days with weekly/monthly aggregations. Designed for answering: are we spending what we expected?
Quality Dashboard (for product team): Response length distributions, finish reason distributions, fallback rate over time, user feedback positive/negative ratios, content filter trigger rate, validation failure rate. Shows the last 7 days. Designed for answering: is the AI feature working well for users?

Testing AI Features in CI

Traditional CI tests assert deterministic outcomes: given input X, expect output Y. AI outputs are nondeterministic, which rules out exact-match assertions. You cannot assert that a summary equals a specific string because the model might phrase it differently on every run. We use three testing strategies that work within this constraint:

Contract testing: Verify the AI response matches the expected structure and constraints without asserting specific content. If the summary should be under 250 words, assert the word count. If the classification should return one of five categories, assert the value is in the allowed set. If the response should be valid JSON, parse it and assert the schema.

describe('generateSummary', () => {
  it('returns a summary within length constraints', async () => {
    const summary = await generateSummary(testDocument);
    expect(summary.length).toBeGreaterThan(50);
    expect(summary.length).toBeLessThan(2000);
    expect(summary.split(/\s+/).length).toBeLessThan(250);
  });

  it('returns a summary that mentions the document topic', async () => {
    const document = { title: 'Machine Learning Basics', body: '...' };
    const summary = await generateSummary(document);
    // At least one key term should appear in the summary
    const keyTerms = ['machine learning', 'ml', 'model', 'algorithm', 'training'];
    const containsKeyTerm = keyTerms.some(term =>
      summary.toLowerCase().includes(term)
    );
    expect(containsKeyTerm).toBe(true);
  });

  it('gracefully handles empty documents', async () => {
    const summary = await generateSummary({ id: '1', title: 'Empty', body: '' });
    expect(typeof summary).toBe('string');
    // Should return fallback, not crash
  });
});
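The same contract idea applies to structured output: parse the response as JSON and check that the category is in the allowed set. A sketch of such a validator follows. The category names, the `confidence` field, and `parseClassification` are hypothetical, since the article does not show the classification schema.

```typescript
// Hypothetical category set; a real task would define its own five values.
const ALLOWED_CATEGORIES = ['billing', 'bug', 'feature_request', 'account', 'other'] as const;
type Category = (typeof ALLOWED_CATEGORIES)[number];

interface Classification {
  category: Category;
  confidence: number;
}

// Throws a descriptive error if the raw model output violates the contract.
function parseClassification(raw: string): Classification {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error('Response is not valid JSON');
  }
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Response is not a JSON object');
  }
  const { category, confidence } = parsed as Record<string, unknown>;
  if (typeof category !== 'string' || !(ALLOWED_CATEGORIES as readonly string[]).includes(category)) {
    throw new Error(`Category is not in the allowed set: ${String(category)}`);
  }
  if (typeof confidence !== 'number' || confidence < 0 || confidence > 1) {
    throw new Error('Confidence must be a number between 0 and 1');
  }
  return { category: category as Category, confidence };
}
```

The same function can run in production as the response validation step, with the thrown message mapped to a validation-rule label, so the contract enforced in CI and the one enforced at runtime do not diverge.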

Snapshot testing with human review: For features where output quality matters but cannot be automatically verified, we record AI outputs during CI and flag them for human review when they change significantly. This is not automated pass/fail; it is a notification system that tells the team when AI behavior has changed so they can evaluate whether the change is acceptable.
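Deciding when an output has changed "significantly" needs some similarity measure. The article does not say which one is used, so the sketch below substitutes a simple one: Jaccard similarity over the sets of words in the old and new output. The function names and the 0.5 threshold are assumptions, and an embedding-based comparison would likely be more robust to paraphrasing.

```typescript
// Lowercased set of word tokens. \W+ treats any non-word character as a separator,
// which is adequate for English text but not for all languages.
function wordSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// |A ∩ B| / |A ∪ B|: 1 means identical vocabularies, 0 means no words in common.
function jaccardSimilarity(a: string, b: string): number {
  const setA = wordSet(a);
  const setB = wordSet(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  setA.forEach((word) => {
    if (setB.has(word)) shared += 1;
  });
  return shared / (setA.size + setB.size - shared);
}

// Flags a snapshot for human review when the new output shares too little with the old one.
function needsReview(previous: string, current: string, minSimilarity = 0.5): boolean {
  return jaccardSimilarity(previous, current) < minSimilarity;
}
```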

Mock testing for integration logic: The business logic around AI calls (prompt construction, response parsing, error handling, fallback chains, cost tracking) is fully deterministic and should be tested with mocked AI responses. This covers the majority of code paths without making actual AI API calls, keeping tests fast and free.

describe('AI error handling', () => {
  it('falls back to existing description on timeout', async () => {
    jest.spyOn(openai.chat.completions, 'create').mockRejectedValue(new TimeoutError());
    const product = { name: 'Widget', existingDescription: 'A great widget.' };
    const result = await generateProductDescription(product);
    expect(result).toBe('A great widget.');
  });

  it('falls back to template when no existing description and AI fails', async () => {
    jest.spyOn(openai.chat.completions, 'create').mockRejectedValue(new Error('API error'));
    const product = { name: 'Widget', category: 'Tools', existingDescription: null };
    const result = await generateProductDescription(product);
    expect(result).toContain('Widget');
    expect(result).toContain('Tools');
  });

  it('increments fallback metric on AI failure', async () => {
    jest.spyOn(openai.chat.completions, 'create').mockRejectedValue(new Error('API error'));
    const incSpy = jest.spyOn(behaviorMetrics.fallbackTriggered, 'inc');
    await generateProductDescription(testProduct);
    expect(incSpy).toHaveBeenCalledWith(
      expect.objectContaining({ reason: 'error' })
    );
  });
});

Conclusion

Monitoring AI-powered applications requires extending your observability practice beyond traditional infrastructure metrics. Model behavior monitoring, cost tracking with hard budget controls, output quality validation, and graceful degradation are not optional enhancements. They are the mechanisms that prevent a model regression, a provider outage, or an unexpected cost spike from becoming a customer-facing incident or a budget-busting surprise.

Start with structured logging and basic cost tracking (week 1). Add Prometheus metrics for latency, token usage, error rates, and fallback triggers (week 2). Implement budget guards with hard limits and a fallback chain for every AI-powered feature (week 3). Then invest in behavior drift detection, quality validation, and the three-dashboard setup as your AI usage grows (month 2-3). Each layer gives you more visibility into a system that, by its nondeterministic nature, resists the simple pass/fail health checks that work for traditional applications.
