Cost Optimization for LLM-Powered Applications

LLM API costs are the cloud computing bill of the AI era. They start small during development, grow linearly during pilot programs, and explode exponentially when you ship to production traffic. We have seen teams go from $500/month during prototyping to $50,000/month within weeks of launch, with no corresponding increase in revenue or user value. The difference between a profitable AI feature and a money-losing one is almost always cost optimization, not model quality or feature completeness.

Article Overview

  1. Measure Before You Optimize
  2. Model Routing: The Biggest Single Lever
  3. Prompt Optimization: Trim the Fat
  4. Caching: The Free Money Optimization
  5. Output Token Management
  6. Batch Processing for Non-Real-Time Workloads
  7. The Optimization Roadmap

HARBOR SOFTWARE · Engineering Insights

At Harbor Software, we have reduced LLM costs by 60-85% across multiple production applications without meaningful quality degradation. This is not about using worse models or degrading the user experience. It is about eliminating waste, routing intelligently, and caching effectively. Here are the specific techniques that deliver the biggest savings, ranked by impact.

Measure Before You Optimize

You cannot optimize what you do not measure. Before implementing any cost reduction technique, you need visibility into where your tokens are going, which features are most expensive, and where the optimization opportunities are largest. This measurement phase typically reveals that 20% of your features consume 80% of your LLM budget.

Instrument every LLM call with comprehensive metrics:

interface LLMCallMetrics {
  requestId: string;
  timestamp: number;
  model: string;
  feature: string;          // Which product feature triggered this call
  taskType: string;         // classification, generation, extraction, etc.
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  costUSD: number;           // Calculated from token counts and current model pricing
  latencyMs: number;
  cacheHit: boolean;
  retryCount: number;
  userId: string;            // For per-user cost analysis
  success: boolean;          // Did the downstream validation pass?
}

// Prices are in USD per 1,000 tokens
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4': { input: 0.03, output: 0.06 },
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo': { input: 0.0015, output: 0.002 },
  'claude-2': { input: 0.008, output: 0.024 },
};

function calculateCost(model: string, usage: { prompt_tokens: number; completion_tokens: number }): number {
  const pricing = MODEL_PRICING[model];
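  if (!pricing) {
    // An unknown model would otherwise be silently recorded as free
    console.warn(`No pricing configured for model "${model}"; recording cost as 0`);
    return 0;
  }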
  return (usage.prompt_tokens / 1000 * pricing.input) +
         (usage.completion_tokens / 1000 * pricing.output);
}

async function trackedLLMCall(params: LLMCallParams): Promise<LLMResponse> {
  const start = Date.now();
  const response = await llm.chat(params);

  const metrics: LLMCallMetrics = {
    requestId: generateId(),
    timestamp: start,
    model: params.model,
    feature: params.metadata.feature,
    taskType: params.metadata.taskType,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    totalTokens: response.usage.total_tokens,
    costUSD: calculateCost(params.model, response.usage),
    latencyMs: Date.now() - start,
    cacheHit: false,
    retryCount: 0,
    userId: params.metadata.userId,
    success: true
  };

  await metricsStore.record(metrics);
  return response;
}

After one week of instrumentation, you will have answers to critical questions: Which features consume the most tokens? What is the average prompt size by feature? What percentage of calls could use a cheaper model without quality loss? Where are tokens being wasted on verbose system prompts or over-fetched context? Which users are generating the most cost?

Build a dashboard that shows daily cost broken down by feature, model, and task type. This dashboard becomes your optimization roadmap. The feature with the highest cost and the most room for improvement gets optimized first.
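The breakdown behind such a dashboard can be sketched as a simple aggregation over recorded metrics. This is a minimal illustration, not a full dashboard backend: `CostRecord`, `CostSummary`, and `summarizeCost` are hypothetical names, and the records are assumed to carry the `feature`, `model`, and `costUSD` fields from `LLMCallMetrics`.

```typescript
// Hypothetical subset of LLMCallMetrics needed for a cost breakdown.
interface CostRecord {
  feature: string;
  model: string;
  costUSD: number;
}

interface CostSummary {
  key: string;
  calls: number;
  totalCostUSD: number;
  shareOfTotal: number; // Fraction of total spend, between 0 and 1
}

// Groups records by the chosen dimension and sorts the most expensive group first.
function summarizeCost(records: CostRecord[], dimension: 'feature' | 'model'): CostSummary[] {
  const totals: Record<string, { calls: number; cost: number }> = {};
  let grandTotal = 0;

  for (const record of records) {
    const key = record[dimension];
    if (!totals[key]) totals[key] = { calls: 0, cost: 0 };
    totals[key].calls += 1;
    totals[key].cost += record.costUSD;
    grandTotal += record.costUSD;
  }

  return Object.keys(totals)
    .map((key) => ({
      key,
      calls: totals[key].calls,
      totalCostUSD: totals[key].cost,
      shareOfTotal: grandTotal > 0 ? totals[key].cost / grandTotal : 0,
    }))
    .sort((a, b) => b.totalCostUSD - a.totalCostUSD);
}
```

The first entry of `summarizeCost(records, 'feature')` is the feature to investigate first; running the same function with `'model'` shows how much spend sits on the most expensive tier.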

Model Routing: The Biggest Single Lever

The single most effective cost optimization is using cheaper models for easier tasks. At the prices listed above, GPT-4 costs 20x more than GPT-3.5-turbo per input token and 30x more per output token. Claude Opus costs roughly 15x more than Claude Haiku, though the exact ratio depends on the model generation. Most applications send every single request to their most expensive model by default because the developer chose GPT-4 during prototyping and never revisited the decision.

A model router evaluates each request based on task characteristics and routes it to the cheapest model that can handle it reliably:

const MODEL_TIERS = {
  cheap:     { model: 'gpt-3.5-turbo', inputCost: 0.0015, outputCost: 0.002, quality: 'good' },
  mid:       { model: 'gpt-4-turbo',   inputCost: 0.01,   outputCost: 0.03,  quality: 'great' },
  expensive: { model: 'gpt-4',         inputCost: 0.03,   outputCost: 0.06,  quality: 'best' }
};

function selectModel(request: LLMRequest): ModelConfig {
  // Task-based routing: most impactful routing decision
  switch (request.taskType) {
    // Simple classification: cheap model handles 95%+ as well as GPT-4
    case 'classify':
    case 'sentiment':
    case 'extract_entity':
    case 'detect_language':
      return MODEL_TIERS.cheap;

    // Moderate complexity: mid-tier balances cost and quality
    case 'summarize':
    case 'rewrite':
    case 'translate':
      return MODEL_TIERS.mid;

    // Complex reasoning: worth paying for the best model
    case 'code_generation':
    case 'complex_analysis':
    case 'multi_step_reasoning':
      return MODEL_TIERS.expensive;

    default:
      return MODEL_TIERS.mid; // Default to mid-tier, not expensive
  }
}

// Advanced: cascade routing - try cheap first, escalate if confidence is low
async function cascadeRoute(request: LLMRequest): Promise<LLMResponse> {
  // First attempt with cheap model
  const cheapResponse = await llm.chat({
    ...request,
    model: MODEL_TIERS.cheap.model
  });

  // Check if we're confident in the cheap model's answer
  // (This check is task-specific - for classification, check output format;
  // for extraction, check if required fields are populated)
  if (isConfidentResponse(cheapResponse, request.taskType)) {
    return cheapResponse;
  }

  // Escalate to expensive model for uncertain cases
  metrics.increment('model.escalation', { from: 'cheap', to: 'expensive' });
  return llm.chat({
    ...request,
    model: MODEL_TIERS.expensive.model
  });
}
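The cascade relies on `isConfidentResponse`, which is not defined in the code above. One possible sketch follows, assuming the response exposes a `content` string, that classification tasks must return one label from a known set, and that extraction tasks must return JSON with certain fields populated. The label sets and field names here are hypothetical placeholders.

```typescript
interface SimpleResponse {
  content: string;
}

// Hypothetical label sets and required fields; replace with your own schema.
const ALLOWED_LABELS: Record<string, string[]> = {
  classify: ['billing', 'technical', 'account', 'other'],
  sentiment: ['positive', 'neutral', 'negative'],
};
const REQUIRED_FIELDS: Record<string, string[]> = {
  extract_entity: ['name', 'type'],
};

function isConfidentResponse(response: SimpleResponse, taskType: string): boolean {
  const text = response.content.trim();
  if (text.length === 0) return false;

  // Classification: the output must be exactly one of the allowed labels.
  const labels = ALLOWED_LABELS[taskType];
  if (labels) {
    return labels.indexOf(text.toLowerCase()) !== -1;
  }

  // Extraction: the output must be a JSON object with every required field populated.
  const fields = REQUIRED_FIELDS[taskType];
  if (fields) {
    try {
      const parsed = JSON.parse(text);
      if (parsed === null || typeof parsed !== 'object') return false;
      return fields.every(
        (field) => parsed[field] !== undefined && parsed[field] !== null && parsed[field] !== ''
      );
    } catch {
      return false; // Not valid JSON
    }
  }

  // No cheap check is known for this task type, so escalate.
  return false;
}
```

Every escalated request pays for both the cheap call and the expensive one, so the cascade only saves money when the escalation rate stays low. The `model.escalation` counter is what tells you whether it does.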

We implemented model routing for a customer support application. Before routing, 100% of requests went to GPT-4. After analyzing one week of metrics, we found that 55% of requests were simple classification or entity extraction tasks that GPT-3.5-turbo handled perfectly. Another 30% were moderate tasks where GPT-4-turbo was sufficient. Only 15% genuinely benefited from GPT-4.

The result: monthly cost dropped from $42,000 to $14,000. Customer satisfaction scores were statistically unchanged because the quality-sensitive tasks still used GPT-4. This single optimization saved $336,000 annually.

Prompt Optimization: Trim the Fat

Every token in your prompt costs money. Verbose prompts are expensive prompts. After model routing, prompt optimization is the second-highest-impact lever.

Audit and Trim System Prompts

System prompts accumulate instructions over time. Each edge case that triggers a bug gets a new rule added to the system prompt. After six months, your system prompt is 2,000 tokens of accumulated instructions, many of which are redundant, contradictory, or addressing issues that were fixed elsewhere.

Audit your system prompt quarterly using your evaluation suite. For each instruction, remove it and run the eval. If accuracy does not drop, the instruction is not contributing. We typically find that 30-40% of system prompt instructions can be removed without measurable impact. On a prompt that runs 100,000 times per day, removing 500 tokens saves: 500 tokens * 100,000 calls * $0.03/1K tokens = $1,500/day = $45,000/month.
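The remove-and-re-evaluate audit and the savings arithmetic can be sketched as follows. This is an illustration under assumptions: `runEval` is a hypothetical stand-in for your evaluation suite that returns accuracy between 0 and 1 (shown as synchronous for brevity, while a real suite would likely be asynchronous), and the `tolerance` value is an arbitrary example.

```typescript
// Hypothetical evaluator: returns accuracy (0..1) for a prompt built from the given instructions.
type Evaluator = (instructions: string[]) => number;

interface AuditResult {
  instruction: string;
  accuracyWithout: number;
  removable: boolean;
}

// Leave-one-out audit: remove each instruction in turn and compare against the baseline.
function auditInstructions(
  instructions: string[],
  runEval: Evaluator,
  tolerance: number = 0.005
): AuditResult[] {
  const baseline = runEval(instructions);
  return instructions.map((instruction, i) => {
    const without = instructions.filter((_, j) => j !== i);
    const accuracyWithout = runEval(without);
    return {
      instruction,
      accuracyWithout,
      removable: accuracyWithout >= baseline - tolerance,
    };
  });
}

// Daily savings from trimming a prompt; pricePer1K is the input price in USD per 1,000 tokens.
function dailySavingsUSD(tokensRemoved: number, callsPerDay: number, pricePer1K: number): number {
  return ((tokensRemoved * callsPerDay) / 1000) * pricePer1K;
}
```

With the figures from the text, `dailySavingsUSD(500, 100000, 0.03)` works out to $1,500 per day. Because leave-one-out tests each instruction in isolation, instructions flagged as removable can still interact, so re-run the eval on the final trimmed prompt before shipping it.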

Compress Retrieved Context in RAG

RAG applications are the biggest source of token waste. A typical setup retrieves 5 chunks of 500 tokens each, adding 2,500 tokens to every request. But relevance is not binary. Some chunks are highly relevant, others are marginal, and some are irrelevant noise that was above the similarity threshold.

async function optimizedRetrieval(query: string, options: RetrievalOptions = {}): Promise<string> {
  const { maxContextTokens = 1500, minRelevanceScore = 0.75 } = options;

  // 1. Over-retrieve candidates (cheap operation)
  const candidates = await vectorStore.query(query, { topK: 10 });

  // 2. Filter by minimum relevance score
  const relevant = candidates.filter(c => c.score >= minRelevanceScore);
  logger.debug(`Filtered ${candidates.length} candidates to ${relevant.length} relevant chunks`);

  // 3. Rerank to find the truly best ones (optional, adds ~10ms)
  const reranked = await reranker.rerank(query, relevant, { topK: 3 });

  // 4. Summarize long chunks to save tokens
  const optimizedChunks = await Promise.all(
    reranked.map(async (chunk) => {
      const tokenCount = estimateTokens(chunk.text);
      if (tokenCount > 300) {
        // Summarize long chunks with cheap model
        const summary = await llm.chat({
          model: 'gpt-3.5-turbo',
          messages: [{
            role: 'user',
            content: `Summarize the following text in 2-3 sentences, preserving key facts:\n\n${chunk.text}`
          }],
          max_tokens: 150
        });
        return summary.content;
      }
      return chunk.text;
    })
  );

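  // 5. Enforce the token budget: keep chunks in rank order until maxContextTokens is reached
  const selected: string[] = [];
  let usedTokens = 0;
  for (const text of optimizedChunks) {
    const tokens = estimateTokens(text);
    if (usedTokens + tokens > maxContextTokens) break;
    selected.push(text);
    usedTokens += tokens;
  }

  return selected.join('\n\n');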
}

This pipeline reduces context tokens by 40-60% compared to naive top-K retrieval. The reranking step is optional but highly effective. Cohere Rerank costs $1 per 1,000 queries, which is far cheaper than the tokens you save by filtering irrelevant chunks.

Optimize Few-Shot Examples

Few-shot examples are effective but expensive. Each example adds both an input and an output to the prompt, typically 100-200 tokens each. Three examples add 300-600 tokens per request.

Optimizations: Use 1-2 examples instead of 3-5 and measure the quality impact with your eval suite. Use the shortest possible examples that demonstrate the pattern. Most importantly, dynamically select the most relevant example based on the input rather than including a fixed set. A single highly relevant example often outperforms three generic ones while using about two-thirds fewer example tokens.
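Dynamic example selection might look like the sketch below. To stay self-contained it scores relevance with word overlap (Jaccard similarity), which is a deliberately simple stand-in; in production you would more likely compare embeddings. `FewShotExample`, `jaccard`, and `selectExamples` are hypothetical names.

```typescript
interface FewShotExample {
  input: string;
  output: string;
}

// Builds a set of lowercase words, represented as object keys.
function wordSet(text: string): Record<string, true> {
  const words: Record<string, true> = {};
  for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (word.length > 0) words[word] = true;
  }
  return words;
}

// Jaccard similarity: shared words divided by total distinct words.
function jaccard(a: string, b: string): number {
  const setA = wordSet(a);
  const setB = wordSet(b);
  const keysA = Object.keys(setA);
  const keysB = Object.keys(setB);
  if (keysA.length === 0 || keysB.length === 0) return 0;
  const shared = keysA.filter((word) => setB[word] === true).length;
  return shared / (keysA.length + keysB.length - shared);
}

// Returns the k examples whose inputs are most similar to the incoming request.
function selectExamples(input: string, pool: FewShotExample[], k: number = 1): FewShotExample[] {
  return pool
    .map((example) => ({ example, score: jaccard(input, example.input) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.example);
}
```

Whichever similarity measure you use, validate the change with your eval suite: the claim that one relevant example beats three generic ones should be checked per task rather than assumed.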

Caching: The Free Money Optimization

Caching is the highest-ROI optimization for applications with repetitive queries. If users frequently ask the same or similar questions about the same content, cache the results and serve them without making an LLM call. The simplest version is an exact-match cache keyed on a hash of the request:

import { createHash } from 'crypto';
import Redis from 'ioredis'; // Assumption: an ioredis client, which provides get() and setex()

class LLMCache {
  private redis: Redis;
  private ttlSeconds: number;

  constructor(redis: Redis, ttlSeconds: number = 3600) {
    this.redis = redis;
    this.ttlSeconds = ttlSeconds;
  }

  private hashKey(messages: Message[], model: string): string {
    const content = JSON.stringify({ messages, model });
    return `llm:v1:${createHash('sha256').update(content).digest('hex')}`;
  }

  async get(messages: Message[], model: string): Promise<LLMResponse | null> {
    const cached = await this.redis.get(this.hashKey(messages, model));
    if (cached) {
      metrics.increment('llm.cache_hit');
      return JSON.parse(cached);
    }
    metrics.increment('llm.cache_miss');
    return null;
  }

  async set(messages: Message[], model: string, response: LLMResponse): Promise<void> {
    const key = this.hashKey(messages, model);
    await this.redis.setex(key, this.ttlSeconds, JSON.stringify({
      content: response.content,
      usage: response.usage,
      cachedAt: Date.now()
    }));
  }
}

// Integration into the LLM call pipeline
const cache = new LLMCache(redis, 3600); // 1 hour TTL

async function cachedLLMCall(params: LLMCallParams): Promise<LLMResponse> {
  // Only cache deterministic calls (temperature 0)
  if (params.temperature === 0) {
    const cached = await cache.get(params.messages, params.model);
    if (cached) return cached;
  }

  const response = await trackedLLMCall(params);

  if (params.temperature === 0) {
    await cache.set(params.messages, params.model, response);
  }

  return response;
}

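Important: exact-match caching is most reliable at temperature 0, where identical inputs produce nearly identical outputs. At any non-zero temperature the model may produce different responses for identical inputs, so a cached response freezes one sample among many and may be inconsistent with what a fresh call would return.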

For semantic caching (matching similar but not identical queries), embed the incoming query and check for nearest neighbors in a small cache index. If the cosine similarity exceeds 0.95, serve the cached response. This catches paraphrased versions of the same question: “What’s the return policy?” and “How do I return an item?” are different strings but the same question.
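A semantic cache lookup could be sketched like this. It is a minimal in-memory illustration, assuming query embeddings are computed elsewhere and passed in as number arrays; `SemanticCache` and its methods are hypothetical names, and the 0.95 default threshold is the figure from the text.

```typescript
interface SemanticCacheEntry {
  embedding: number[];
  response: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: SemanticCacheEntry[] = [];

  constructor(private readonly threshold: number = 0.95) {}

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }

  // Linear scan for the most similar cached query; returns null below the threshold.
  lookup(queryEmbedding: number[]): string | null {
    let best: SemanticCacheEntry | null = null;
    let bestScore = -1;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        bestScore = score;
        best = entry;
      }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }
}
```

A linear scan is adequate for a small cache index; a larger one would call for a vector index. The threshold deserves tuning against real traffic, since a value that is too low serves answers to questions the user did not ask.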

Our production cache hit rates range from 15% (diverse queries in customer support) to 65% (repetitive queries in document analysis). Even a 15% hit rate on a $50,000/month bill saves $7,500/month with zero quality degradation.

Output Token Management

Input tokens get all the optimization attention, but output tokens usually cost more per token, often 2-3x more (compare the input and output prices in the pricing table above). Output token count is also directly influenced by your prompt instructions.

  • Set max_tokens explicitly. If you expect a 100-token response, set max_tokens: 150. Without this limit, the model may generate 500 tokens of unnecessary elaboration, “helpful” context, or repetitive rephrasing.
  • Instruct conciseness explicitly. “Respond in 2-3 sentences” is more effective than “Be concise.” Specific length instructions produce shorter, cheaper output than vague quality directives.
  • Use structured output. JSON responses are typically more concise than natural language responses for the same information. {"category": "billing", "priority": "high"} is 10 tokens. “This appears to be a high-priority billing-related inquiry that should be routed to the billing department” is 22 tokens. Same information, half the tokens, double the parseability.
  • Avoid chain-of-thought for simple tasks. CoT prompting makes the model explain its reasoning, which generates many output tokens. For simple classification and extraction tasks, the reasoning is unnecessary. Use CoT only for tasks where the reasoning actually improves accuracy.
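One way to apply the first three points consistently is a per-task output policy that pairs a token cap with an explicit length or format instruction. The sketch below is illustrative: the task names, token limits, and instruction wording are assumptions to tune against your own outputs, and `applyOutputPolicy` is a hypothetical helper.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface OutputPolicy {
  maxTokens: number;
  instruction: string;
}

// Hypothetical per-task budgets; derive real values from observed output lengths.
const OUTPUT_POLICIES: Record<string, OutputPolicy> = {
  classify: {
    maxTokens: 20,
    instruction: 'Respond with JSON only: {"category": string, "priority": string}.',
  },
  summarize: { maxTokens: 150, instruction: 'Respond in 2-3 sentences.' },
};

const DEFAULT_POLICY: OutputPolicy = {
  maxTokens: 300,
  instruction: 'Respond in at most one short paragraph.',
};

// Returns new request fields without mutating the caller's messages array.
function applyOutputPolicy(
  taskType: string,
  messages: ChatMessage[]
): { messages: ChatMessage[]; max_tokens: number } {
  const policy = OUTPUT_POLICIES[taskType] ?? DEFAULT_POLICY;
  const lengthInstruction: ChatMessage = { role: 'system', content: policy.instruction };
  return { messages: [lengthInstruction, ...messages], max_tokens: policy.maxTokens };
}
```

The two halves do different jobs: `max_tokens` is a hard cutoff that truncates a long answer rather than shortening it, so the instruction is what actually makes the response concise, and the cap bounds the cost when the instruction is ignored.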

Batch Processing for Non-Real-Time Workloads

Both OpenAI and Anthropic offer batch APIs with 50% cost reduction. The trade-off is latency: results are delivered within 24 hours instead of in real time. If your workload can tolerate this latency, batch processing halves your bill for that workload.

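// OpenAI Batch API usage
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));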

// 1. Prepare batch file (JSONL format)
const batchRequests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4',
    messages: [{ role: 'user', content: `Summarize: ${doc.content}` }],
    max_tokens: 200
  }
}));
fs.writeFileSync('batch.jsonl', batchRequests.map(r => JSON.stringify(r)).join('\n'));

// 2. Upload and create batch
const file = await openai.files.create({ file: fs.createReadStream('batch.jsonl'), purpose: 'batch' });
const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});

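// 3. Poll until the batch reaches a terminal state (or use webhook)
// 'expired' and 'cancelled' are also terminal; without them the loop could run forever
const terminalStates = ['completed', 'failed', 'expired', 'cancelled'];
let status = await openai.batches.retrieve(batch.id);
while (!terminalStates.includes(status.status)) {
  await sleep(60000); // Check every minute
  status = await openai.batches.retrieve(batch.id);
}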

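// 4. Download results (the output file is JSONL: one result object per line)
if (status.status === 'completed' && status.output_file_id) {
  const results = await openai.files.content(status.output_file_id);
  const lines = (await results.text()).trim().split('\n').map((line) => JSON.parse(line));
  // Process results; match each line back to its request via custom_id
}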

We use batch processing for all non-real-time workloads: nightly content moderation sweeps, weekly customer segmentation analysis, monthly report generation, and document processing pipelines. This accounts for about 30% of our total LLM token usage and saves $8,000-12,000/month at current volumes because every token is processed at half price.

The Optimization Roadmap

Implement these optimizations in the order below, which weighs expected impact against implementation effort, spending roughly one week on each:

  1. Instrument everything (Week 1) – Add metrics to every LLM call. Build the cost dashboard. Identify the top 3 most expensive features. This costs nothing and provides the data for all subsequent optimizations.
  2. Model routing (Week 2) – Route cheap tasks to cheap models. Typical savings: 50-70%. This is the biggest single lever and the lowest risk change (you can A/B test and roll back instantly).
  3. Caching (Week 3) – Add exact-match caching with Redis. Typical savings: 15-40% depending on query diversity and repetition patterns.
  4. Prompt optimization (Week 4) – Audit system prompts, compress RAG context, optimize few-shot examples. Typical savings: 20-40% through token reduction.
  5. Batch processing (Week 5) – Move non-real-time workloads to batch API. 50% savings on eligible workloads.
  6. Output token management (Ongoing) – Set max_tokens, enforce conciseness, use structured output. 10-20% savings from output optimization.

The cumulative effect of all these optimizations is typically a 60-85% reduction in total LLM costs. On a $50,000/month bill, that is $30,000-42,500 in monthly savings, or $360,000-510,000 annually. These are real numbers from our production applications, not theoretical projections.
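The way individual savings stack can be illustrated with a short calculation. It assumes each optimization applies independently to whatever spend remains after the previous ones, which is a simplification; `Optimization` and `combinedReduction` are hypothetical names, and the sample figures are the low ends of the ranges quoted above plus the 30% batch-eligible share.

```typescript
interface Optimization {
  name: string;
  savings: number;  // Fractional saving on the spend it applies to (0..1)
  coverage: number; // Fraction of total spend it applies to (0..1)
}

// Savings compound multiplicatively: each step shrinks the spend that remains.
function combinedReduction(steps: Optimization[]): number {
  let remaining = 1;
  for (const step of steps) {
    remaining *= 1 - step.savings * step.coverage;
  }
  return 1 - remaining;
}

const conservativePlan: Optimization[] = [
  { name: 'model routing', savings: 0.5, coverage: 1 },
  { name: 'caching', savings: 0.15, coverage: 1 },
  { name: 'prompt optimization', savings: 0.2, coverage: 1 },
  { name: 'batch processing', savings: 0.5, coverage: 0.3 },
];

const reduction = combinedReduction(conservativePlan);
console.log(`Combined reduction: ${(reduction * 100).toFixed(1)}%`);
```

With these conservative inputs the remaining spend is 0.5 × 0.85 × 0.8 × 0.85 = 0.289 of the original, a reduction of about 71%. The individual percentages do not simply add up, which is why the combined figure lands in the 60-85% range rather than above 100%.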

Conclusion

LLM cost optimization is not about finding a single silver bullet. It is about systematically reducing waste across every dimension: using cheaper models for easier tasks, caching repetitive queries, compressing prompts and context, controlling output verbosity, and batching non-real-time workloads. Each technique delivers 10-70% savings on its own. Applied together, they compound to 60-85% total cost reduction.

Start with measurement. You cannot optimize what you cannot see. Implement model routing first because it delivers the largest savings with the least risk to quality. Add caching and prompt optimization iteratively, always validating quality metrics against your eval suite before deploying each change.

The companies building sustainable AI businesses are not the ones with the most impressive demos or the most advanced models. They are the ones who have figured out how to deliver AI-powered features profitably at scale.

A common objection we hear is “prices will drop, so we don’t need to optimize.” Prices will drop, but usage will grow faster than prices fall. Every new AI-powered feature you ship increases your token consumption. Every new user multiplies your cost. The cost optimization techniques in this post are not temporary workarounds that become unnecessary when prices drop. They are permanent engineering disciplines that ensure your AI features remain profitable as you scale, regardless of provider pricing changes.

Cost optimization is not a nice-to-have that you address someday when the bill gets uncomfortable. It is the difference between a feature that scales to millions of users profitably and one that gets shut down when the CFO reviews the cloud bill. Build the measurement infrastructure first, implement model routing second, and add the remaining optimizations as you scale. Your future self, and your finance team, will thank you.
