
Prompt Engineering for Production Applications

Most prompt engineering advice comes from people tinkering with ChatGPT. They share tips like “be specific” and “give examples” as though these insights are revelatory. Building prompts for production applications is a fundamentally different discipline. You are not crafting a single clever instruction; you are designing a system that must handle thousands of unpredictable inputs, degrade gracefully, and produce outputs that downstream code can reliably parse.

Article Overview

  1. The Prompt Is Not a String, It Is Architecture
  2. Version Control and Regression Testing
  3. Temperature, Top-P, and the Pursuit of Determinism
  4. Defensive Prompt Patterns
  5. Cost and Latency Management
  6. The Evaluation Flywheel
  7. Multi-Model Strategies

HARBOR SOFTWARE · Engineering Insights

At Harbor Software, we have shipped LLM-powered features across multiple products. The gap between a prompt that works in a playground and one that works in production is enormous. Here is what we have learned about bridging that gap, turning experimental prompts into hardened systems that run at scale without constant babysitting.

The Prompt Is Not a String, It Is Architecture

The first mistake teams make is treating the prompt as a single monolithic string buried in application code. In production, your prompt is composed of multiple layers that are assembled dynamically:

  • System message – Defines the persona, constraints, and output format. This rarely changes between requests.
  • Context injection – Retrieved documents, user history, database records. This changes on every request.
  • Few-shot examples – Carefully selected demonstrations of correct input-output pairs.
  • User input – The actual query or content being processed.
  • Output schema directive – Explicit formatting instructions, often repeated at the end for recency bias.

Each layer has its own lifecycle. System messages are version-controlled and A/B tested. Context injection depends on your retrieval pipeline. Few-shot examples are curated datasets. Treating the prompt as architecture means each component is independently testable, measurable, and deployable.

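// Prompt assembly in production.
// loadSystemPrompt and formatUserMessage are application helpers defined elsewhere.
const buildPrompt = ({ systemVersion, retrievedDocs = [], examples = [], userInput, schema }) => {
  return [
    // Layer 1: versioned system message
    { role: 'system', content: loadSystemPrompt(systemVersion) },
    // Layer 2: few-shot examples as alternating user/assistant turns
    ...examples.flatMap(ex => [
      { role: 'user', content: ex.input },
      { role: 'assistant', content: ex.output }
    ]),
    // Layers 3-5: retrieved context, user input, and output schema in the final user turn
    { role: 'user', content: formatUserMessage(retrievedDocs, userInput, schema) }
  ];
};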
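
The assembly function relies on a formatUserMessage helper that is not shown. One possible shape for it is sketched below. The document shape ({ title, text }) and the section labels are assumptions made for illustration, not the format used in our systems.

```javascript
// Hypothetical helper: one way formatUserMessage could combine the dynamic layers.
// Assumes retrievedDocs is an array of { title, text } objects and schema is a
// string describing the expected output.
const formatUserMessage = (retrievedDocs = [], userInput = '', schema = '') => {
  const context = retrievedDocs
    .map((doc, i) => `[Document ${i + 1}: ${doc.title}]\n${doc.text}`)
    .join('\n\n');

  // Each layer becomes its own labeled section so a failure can be traced to one of them.
  const sections = [];
  if (context) sections.push(`Context:\n${context}`);
  sections.push(`User input:\n${userInput}`);
  // The schema goes last so the format requirement is the most recent instruction.
  if (schema) sections.push(`Respond using this format:\n${schema}`);
  return sections.join('\n\n');
};
```

Keeping the schema directive as the final section matches the "repeated at the end" idea from the layer list above.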

This is not over-engineering. When your prompt breaks in production (and it will), you need to know which layer failed. Was the retrieval returning irrelevant documents? Did a system prompt change introduce a regression? Were the few-shot examples insufficient for this edge case? Monolithic prompts make debugging impossible.

We maintain separate repositories for prompt content and application code. Prompt changes go through their own review process, with diffs that show exactly what text changed. This separation lets product managers and domain experts contribute to prompt improvements without touching application code, and it creates a clear audit trail when something breaks.

Version Control and Regression Testing

Every production prompt needs version control, and not just in Git. You need a prompt registry that tracks which version is deployed, what changed, and how it performs against your evaluation suite.

Our approach at Harbor uses a simple but effective system. Each prompt lives in a versioned text file with metadata headers:

// prompts/classify-intent/v3.2.txt
---
version: 3.2
author: engineering
date: 2023-07-28
changes: Added edge case handling for multi-intent queries
eval_score: 0.94 (up from 0.91 in v3.1)
---
You are an intent classifier for a customer support system.
You must classify the user's message into exactly one of these categories:
- billing_inquiry
- technical_support
- account_management
- feature_request
- complaint
- general_question

Rules:
1. If the message contains multiple intents, classify by the PRIMARY intent.
2. If the primary intent is ambiguous, prefer technical_support over general_question.
3. Never output a category not in the list above.

Respond with ONLY the category name, no explanation.
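
A file in this format can be loaded with a small parser. The sketch below is a minimal illustration, assuming the file begins with a --- line, uses simple "key: value" headers, and closes the header with a second --- line. The function name parsePromptFile is hypothetical, and this is not a full YAML parser.

```javascript
// Minimal sketch: split a prompt file into its metadata header and prompt body.
function parsePromptFile(text) {
  const lines = text.split(/\r?\n/);
  if (lines[0].trim() !== '---') {
    throw new Error('Prompt file must start with a --- metadata header');
  }
  // Find the closing --- line of the header.
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === '---');
  if (end === -1) throw new Error('Unterminated metadata header');

  const metadata = {};
  for (const line of lines.slice(1, end)) {
    const sep = line.indexOf(':');
    if (sep === -1) continue; // skip lines that are not key: value
    metadata[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return { metadata, body: lines.slice(end + 1).join('\n').trim() };
}
```

All metadata values stay as strings here, so a registry that compares versions or scores would need to convert them itself.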

The evaluation suite is the critical piece. For every prompt, we maintain a test set of at least 200 input-output pairs. These are real production examples that have been human-labeled. When we modify a prompt, we run the full evaluation before deployment. We treat prompt changes with the same rigor as code changes: no deployment without passing tests.

// Runs every test case against one prompt version and reports exact-match accuracy.
// `llm` is the application's LLM client wrapper.
async function evaluatePrompt(promptVersion, testSet) {
  const results = [];
  for (const testCase of testSet) {
    const response = await llm.complete({
      messages: buildPrompt({ systemVersion: promptVersion, userInput: testCase.input }),
      temperature: 0,
      model: 'gpt-4'
    });
    const actual = response.content.trim();
    results.push({
      input: testCase.input,
      expected: testCase.expected,
      actual,
      pass: actual === testCase.expected
    });
  }
  const failures = results.filter(r => !r.pass);
  // Guard against an empty test set to avoid dividing by zero.
  const score = results.length ? (results.length - failures.length) / results.length : 0;
  console.log(`Prompt ${promptVersion}: ${(score * 100).toFixed(1)}% accuracy`);
  return { score, failures };
}

A prompt change that drops accuracy from 94% to 89% on your eval set will cause real pain in production. Catch it before deployment. We also run evaluations across multiple models to ensure a prompt change that improves GPT-4 performance does not degrade Claude or GPT-3.5-turbo performance if we use those for routing.

One pattern we have found invaluable is maintaining a “golden set” of 20-30 examples that represent the hardest edge cases. These are the inputs that have caused production failures in the past. Every prompt change must pass 100% of the golden set. The broader test set allows for some regression, but the golden set is non-negotiable.
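
The two-tier rule can be expressed as a small deployment gate. This is a sketch under assumptions: each result carries the { pass } flag that evaluatePrompt produces, and maxRegression is a hypothetical tolerance for the broad set, not a figure from our pipeline.

```javascript
// Sketch of a deployment gate: the golden set must pass completely,
// while the broad set may drop by at most maxRegression below the baseline.
function canDeploy({ goldenResults, broadResults, baselineScore, maxRegression = 0.01 }) {
  const goldenFailures = goldenResults.filter(r => !r.pass);
  if (goldenFailures.length > 0) {
    return { deploy: false, reason: 'golden_set_failure', failures: goldenFailures };
  }
  const score = broadResults.filter(r => r.pass).length / broadResults.length;
  if (score < baselineScore - maxRegression) {
    return { deploy: false, reason: 'broad_set_regression', score };
  }
  return { deploy: true, score };
}
```

Checking the golden set first means a single hard-case failure blocks the release regardless of how the broad score moved.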

Temperature, Top-P, and the Pursuit of Determinism

For classification, extraction, and structured output tasks, set temperature to 0. Always. There is no debate here. You want the most probable token sequence every time. Non-determinism in these tasks is a bug, not a feature.

For generative tasks (writing, summarization, creative content), temperature between 0.3 and 0.7 usually works. But here is the part most guides skip: even for generative tasks in production, you often want lower temperature than you think. A temperature of 0.9 produces creative output in demos. In production, it produces inconsistent output that breaks your UI assumptions about response length, formatting, and tone.

We have run controlled experiments on this. At temperature 0.9, the standard deviation of response length for the same prompt is 40-60% of the mean. At temperature 0.3, it drops to 10-15%. If your UI allocates a fixed space for an AI-generated summary, that variance matters. If your downstream processing expects a certain structure, that variance causes failures.
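
The ratio quoted above is the standard deviation of response length divided by the mean. A minimal sketch of that measurement follows. It measures length in characters and uses the population standard deviation, both of which are assumptions; counting tokens instead would follow the same pattern.

```javascript
// Relative spread of response lengths for repeated runs of the same prompt.
function lengthVariation(responses) {
  const lengths = responses.map(r => r.length);
  const mean = lengths.reduce((sum, n) => sum + n, 0) / lengths.length;
  const variance = lengths.reduce((sum, n) => sum + (n - mean) ** 2, 0) / lengths.length;
  const stdDev = Math.sqrt(variance);
  // ratio of 0.5 means the standard deviation is 50% of the mean length
  return { mean, stdDev, ratio: stdDev / mean };
}
```

Running this over a batch of responses collected at each temperature setting gives a number that can be compared directly against a UI length budget.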

Top-P (nucleus sampling) is the other lever. For most production use cases, we leave top_p at 1.0 and only adjust temperature. Changing both simultaneously makes it harder to reason about behavior. Pick one knob and use it consistently across your application.

A critical nuance: temperature 0 does not mean deterministic across API calls. OpenAI and Anthropic both note that responses may vary slightly even at temperature 0 due to floating-point non-determinism in GPU operations. If you need exact reproducibility (for caching, testing, or compliance), you need to implement your own caching layer keyed on the full prompt hash. We use a SHA-256 hash of the serialized message array as the cache key, stored in Redis with a configurable TTL.
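
A caching layer of this kind might look like the sketch below. It uses Node's built-in crypto module for the SHA-256 key and an in-memory Map as a stand-in for Redis so the example runs on its own. Including the model and temperature in the hashed payload is an added assumption, so that changing either never returns a stale hit. The names cacheKey, createCache, and cachedComplete are hypothetical.

```javascript
const crypto = require('crypto');

// Cache key: SHA-256 over the serialized request.
function cacheKey({ model, temperature, messages }) {
  const serialized = JSON.stringify({ model, temperature, messages });
  return crypto.createHash('sha256').update(serialized).digest('hex');
}

// In-memory stand-in for Redis with a fixed TTL in milliseconds.
function createCache(ttlMs) {
  const store = new Map();
  return {
    get(key, now = Date.now()) {
      const entry = store.get(key);
      if (!entry || entry.expiresAt <= now) return undefined;
      return entry.value;
    },
    set(key, value, now = Date.now()) {
      store.set(key, { value, expiresAt: now + ttlMs });
    },
  };
}

// Wraps any completion function: return the cached response when present,
// otherwise call the model once and store the result.
async function cachedComplete(cache, request, complete) {
  const key = cacheKey(request);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const response = await complete(request);
  cache.set(key, response);
  return response;
}
```

Because JSON.stringify is sensitive to property order, requests should be built by the same code path every time, or the same logical prompt will hash to different keys.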

Defensive Prompt Patterns

Production prompts must handle adversarial, malformed, and edge-case inputs without catastrophic failure. Here are the patterns we use consistently across all our production prompts:

Output Anchoring

LLMs have a tendency to drift. If your system prompt says “respond in JSON” but the user message is conversational, the model may slip into conversational mode. We anchor the output by restating the format requirement immediately before the model generates:

// Instead of just the user message at the end:
const userMessage = `
Analyze the following customer review:
"${review}"

Remember: You MUST respond with valid JSON matching this exact schema:
{"sentiment": "positive|negative|neutral", "topics": ["string"], "urgency": "low|medium|high"}
JSON response:`;

That trailing “JSON response:” is not cosmetic. It anchors the model’s next token generation toward JSON output. This single trick reduced our JSON parsing failures by 73% when we first introduced it. The model treats it as a continuation prompt, making it far more likely to begin the response with an opening brace rather than explanatory text.

Input Sanitization

User inputs can contain content that confuses the model: markdown formatting, HTML tags, prompt injection attempts, extremely long text, or non-English characters when you expect English. Sanitize before injection:

function sanitizeInput(input, maxLength = 4000) {
  return input
    .replace(/```[\s\S]*?```/g, '[code block removed]')  // Remove code blocks
    .replace(/<[^>]*>/g, '')                               // Strip HTML
    .replace(/\n{3,}/g, '\n\n')                            // Normalize whitespace
    .substring(0, maxLength)                               // Enforce length limit
    .trim();
}

The length limit is particularly important. Without it, a user submitting a 50,000-character input can overflow your context window and trigger an API error or, if the input happens to fit, inflate the token bill for that request. Decide on a reasonable maximum for your use case and enforce it before the prompt is assembled, not after.

Graceful Degradation

When the LLM output does not match your expected format, do not crash. Have a fallback path. We layer our parsing with multiple strategies:

function parseResponse(raw) {
  // Strategy 1: Direct JSON parse
  try {
    const parsed = JSON.parse(raw);
    if (isValidSchema(parsed)) return { success: true, data: parsed };
    return { success: false, error: 'schema_mismatch', raw };
  } catch (e) {
    // Strategy 2: Extract JSON from markdown code blocks
    const jsonMatch = raw.match(/```(?:json)?\n?([\s\S]*?)\n?```/);
    if (jsonMatch) {
      try {
        const parsed = JSON.parse(jsonMatch[1]);
        if (isValidSchema(parsed)) return { success: true, data: parsed };
      } catch (e2) { /* fall through */ }
    }
    // Strategy 3: Find JSON-like substring
    const braceMatch = raw.match(/\{[\s\S]*\}/);
    if (braceMatch) {
      try {
        const parsed = JSON.parse(braceMatch[0]);
        if (isValidSchema(parsed)) return { success: true, data: parsed };
      } catch (e3) { /* fall through */ }
    }
    return { success: false, error: 'parse_failure', raw };
  }
}

In our systems, approximately 3-5% of LLM responses require the fallback path. That is not a failure; that is expected behavior you plan for. The raw response is always logged so you can analyze failure patterns and improve the prompt to reduce future fallbacks.
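
The parser above depends on an isValidSchema check that is not shown. A minimal version for the review-analysis schema from the Output Anchoring example could look like this. It is a hand-written sketch rather than our production validator; a schema validation library would serve the same purpose.

```javascript
// Hypothetical validator for {"sentiment": ..., "topics": [...], "urgency": ...}
const SENTIMENTS = ['positive', 'negative', 'neutral'];
const URGENCIES = ['low', 'medium', 'high'];

function isValidSchema(value) {
  // Reject null, primitives, and arrays before looking at fields.
  if (value === null || typeof value !== 'object' || Array.isArray(value)) return false;
  return (
    SENTIMENTS.includes(value.sentiment) &&
    URGENCIES.includes(value.urgency) &&
    Array.isArray(value.topics) &&
    value.topics.every(topic => typeof topic === 'string')
  );
}
```

Validating enum values, not just the presence of keys, catches responses such as "urgency": "critical" that parse as JSON but would break downstream routing.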

Cost and Latency Management

Prompt engineering is directly tied to cost. Every token in your prompt costs money, and for high-volume applications, prompt verbosity becomes a real budget line item that shows up in your P&L.

Some concrete numbers from our production systems: A prompt with 2,000 input tokens running GPT-4 at $0.03/1K input tokens costs $0.06 per call. At 100,000 calls per day, that is $6,000/day or $180,000/month. Trimming that prompt to 1,200 tokens saves $72,000/month. That is a senior engineer’s salary saved by optimizing text.
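
The arithmetic behind those figures is easy to reproduce. The sketch below counts input tokens only and assumes a 30-day month, matching the numbers above; the function name monthlyInputCost is illustrative.

```javascript
// Input-token cost only; output tokens are billed separately and ignored here.
function monthlyInputCost({ promptTokens, pricePer1kTokens, callsPerDay, daysPerMonth = 30 }) {
  const perCall = (promptTokens / 1000) * pricePer1kTokens;
  return perCall * callsPerDay * daysPerMonth;
}

const before = monthlyInputCost({ promptTokens: 2000, pricePer1kTokens: 0.03, callsPerDay: 100000 });
const after = monthlyInputCost({ promptTokens: 1200, pricePer1kTokens: 0.03, callsPerDay: 100000 });

// Rounded to whole dollars to hide floating-point noise.
console.log(Math.round(before), Math.round(after), Math.round(before - after));
// prints: 180000 108000 72000
```

Plugging in your own token counts and current model pricing gives a quick estimate of what a prompt trim is worth before spending time on it.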

Strategies that actually work for cost reduction:

  • Use the cheapest model that meets your quality bar. GPT-3.5-turbo or Claude Haiku handles 70% of classification tasks as well as GPT-4. Route easy cases to cheap models and hard cases to expensive ones. We call this model routing and it typically reduces costs by 50-70%.
  • Minimize few-shot examples. Three examples often work as well as ten. Test this with your eval suite. Each example's input and output are both sent as prompt tokens on every single request.
  • Compress context. If you are injecting retrieved documents, summarize them first. A 500-token summary of a 3,000-token document loses some nuance but saves 83% of tokens per request.
  • Cache aggressively. If the same prompt produces the same output (temperature 0), cache the response. We use a Redis cache keyed on the SHA-256 hash of the full message array. Cache hit rates range from 15% to 65% depending on the application.
  • Set max_tokens explicitly. If you expect a 100-token response, set max_tokens to 150. Without this, the model may generate 500 tokens of unnecessary elaboration.
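
Model routing can take several forms. One simple variant is escalation: try the cheap model first and fall back to the expensive one only when the output fails a check. The sketch below illustrates that variant under assumptions: callModel is a hypothetical function (model, messages) => Promise<string>, isAcceptable is whatever validation fits the task, and the model names are placeholders.

```javascript
// Sketch of routing by escalation through an ordered list of model tiers.
async function routeRequest(messages, callModel, isAcceptable, models = ['cheap-model', 'expensive-model']) {
  let output = null;
  for (let i = 0; i < models.length; i++) {
    output = await callModel(models[i], messages);
    if (isAcceptable(output)) {
      return { model: models[i], output, escalated: i > 0, accepted: true };
    }
  }
  // Every tier was tried; hand back the last output so the caller can fall back.
  return { model: models[models.length - 1], output, escalated: models.length > 1, accepted: false };
}
```

Escalation pays for two calls on hard cases, so it only saves money when the cheap tier succeeds most of the time. A classifier that picks the tier up front avoids the double call at the price of another component to evaluate.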

Latency follows similar patterns. Longer prompts mean more time-to-first-token. Streaming helps perceived latency for user-facing applications but does not reduce total processing time. For batch operations, use the async batch API endpoints that OpenAI and Anthropic both offer at 50% reduced pricing. We batch all non-real-time workloads (nightly content analysis, weekly reporting) through these endpoints.

The Evaluation Flywheel

The teams that get the best results from prompt engineering are not the ones with the cleverest prompts. They are the ones with the best evaluation infrastructure. Cleverness is a one-time advantage. Systematic evaluation is a compounding advantage.

The flywheel works like this:

  1. Deploy a prompt version to production with comprehensive logging.
  2. Log every input, output, and downstream outcome (did the user accept the suggestion? did the classification lead to correct routing? did the extraction pass validation?).
  3. Sample failures and edge cases weekly. Human-label them with correct outputs.
  4. Add labeled examples to your evaluation set, growing it over time.
  5. Use the expanded eval set to test prompt modifications before deployment.
  6. Deploy improved prompt version. Return to step 2.

After three months of running this flywheel, our intent classification prompt went from 82% accuracy to 96%. Not through any single clever insight, but through systematic iteration with real production data. Each cycle identified 5-10 failure patterns, and each prompt revision addressed the most common failures.

The logging infrastructure is non-negotiable. At minimum, log: the full prompt (all layers), the raw response, the parsed response, the latency, the model used, the token counts, and any downstream success/failure signal. Store it in a queryable format. We use a combination of structured logging to BigQuery and a lightweight internal tool for browsing and labeling failed responses.
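
A single structured record per LLM call covers that minimum. The sketch below shows one possible shape. The field names and the buildLogRecord function are illustrative assumptions, not the schema we store in BigQuery, and parsed is assumed to have the { success, data, error } shape returned by parseResponse.

```javascript
// Sketch of one structured log record per LLM call.
function buildLogRecord({ promptVersion, messages, model, rawResponse, parsed, usage, startedAt, finishedAt, outcome = null }) {
  return {
    timestamp: new Date(finishedAt).toISOString(),
    promptVersion,
    model,
    messages,                                   // full prompt, all layers
    rawResponse,
    parsedResponse: parsed.success ? parsed.data : null,
    parseError: parsed.success ? null : parsed.error,
    latencyMs: finishedAt - startedAt,
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
    outcome,                                    // downstream success/failure signal, filled in later
  };
}
```

Keeping parseError as its own field makes it straightforward to query for the responses that needed the fallback path.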

The labeling tool does not need to be sophisticated. A simple web interface that shows the input, the model’s output, and a field for the correct output is sufficient. The key is making it easy enough that someone can label 50 examples in 30 minutes. If labeling is painful, it will not happen, and the flywheel stalls.

Multi-Model Strategies

Production applications should not be coupled to a single LLM provider. Your prompt engineering should account for model differences and enable graceful failover:

  • Anthropic Claude tends to be more instruction-following and less likely to refuse benign requests. It handles XML-structured prompts exceptionally well. Use XML tags like <context> and <instructions> to structure prompts for Claude.
  • OpenAI GPT-4 has stronger code generation and tends to produce more concise outputs. It handles JSON schema enforcement natively with function calling. Use function definitions for structured output.
  • Open-source models (Llama 2, Mistral) require more explicit instructions and more few-shot examples to match the quality of frontier models. They benefit from constrained decoding for structured output.

We maintain model-specific prompt variants for critical paths. The system message for Claude uses XML tags for structure; the same logical prompt for GPT-4 uses markdown headers. The semantic content is identical, but the formatting is optimized per model.

// Model-specific prompt formatting
const formatForModel = (content, model) => {
  if (model.startsWith('claude')) {
    // XML tags for Claude (tag names are illustrative)
    return `<instructions>\n${content.instructions}\n</instructions>\n<context>\n${content.context}\n</context>\n<output_format>\n${content.outputFormat}\n</output_format>`;
  }
  if (model.startsWith('gpt')) {
    // Markdown headers for GPT models
    return `## Instructions\n${content.instructions}\n\n## Context\n${content.context}\n\n## Output Format\n${content.outputFormat}`;
  }
  // Open source: more verbose, include examples
  return `TASK: ${content.instructions}\n\nINPUT DATA:\n${content.context}\n\nOUTPUT REQUIREMENTS:\n${content.outputFormat}\n\nEXAMPLE OUTPUT:\n${content.example}`;
};

This model-specific formatting adds maintenance overhead, but it consistently improves output quality by 5-10% per model compared to a one-size-fits-all prompt. When you are operating at scale, that improvement compounds into meaningful quality gains.
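
Per-model formatting pairs naturally with failover. The sketch below tries providers in order and formats the prompt for each one before calling it. It assumes each provider is a hypothetical { model, call } pair where call(prompt) returns a Promise<string>, and it takes the formatting function as a parameter so the example stays self-contained.

```javascript
// Sketch of provider failover with per-model prompt formatting.
async function completeWithFailover(content, providers, formatForModel) {
  const errors = [];
  for (const { model, call } of providers) {
    try {
      const output = await call(formatForModel(content, model));
      return { model, output };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push({ model, message: err.message });
    }
  }
  throw new Error(`All providers failed: ${JSON.stringify(errors)}`);
}
```

Logging which model actually served each request matters here, because output quality can differ between the primary and the fallback.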

Conclusion

Prompt engineering for production is software engineering. It requires version control, testing, monitoring, and iterative improvement. The prompt itself is often the least interesting part; the infrastructure around it is what determines success or failure.

If you are building your first LLM-powered feature, invest in evaluation infrastructure before you invest in prompt cleverness. Build the logging pipeline. Create a test set from real examples. Set up the evaluation script. Then iterate on the prompt with confidence that you can measure the impact of every change.

The teams shipping the best AI features are not the ones with the most creative prompts. They are the ones who treat prompt engineering as a measurable, iterative engineering discipline, with the same rigor they apply to code, infrastructure, and product quality. Start with the flywheel, and the prompts will improve themselves.
