OpenAI vs Anthropic vs Open Source: Choosing Your LLM Provider

Choosing an LLM provider is one of the most consequential technical decisions you will make in 2023. It affects your cost structure, latency profile, compliance posture, and feature velocity for years to come. It is also a decision that most teams make badly, either by defaulting to OpenAI because it is the most familiar or by chasing the latest benchmark leader without considering operational requirements.

HARBOR SOFTWARE · Engineering Insights

We have run production workloads across OpenAI, Anthropic, and several open-source models at Harbor Software. Here is a practical, opinionated comparison based on real usage, production incidents, and actual cost data rather than synthetic benchmarks.

OpenAI: The Default Choice and When It Is Wrong

OpenAI has the largest ecosystem, the most tutorials, and the broadest model selection. GPT-4 is genuinely excellent for complex reasoning tasks. GPT-3.5-turbo is fast and cheap for simpler tasks. The function calling API is the most mature implementation of structured tool use available. If you are building your first LLM feature, OpenAI is the path of least resistance.

Where OpenAI excels in our production experience:

  • Code generation. GPT-4 produces better code than any other model we have tested, particularly for complex multi-file changes and architectural reasoning. It understands code context deeply and generates syntactically correct, idiomatic code in most languages. We measured a 23% higher pass rate on our code generation eval suite compared to Claude 2.
  • Function calling. The structured output via function definitions is reliable and well-documented. JSON mode with response_format: { type: "json_object" } works consistently and eliminates an entire class of parsing errors. No other provider has an equivalent that is as reliable.
  • Ecosystem breadth. Every LLM tool, framework, and tutorial defaults to OpenAI. This reduces onboarding time for new developers. When you Google an LLM integration question, the first 10 results use OpenAI. This matters for team velocity.
  • Batch API. For offline processing, the batch API offers 50% cost reduction with 24-hour turnaround. This is a significant advantage for workloads that can tolerate latency.
  • Model variety. From GPT-3.5-turbo for cheap classification to GPT-4 for complex reasoning, plus embeddings, Whisper for audio, and DALL-E for images. One provider, one billing relationship, one set of API patterns.

Where OpenAI falls short in production:

  • Content refusals. GPT-4 refuses requests that are clearly benign with frustrating frequency. We have seen it refuse to generate fictional medical scenarios for a healthcare training application, refuse to write marketing copy that mentions competitors by name, and refuse to summarize news articles about violence for a media monitoring tool. Each refusal is a production failure that requires prompt engineering workarounds. We track refusal rate as a metric, and it hovers around 2.5% for our mixed workload.
  • Rate limits. Even on paid tiers, the rate limits are aggressive. At scale, you will hit them and need to implement retry logic with exponential backoff. The tier system (Tier 1 through Tier 5) means you are gated on spending history to unlock higher limits. A new account processing 10,000 requests per day will hit rate limits immediately.
  • Pricing volatility. OpenAI changes pricing frequently. GPT-4 launched at $0.03/$0.06 per 1K tokens (input/output), and budgeting around any list price is difficult when it could change quarterly. We build in a 20% pricing buffer for financial planning.
  • Downtime. OpenAI has had multiple significant outages in 2023, including a 4-hour outage in November that affected all API customers. If your application depends entirely on OpenAI, your uptime is capped at their uptime. We measured 99.7% uptime over 6 months, which sounds good until you calculate that 0.3% of six months is roughly 13 hours of downtime.
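
The retry logic mentioned under rate limits can be sketched as follows. This is a minimal example rather than a drop-in client: `withBackoff`, its option names, and the assumption that errors expose the HTTP status as `err.status` are all illustrative, so adapt the `isRetryable` check to whatever your client library actually throws.

```javascript
// Retry an async call with exponential backoff and full jitter.
// All option names here are illustrative, not part of any provider SDK.
async function withBackoff(fn, options = {}) {
  const {
    maxRetries = 5,
    baseDelayMs = 500,
    maxDelayMs = 30000,
    // Assumption: the client surfaces the HTTP status on `err.status`
    isRetryable = (err) => err.status === 429 || err.status >= 500,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-retryable errors or once the retry budget is spent
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      // The cap doubles each attempt: 500 ms, 1 s, 2 s, ... up to maxDelayMs
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      // Full jitter: wait a random time in [0, cap)
      await sleep(Math.random() * cap);
    }
  }
}
```

Full jitter (a random wait between zero and the current cap) is one common choice; it keeps workers that were throttled at the same moment from retrying in lockstep.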

Concrete pricing as of August 2023 for reference:

GPT-4:               $0.03   / 1K input tokens,  $0.06  / 1K output tokens
GPT-4-32K:           $0.06   / 1K input tokens,  $0.12  / 1K output tokens
GPT-3.5-turbo:       $0.0015 / 1K input tokens,  $0.002 / 1K output tokens
GPT-3.5-turbo-16K:   $0.003  / 1K input tokens,  $0.004 / 1K output tokens
ada-002 embeddings:  $0.0001 / 1K tokens
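
To turn the list above into a budget figure, a small calculator helps. The sketch below hard-codes the August 2023 chat-model prices from the list; the function names and the example workload (10,000 requests per day at 1,500 input and 500 output tokens each) are hypothetical.

```javascript
// USD per 1K tokens, copied from the August 2023 price list above
const PRICES_PER_1K = {
  'gpt-4': { input: 0.03, output: 0.06 },
  'gpt-4-32k': { input: 0.06, output: 0.12 },
  'gpt-3.5-turbo': { input: 0.0015, output: 0.002 },
  'gpt-3.5-turbo-16k': { input: 0.003, output: 0.004 },
};

// Cost of a single request in USD
function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES_PER_1K[model];
  if (!price) throw new Error(`No price on file for model: ${model}`);
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// Cost of a steady workload over a month (30 days by default)
function monthlyCostUsd(model, requestsPerDay, inputTokens, outputTokens, days = 30) {
  return estimateCostUsd(model, inputTokens, outputTokens) * requestsPerDay * days;
}
```

For that hypothetical workload, `monthlyCostUsd('gpt-4', 10000, 1500, 500)` works out to about $22,500 per month, against about $975 for `gpt-3.5-turbo`.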

Anthropic Claude: The Underrated Alternative

Anthropic’s Claude has improved dramatically over the past year. Claude 2 is competitive with GPT-4 on most tasks and superior on several. Anthropic’s approach to safety is more nuanced than OpenAI’s, resulting in fewer frustrating refusals for legitimate use cases. If you have dismissed Claude as “the other one,” it is time to re-evaluate.

Where Claude excels in our testing:

  • Long context. Claude supports 100K tokens natively. This is not a gimmick. For document analysis, legal review, code review across large files, and conversation-heavy applications, 100K context is transformative. GPT-4’s 8K context (or 32K at 2x the price) is genuinely limiting for many production use cases. We have a contract analysis feature that processes 40-page documents in a single call with Claude. With GPT-4, we had to chunk the document and make 5 separate calls, losing cross-reference context.
  • Instruction following. Claude is remarkably good at following complex, multi-part instructions. If your prompt says “respond in exactly this format, never include explanations, always include these three fields, and never use the word ‘however’,” Claude does it more consistently than GPT-4. We measured 97% format compliance for Claude vs 91% for GPT-4 on a structured extraction benchmark.
  • XML handling. Claude processes XML-structured prompts extremely well. If your domain involves structured data or if you prefer to structure your prompts with explicit boundaries, Claude handles it natively without the JSON gymnastics that other models sometimes require.
  • Fewer content refusals. For legitimate business applications, Claude refuses less often than GPT-4. It understands context better and does not over-trigger safety filters on benign content. Our refusal rate with Claude is 0.8% compared to 2.5% with GPT-4.
  • Conversational quality. For chat-style applications where natural, helpful responses matter, Claude produces warmer, more nuanced responses. This is subjective but consistently reflected in user preference studies we have conducted.
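
For comparison, this is roughly what the chunking workaround looks like when a document does not fit in the context window. The sketch is an assumption about a typical approach, not the code behind the contract analysis feature; `chunkTokens` and `roughTokenize` are illustrative names, and whitespace splitting is only a crude stand-in for a real tokenizer.

```javascript
// Split a token array into windows of at most `maxTokens`, repeating `overlap`
// tokens between neighbours so text near a boundary appears in both chunks.
function chunkTokens(tokens, maxTokens, overlap = 0) {
  if (maxTokens <= 0 || overlap < 0 || overlap >= maxTokens) {
    throw new RangeError('Require maxTokens > 0 and 0 <= overlap < maxTokens');
  }
  const chunks = [];
  const step = maxTokens - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens));
    // Stop once this window has reached the end of the input
    if (start + maxTokens >= tokens.length) break;
  }
  return chunks;
}

// Crude stand-in for a tokenizer: whitespace-separated words
const roughTokenize = (text) => text.split(/\s+/).filter(Boolean);
```

Overlap softens the boundary problem but does not remove it: a clause on page 3 that modifies a definition on page 30 still lands in different calls.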

Where Claude falls short:

  • Code generation quality. Claude 2 produces good code but noticeably worse than GPT-4 for complex programming tasks. The gap is particularly evident in multi-step reasoning about code architecture and in generating correct code for less common languages and frameworks.
  • Ecosystem maturity. Fewer tutorials, fewer community integrations, fewer Stack Overflow answers. You will be relying more on official documentation and your own experimentation. This adds friction for junior developers on your team.
  • API features. No equivalent to OpenAI’s function calling as of this writing. You need to handle tool use through prompt engineering rather than API-level support. This works well (Claude follows XML-based tool schemas reliably) but requires more engineering effort upfront.
  • Streaming reliability. We have experienced more streaming interruptions with Claude than with OpenAI. Approximately 1 in 500 streaming requests disconnects mid-stream, requiring retry logic. The API is less battle-tested at high scale.
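
Prompt-level tool use of the kind described above usually comes down to asking the model to emit a tagged block and parsing it yourself. In the sketch below, the tag names (`tool_call`, `name`, `arguments`) are a hypothetical convention you would define in your own prompt, not an Anthropic API feature, and the regex assumes the model does not nest these tags.

```javascript
// Hypothetical convention, defined in the prompt:
//   <tool_call><name>...</name><arguments>{...json...}</arguments></tool_call>
function parseToolCalls(completion) {
  const pattern =
    /<tool_call>\s*<name>([\s\S]*?)<\/name>\s*<arguments>([\s\S]*?)<\/arguments>\s*<\/tool_call>/g;
  const calls = [];
  for (const match of completion.matchAll(pattern)) {
    const name = match[1].trim();
    try {
      calls.push({ name, args: JSON.parse(match[2]) });
    } catch (err) {
      // Report malformed JSON instead of throwing so the caller can re-prompt
      calls.push({ name, error: `Invalid arguments JSON: ${err.message}` });
    }
  }
  return calls;
}
```

Returning an `error` entry rather than throwing is a design choice: it lets the caller feed the parse failure back to the model and ask for a corrected block.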

Open Source: Control at a Cost

Open-source models (Llama 2, Mistral, Falcon) offer something the commercial APIs cannot: complete control over your infrastructure, data, and costs. But this control comes with significant engineering overhead that is easy to underestimate.

Where open source makes compelling sense:

  • Data sovereignty. If your data cannot leave your infrastructure (healthcare under HIPAA, finance under SOX, government under FedRAMP), open-source models are often your only option. No API call means no data transmitted to a third party, no data processing agreement needed, no compliance risk from a provider’s changing terms of service.
  • Cost at extreme scale. Once you amortize the infrastructure cost, the per-token cost of self-hosted models approaches zero. At millions of requests per day, this is compelling. Our calculations show the crossover point where self-hosting becomes cheaper than API calls is around 10-15 million tokens per day for a 7B model on an A100 GPU, including the fully loaded cost of the engineer maintaining the infrastructure.
  • Fine-tuning for your domain. Fine-tuning on your domain data can produce a specialized model that outperforms GPT-4 on your specific task while running on hardware you control. This requires significant ML engineering expertise but delivers real results for narrow domains. We fine-tuned a Llama 2 7B model for a specific classification task and it outperformed GPT-4 by 8% while running at 50x lower cost.
  • Latency control. A self-hosted model on local GPUs has no network round-trip. For latency-sensitive applications (real-time autocomplete, inline suggestions), sub-50ms inference is achievable with optimized serving but impossible with remote APIs.

Where open source creates real operational pain:

  • Quality gap on general tasks. Llama 2 70B is impressive but does not match GPT-4 or Claude 2 on complex reasoning tasks. For straightforward classification and extraction, the gap is small (1-3%). For multi-step reasoning, creative generation, and nuanced understanding, the gap is large (15-25% on our eval suites).
  • Operational burden. Running GPU inference at scale requires expertise in CUDA, model serving frameworks (vLLM, TGI, TensorRT-LLM), load balancing, health monitoring, GPU procurement, and capacity planning. This is a full-time job for at least one senior engineer. If that person leaves, you have a critical knowledge gap.
  • No safety net. If your self-hosted model produces harmful, biased, or factually incorrect output, there is no provider-level content filter. You build your own safety layer or accept the risk. This is a significant liability concern that your legal team will care about.
  • Model updates. When a new model version releases, you need to download it, benchmark it, fine-tune it (if applicable), deploy it, and monitor the transition. With an API provider, you change a model string.

Self-Hosting Cost Breakdown

Llama 2 70B on AWS:
- Instance: p4d.24xlarge (8x A100 40GB)
- On-demand: $32.77/hour = $23,594/month
- 1-year reserved: ~$15,000/month
- Engineering time (0.5 FTE): ~$8,000/month (fully loaded)
- Total self-hosted (reserved instance + engineering): ~$23,000/month

- Throughput: ~50 tokens/second per concurrent request
- At 20 concurrent requests: ~1,000 tokens/second = 60K tokens/minute = 86.4M tokens/day

Equivalent GPT-4 API cost for 86.4M tokens/day:
- Input (assuming 50/50 split): 43.2M tokens * $0.03/1K = $1,296/day
- Output: 43.2M tokens * $0.06/1K = $2,592/day
- Total: $3,888/day = $116,640/month

Self-hosting saves: ~$94K/month at this scale

Break-even point (where self-hosting becomes cheaper):
- ~15M tokens/day for Llama 2 70B
- ~5M tokens/day for Llama 2 7B (cheaper instance)

These numbers are simplified but directionally correct. The savings are real at scale, but most teams are not operating at this scale. Below 10M tokens per day, the API is almost certainly cheaper when you factor in engineering time, operational risk, and the opportunity cost of that engineer working on product features instead of infrastructure.
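
The break-even estimate can be reproduced with a few lines of arithmetic. This sketch assumes a fixed monthly self-hosting bill, a 30-day month, and a comparison against GPT-4 list prices; the function and parameter names are illustrative, and it ignores whether the hardware can actually serve the resulting volume.

```javascript
// Tokens per day at which a fixed monthly self-hosting bill equals API spend.
function breakEvenTokensPerDay({
  selfHostedMonthlyUsd,
  inputPricePer1K,
  outputPricePer1K,
  inputShare = 0.5, // fraction of tokens that are input
  daysPerMonth = 30,
}) {
  // Blended API price for one token at the given input/output mix
  const blendedPerToken =
    (inputShare * inputPricePer1K + (1 - inputShare) * outputPricePer1K) / 1000;
  return selfHostedMonthlyUsd / daysPerMonth / blendedPerToken;
}

const llama70bBreakEven = breakEvenTokensPerDay({
  selfHostedMonthlyUsd: 23000, // reserved instance + 0.5 FTE, from the breakdown
  inputPricePer1K: 0.03,
  outputPricePer1K: 0.06,
});
```

With those inputs the result is about 17M tokens per day, in the same range as the ~15M figure quoted for Llama 2 70B. Comparing against GPT-3.5-turbo prices instead pushes the break-even far higher, which is why the comparison model matters as much as the hardware bill.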

Our Recommended Architecture: Multi-Provider with Routing

The right answer for most production systems is not a single provider. It is a routing layer that directs requests to the optimal provider based on the task characteristics, cost constraints, and availability requirements. This is not theoretical; we run this architecture in production.

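class LLMRouter {
  constructor(providers) {
    this.providers = providers; // { openai, anthropic, local }
    // name -> { requests, errors }, updated by recordResult()
    this.healthChecks = new Map();
  }

  // Call after every provider request so health reflects real traffic
  recordResult(name, ok) {
    const stats = this.healthChecks.get(name) ?? { requests: 0, errors: 0 };
    stats.requests += 1;
    if (!ok) stats.errors += 1;
    this.healthChecks.set(name, stats);
  }

  // A provider is healthy until its observed error rate exceeds 5%
  isHealthy(name) {
    const stats = this.healthChecks.get(name);
    return !stats || stats.requests === 0 || stats.errors / stats.requests <= 0.05;
  }

  getAvailableProviders() {
    return Object.fromEntries(
      Object.entries(this.providers).filter(([name]) => this.isHealthy(name))
    );
  }

  // Always resolves to { provider, strategy } so callers handle one shape
  async route(request) {
    const { task, maxLatency, inputTokens } = request;

    // Failover: unhealthy providers are undefined here, so each branch falls through
    const { openai, anthropic, local } = this.getAvailableProviders();
    const pick = (provider, strategy = 'single') => ({ provider, strategy });

    // Cheap, fast tier: local model first, then GPT-3.5-turbo, then Claude
    const cheap = () => {
      if (local) return pick(local);
      if (openai) return pick(openai.model('gpt-3.5-turbo'));
      if (anthropic) return pick(anthropic.model('claude-2'));
      throw new Error('No healthy LLM provider available');
    };

    // Classification tasks: use cheapest available
    if (task === 'classification') return cheap();

    // Long document analysis: Claude's 100K context
    if (inputTokens > 8000) {
      if (anthropic) return pick(anthropic.model('claude-2'));
      // Fallback: chunk for GPT-4
      if (openai) return pick(openai.model('gpt-4'), 'chunk');
    }

    // Code generation: GPT-4, failing over to Claude 2
    if (task === 'code_generation') {
      if (openai) return pick(openai.model('gpt-4'));
      if (anthropic) return pick(anthropic.model('claude-2'));
    }

    // Latency-sensitive: use local model or GPT-3.5-turbo
    if (maxLatency && maxLatency < 2000) return cheap();

    // Default: GPT-3.5-turbo for cost efficiency
    if (openai) return pick(openai.model('gpt-3.5-turbo'));
    return cheap();
  }
}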

This is a simplified example, but the principle is sound. Different models have different strengths. A routing layer lets you exploit those strengths while managing cost and maintaining availability even when a provider has an outage.

The Evaluation Framework

Do not choose a provider based on benchmarks, blog posts, or Twitter hype. Evaluate on your actual workload with your actual data. Here is the framework we use for every provider evaluation:

  1. Collect 500+ real examples from your application domain. Label the expected outputs with your domain experts. Include easy cases, hard cases, and adversarial cases.
  2. Run each provider against your full test set. Measure accuracy (exact match and fuzzy), latency (p50, p95, p99), and total cost for the evaluation run.
  3. Test edge cases explicitly: very long inputs, adversarial inputs, multilingual content, ambiguous queries, empty inputs, inputs with special characters.
  4. Measure reliability over a full week of continuous testing: error rates, timeout rates, rate limit hits, response consistency (same input, same output?).
  5. Calculate total cost of ownership including engineering time for integration, SDK maintenance, prompt adaptation per provider, and monitoring overhead. Per-token pricing is only one component.

This evaluation takes about two weeks of engineering time. It is worth it. The wrong provider choice costs months of workarounds, prompt engineering hacks, and reliability firefighting.
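
Steps 1 and 2 of the framework reduce to a loop over labelled examples. The harness below is a minimal sketch under several assumptions: `callModel` is a hypothetical async wrapper around one provider that takes an input string and resolves to the output string, scoring is exact match only, and latency is recorded for successful calls using a nearest-rank percentile.

```javascript
// Nearest-rank percentile; multiplying before dividing avoids float rounding
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// `callModel`: hypothetical async (input) => string wrapper around one provider.
// `now` is injectable so the harness can be tested without real clocks.
async function evaluateProvider(callModel, examples, now = () => Date.now()) {
  const latencies = [];
  let correct = 0;
  let errors = 0;
  for (const { input, expected } of examples) {
    const start = now();
    try {
      const output = await callModel(input);
      latencies.push(now() - start);
      if (output.trim() === expected.trim()) correct += 1;
    } catch {
      errors += 1; // timeouts, rate limits, and refusals surfaced as exceptions
    }
  }
  return {
    accuracy: correct / examples.length,
    errorRate: errors / examples.length,
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
  };
}
```

Running the same `examples` array through one `callModel` wrapper per provider yields directly comparable rows. Fuzzy matching and cost tracking would slot in next to the exact-match check.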

Future-Proofing Your Decision

The LLM landscape is changing quarterly. Models that are state-of-the-art today will be surpassed within months. New providers will emerge. Pricing will drop. The most important architectural decision is not which provider you choose today, but how easily you can switch tomorrow.

Abstraction layers help but add complexity. Our recommendation: use a thin abstraction that normalizes the message format and response parsing, but do not try to abstract away provider-specific features (function calling, XML-structured prompting, etc.). Accept that switching providers requires some prompt rewriting, and invest in the evaluation infrastructure that makes the rewrite safe and measurable.
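
A thin abstraction in this sense might be no more than a pair of request builders and a response parser. The sketch below targets OpenAI's chat completions payload and the prompt-string format used by Anthropic's text completions API for Claude 2. The neutral message shape, the function names, and the choice to fold system text into the first Human turn are assumptions made for illustration.

```javascript
// Neutral format: [{ role: 'system' | 'user' | 'assistant', content: string }]

function toOpenAIRequest(messages, model = 'gpt-4') {
  // Chat completions accept role/content pairs directly
  return { model, messages: messages.map(({ role, content }) => ({ role, content })) };
}

function toAnthropicRequest(messages, model = 'claude-2', maxTokens = 1024) {
  // Text completions take one prompt string of alternating "\n\nHuman:" and
  // "\n\nAssistant:" turns that ends with an open Assistant turn.
  const system = messages
    .filter((m) => m.role === 'system')
    .map((m) => m.content)
    .join('\n');
  const prompt = messages
    .filter((m) => m.role !== 'system')
    .map((m, i) => {
      const speaker = m.role === 'user' ? 'Human' : 'Assistant';
      // Assumption: system text is folded into the first turn
      const prefix = i === 0 && system ? `${system}\n\n` : '';
      return `\n\n${speaker}: ${prefix}${m.content}`;
    })
    .join('');
  return { model, prompt: `${prompt}\n\nAssistant:`, max_tokens_to_sample: maxTokens };
}

// Pull the generated text out of each provider's response body
function extractText(provider, body) {
  if (provider === 'openai') return body.choices[0].message.content;
  if (provider === 'anthropic') return body.completion;
  throw new Error(`Unknown provider: ${provider}`);
}
```

Everything provider-specific (function definitions, tool tags, stop sequences) stays outside these helpers, which is what keeps the layer thin.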

The LLM provider market will consolidate and mature over the next 2-3 years. Prices will drop as competition intensifies. Quality will converge as smaller companies catch up to the frontier. The teams that build provider-agnostic evaluation infrastructure today will be best positioned to capitalize on whatever comes next, switching to better or cheaper models with confidence rather than anxiety.

Practical Recommendations by Use Case

Based on running production workloads across all three provider categories, here are our specific recommendations for common use cases:

  • Customer support classification and routing: Start with GPT-3.5-turbo. It handles 90%+ of classification tasks correctly at 20x lower cost than GPT-4. Escalate ambiguous cases to GPT-4 using a cascade router. Expected cost: $0.002-0.005 per classification.
  • Document Q&A and RAG: Use Claude for its 100K context window. Load the full document in a single call instead of chunking. For shorter documents under 8K tokens, GPT-4 with function calling provides more structured responses.
  • Code generation and review: GPT-4 remains the clear winner for code quality. Use it selectively for complex generation tasks and GPT-3.5-turbo for simple code modifications, formatting, and documentation generation.
  • Content generation at scale: Start with GPT-3.5-turbo for first drafts. Use GPT-4 for editorial refinement of the top 20% of content pieces. This produces near-GPT-4 quality at 80% lower cost.
  • Regulated industries with data sovereignty: Self-host Llama 2 or Mistral. Accept the quality gap and compensate with fine-tuning on domain-specific data. The compliance benefits outweigh the quality costs.
  • Real-time user-facing features: GPT-3.5-turbo or self-hosted models for latency-sensitive paths. GPT-4’s latency (3-10 seconds for complex responses) is too slow for inline suggestions and autocomplete features.
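
The cascade router mentioned for support classification can be sketched in a few lines. Here `cheapModel` and `strongModel` are hypothetical async functions that resolve to `{ label, confidence }`. How you obtain a confidence score (for example, asking the model to self-report one) is an assumption left to the caller, and the 0.8 threshold is a placeholder to tune against your eval set.

```javascript
// Try the cheap model first; escalate only when it is not confident enough.
// `cheapModel` and `strongModel` are hypothetical:
//   async (text) => ({ label: string, confidence: number in [0, 1] })
async function cascadeClassify(text, { cheapModel, strongModel, threshold = 0.8 }) {
  const first = await cheapModel(text);
  if (first.confidence >= threshold) {
    return { ...first, escalated: false };
  }
  // Ambiguous case: pay for the stronger model
  const second = await strongModel(text);
  return { ...second, escalated: true };
}
```

Logging the `escalated` flag gives you the escalation rate, which drives the blended cost per classification.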

The common thread across all recommendations: no single provider is best for everything. The winning strategy is always a portfolio approach with intelligent routing based on task requirements, quality thresholds, cost constraints, and latency budgets.

Conclusion

The LLM provider landscape will look very different in 12 months. New models will launch. Prices will drop. Features like function calling and long context that currently differentiate providers will become table stakes. The decision you make today is not permanent, but the architecture you build around that decision will persist for years.

Build for multi-provider from the start, even if you only use one provider initially. Invest in evaluation infrastructure that lets you measure provider performance on your specific workload, not on generic benchmarks. And keep your abstraction layer thin enough that switching providers is a one-week project, not a three-month rewrite. The provider you choose matters less than your ability to change that choice when the landscape shifts.
