Multi-Model AI Architectures: Routing, Fallbacks, and Cost Control
Using a single AI model for every task is the equivalent of using a single database for every data pattern. It works until it does not. GPT-4o is excellent at complex reasoning but costs 15x more than GPT-4o-mini for simple classification tasks. Claude 3.5 Sonnet produces better long-form writing than GPT-4o for our use case, but its tool-calling reliability is lower for our agent workflows. Gemini 1.5 Pro handles 1M-token contexts that no other model can match, but its structured output adherence is less consistent. Every model has strengths and weaknesses, and production AI applications that serve diverse queries need architectures that route to the right model for each request. At Harbor Software, we run a multi-model architecture that routes across 4 providers and 7 models. Here is how we built it and what we learned.
Why Single-Model Architectures Break Down
We started with GPT-4 for everything. It was simple: one API key, one SDK, one set of prompt templates, one pricing model to track. Within 3 months, we hit three problems that could not be solved by staying with a single model:
Cost concentration. Our monthly OpenAI bill was $14,200. Analysis showed that 68% of our requests were simple tasks (text classification, entity extraction, yes/no questions, summarization) that a smaller model could handle with equivalent accuracy. Routing those to GPT-4o-mini would save $7,800/month — more than the engineering cost of building the routing layer. The math was unambiguous: we were spending $9,600/month on GPT-4 for tasks where GPT-4o-mini produced identical results (we verified this by running both models on 10,000 historical requests and comparing outputs).
Single point of failure. OpenAI had three outages in one month that affected our production application. Each outage lasted 15-45 minutes. Our uptime dropped to 99.2% purely because of provider downtime — our own infrastructure was fine. With a fallback provider, we could have maintained service during outages. Our SLA with enterprise customers guarantees 99.9% uptime, and a single provider dependency made that SLA impossible to meet.
Task-specific quality gaps. GPT-4 was mediocre at generating structured JSON that exactly matched our TypeScript interfaces. It would add extra fields, omit optional fields, or produce syntactically valid but semantically incorrect structures. Claude was significantly better at this specific task — 94% schema adherence vs. GPT-4’s 81% in our evaluation suite of 500 structured output requests. Using the best model per task improved overall output quality measurably.
The Router: Classifying Requests Before Processing Them
The core of a multi-model architecture is the router — a lightweight system that examines each incoming request and decides which model should handle it. Our router is a two-stage system: a rules-based first pass for cases with clear routing logic and a classifier-based second pass for ambiguous cases.
// model-router.ts — two-stage routing
interface RoutingDecision {
  model: string;
  provider: string;
  reason: string;
  confidence: number;
}

class ModelRouter {
  // Stage 1: Rules-based routing (deterministic, zero-latency)
  private applyRules(request: AIRequest): RoutingDecision | null {
    // Explicit model override from caller → respect it always,
    // so it is checked before every other rule
    if (request.preferredModel) {
      return {
        model: request.preferredModel,
        provider: this.providerFor(request.preferredModel),
        reason: 'user_override',
        confidence: 1.0
      };
    }
    // Long context → Gemini (only model that handles 1M tokens)
    if (request.estimatedTokens > 100_000) {
      return {
        model: 'gemini-1.5-pro',
        provider: 'google',
        reason: 'context_length_exceeds_128k',
        confidence: 1.0
      };
    }
    // Structured output with strict schema → Claude (best schema adherence)
    if (request.responseFormat === 'json_schema') {
      return {
        model: 'claude-3-5-sonnet-20241022',
        provider: 'anthropic',
        reason: 'structured_output_requirement',
        confidence: 0.9
      };
    }
    // Image input → GPT-4o (best multimodal quality/cost ratio)
    if (request.hasImageInput) {
      return {
        model: 'gpt-4o',
        provider: 'openai',
        reason: 'multimodal_input',
        confidence: 0.85
      };
    }
    return null; // Fall through to classifier
  }

  // Stage 2: Complexity classifier (lightweight ML model, not an LLM)
  private async classifyComplexity(request: AIRequest): Promise<RoutingDecision> {
    const features = this.extractFeatures(request);
    const complexity = await this.complexityClassifier.predict(features);
    if (complexity === 'simple') {
      return {
        model: 'gpt-4o-mini',
        provider: 'openai',
        reason: 'low_complexity_classification',
        confidence: 0.8
      };
    } else if (complexity === 'medium') {
      return {
        model: 'claude-3-5-haiku-20241022',
        provider: 'anthropic',
        reason: 'medium_complexity_classification',
        confidence: 0.75
      };
    } else {
      return {
        model: 'gpt-4o',
        provider: 'openai',
        reason: 'high_complexity_classification',
        confidence: 0.7
      };
    }
  }

  async route(request: AIRequest): Promise<RoutingDecision> {
    const ruleDecision = this.applyRules(request);
    if (ruleDecision) return ruleDecision;
    return this.classifyComplexity(request);
  }
}
The rules-based stage handles cases where the routing decision is obvious: long contexts go to Gemini (it is the only option), structured output goes to Claude (measurably better), images go to GPT-4o (best multimodal). These rules are deterministic, add zero latency, and handle about 35% of our traffic. The remaining 65% goes to the complexity classifier.
The complexity classifier is a small gradient-boosted tree model (XGBoost, not an LLM — using an LLM to decide which LLM to use would be absurd both in cost and latency). It is trained on 50,000 historical request-response pairs labeled by quality evaluators. The features are lightweight and fast to compute: prompt token count, instruction word count, presence of domain-specific keywords (“analyze”, “compare”, “summarize” vs. “classify”, “extract”, “yes or no”), number of constraints in the prompt (“must include”, “do not”, “at most”), and whether the prompt contains few-shot examples. Classification takes under 2ms and adds negligible overhead to the request pipeline.
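To make the feature set concrete, here is a minimal sketch of what the extraction step could look like. The feature names, keyword lists, and the chars-per-token heuristic are illustrative assumptions, not Harbor's actual implementation:

```typescript
// Hypothetical sketch of the classifier's feature extraction.
// All names and heuristics here are illustrative.
interface ComplexityFeatures {
  promptTokens: number;        // rough estimate: ~4 characters per token
  reasoningKeywords: number;   // how many reasoning-style phrases appear
  extractionKeywords: number;  // how many simple-task phrases appear
  constraintCount: number;     // how many constraint phrases appear
  hasFewShotExamples: boolean; // inline "Example 1:" style blocks present
}

const REASONING_WORDS = ['analyze', 'compare', 'summarize', 'explain'];
const EXTRACTION_WORDS = ['classify', 'extract', 'yes or no', 'label'];
const CONSTRAINT_PHRASES = ['must include', 'do not', 'at most', 'exactly'];

function extractFeatures(prompt: string): ComplexityFeatures {
  const lower = prompt.toLowerCase();
  // Counts how many distinct phrases from the list appear in the prompt
  const countPresent = (phrases: string[]) =>
    phrases.reduce((n, p) => n + (lower.includes(p) ? 1 : 0), 0);
  return {
    promptTokens: Math.ceil(prompt.length / 4),
    reasoningKeywords: countPresent(REASONING_WORDS),
    extractionKeywords: countPresent(EXTRACTION_WORDS),
    constraintCount: countPresent(CONSTRAINT_PHRASES),
    hasFewShotExamples: /example\s*\d*\s*:/i.test(prompt)
  };
}
```

The important design property is that every feature is a string scan or a length computation, which is why the full extraction plus tree inference stays under 2ms.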
We retrain the classifier monthly using a feedback loop: when the quality evaluation system (LLM-as-judge) flags a response as low quality, we record the routing decision that led to it. If a simple-classified request produced a poor response on GPT-4o-mini, we relabel it as medium or complex and include it in the next training batch. This feedback loop has improved routing accuracy from 72% to 89% over 6 months.
The Fallback Chain: Surviving Provider Outages
Every model selection comes with a fallback chain. If the primary model fails (API error, timeout, rate limit), the request automatically routes to the next model in the chain. The fallback chain is model-specific because the best fallback depends on the task characteristics:
// fallback-config.ts — provider-specific fallback chains
const FALLBACK_CHAINS: Record<string, string[]> = {
  // Complex reasoning: GPT-4o → Claude Sonnet → Gemini Pro
  'gpt-4o': ['claude-3-5-sonnet-20241022', 'gemini-1.5-pro'],
  // Simple tasks: GPT-4o-mini → Claude Haiku → Gemini Flash
  'gpt-4o-mini': ['claude-3-5-haiku-20241022', 'gemini-1.5-flash'],
  // Structured output: Claude Sonnet → GPT-4o (with strict mode) → Gemini Pro
  'claude-3-5-sonnet-20241022': ['gpt-4o', 'gemini-1.5-pro'],
  // Long context: Gemini Pro → Claude Sonnet (200K limit) → truncate + GPT-4o
  'gemini-1.5-pro': ['claude-3-5-sonnet-20241022', 'gpt-4o']
};

async function callWithFallback(
  request: AIRequest,
  primaryModel: string
): Promise<AIResponse> {
  const chain = [primaryModel, ...(FALLBACK_CHAINS[primaryModel] || [])];
  for (let i = 0; i < chain.length; i++) {
    const model = chain[i];
    const provider = providerFor(model);
    try {
      const response = await callModel(provider, model, request, {
        timeout: 30_000,
        retries: 1 // One retry on the same model before falling back
      });
      if (i > 0) {
        metrics.increment('model.fallback_used', {
          primary: primaryModel,
          fallback: model,
          reason: 'primary_failed',
          position: i
        });
      }
      return response;
    } catch (error) {
      const errorType = classifyError(error);
      metrics.increment('model.call_failed', { model, error: errorType });
      // Don't fall back on authentication errors — they'll fail everywhere
      if (errorType === 'auth_error') {
        throw error;
      }
      if (i === chain.length - 1) {
        throw new AllModelsFailedError(chain, error);
      }
      // Continue to next model in chain
    }
  }
  throw new Error('Unreachable');
}
The fallback chain has maintained our AI feature uptime at 99.95% over the past 6 months, compared to 99.2% when we used a single provider. During OpenAI’s outage on March 12, our traffic automatically routed to Claude and Gemini within 2 seconds of the first failed request. Users saw no interruption — the response came from a different model, but the quality was comparable. The cost during failover periods is slightly higher (fallback models are not always the cheapest option for a given task), but the uptime improvement is worth the 5-15% cost increase during the rare failover windows.
One important detail: we use a circuit breaker pattern per provider. If a provider has failed 3 consecutive requests within a 30-second window, we “open” the circuit and skip it entirely for the next 60 seconds, going directly to the fallback. This prevents slow failures (where each request hangs for 30 seconds before timing out) from cascading into user-visible latency. Without the circuit breaker, a provider that is returning errors slowly (10-second timeout before each error) would add 10+ seconds to every request before the fallback kicks in.
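A minimal circuit breaker along these lines can be sketched as follows. The thresholds match the text (3 consecutive failures in a 30-second window opens the circuit for 60 seconds); the class and method names are illustrative, not Harbor's production code:

```typescript
// Sketch of a per-provider circuit breaker with the thresholds described
// above. Names are illustrative.
class ProviderCircuitBreaker {
  private failures: number[] = []; // timestamps of consecutive failures
  private openedAt: number | null = null;

  constructor(
    private maxFailures = 3,
    private failureWindowMs = 30_000,
    private openDurationMs = 60_000
  ) {}

  // While open, the router skips this provider and goes straight to fallback
  isOpen(now = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.openDurationMs) {
      // Cool-down elapsed: close the circuit and start fresh
      this.openedAt = null;
      this.failures = [];
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = []; // any success resets the consecutive-failure count
  }

  recordFailure(now = Date.now()): void {
    // Keep only failures inside the rolling window, then add this one
    this.failures = this.failures.filter(t => now - t < this.failureWindowMs);
    this.failures.push(now);
    if (this.failures.length >= this.maxFailures) {
      this.openedAt = now;
    }
  }
}
```

The call site checks `isOpen()` before attempting a provider and calls `recordSuccess()`/`recordFailure()` after each attempt, so a slow-failing provider is skipped in microseconds instead of costing a timeout per request.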
Cost Control: Budget-Aware Routing
The router is not just optimizing for quality — it is optimizing for cost within quality constraints. We define cost profiles for different request types, and the router selects the cheapest model that meets the quality threshold for each profile:
// cost-aware-routing.ts
interface CostProfile {
  maxCostPerRequest: number; // USD
  qualityThreshold: number;  // 1-5 from eval suite
  latencyTargetMs: number;
}

const COST_PROFILES: Record<string, CostProfile> = {
  'internal-tool': {
    maxCostPerRequest: 0.005,
    qualityThreshold: 3.5,
    latencyTargetMs: 5000
  },
  'customer-facing': {
    maxCostPerRequest: 0.05,
    qualityThreshold: 4.0,
    latencyTargetMs: 3000
  },
  'premium-customer': {
    maxCostPerRequest: 0.20,
    qualityThreshold: 4.5,
    latencyTargetMs: 10000
  }
};

// Model quality scores — from offline evaluation on 2,000 test cases
const MODEL_QUALITY: Record<string, number> = {
  'gpt-4o': 4.5,
  'claude-3-5-sonnet-20241022': 4.4,
  'gemini-1.5-pro': 4.2,
  'claude-3-5-haiku-20241022': 3.9,
  'gpt-4o-mini': 3.8,
  'gemini-1.5-flash': 3.6
};

// Model cost per 1K tokens (blended input+output estimate)
const MODEL_COSTS: Record<string, number> = {
  'gpt-4o': 0.0075,
  'claude-3-5-sonnet-20241022': 0.009,
  'gemini-1.5-pro': 0.00525,
  'claude-3-5-haiku-20241022': 0.002,
  'gpt-4o-mini': 0.0003,
  'gemini-1.5-flash': 0.000375
};

function selectModelWithinBudget(
  estimatedTokens: number,
  profile: CostProfile,
  candidateModels: string[]
): string {
  // Filter to models that meet the quality threshold
  const qualityFiltered = candidateModels.filter(
    model => MODEL_QUALITY[model] >= profile.qualityThreshold
  );
  // Filter to models within budget
  const affordable = qualityFiltered.filter(model => {
    const estimatedCost = (estimatedTokens / 1000) * MODEL_COSTS[model];
    return estimatedCost <= profile.maxCostPerRequest;
  });
  if (affordable.length === 0) {
    // No model meets both constraints — hold the quality bar and accept the
    // budget overrun, taking the cheapest model that clears the threshold
    return qualityFiltered.sort(
      (a, b) => MODEL_COSTS[a] - MODEL_COSTS[b]
    )[0] || candidateModels[0];
  }
  // Among affordable models that meet quality, pick the cheapest
  return affordable.sort(
    (a, b) => MODEL_COSTS[a] - MODEL_COSTS[b]
  )[0];
}
This system reduced our monthly AI API costs from $14,200 to $5,900 — a 58% reduction — while maintaining overall quality scores within 3% of the all-GPT-4o baseline. The savings come almost entirely from routing simple tasks to smaller models. Our top cost-saving routes: customer support classification (GPT-4o → GPT-4o-mini, saved $2,800/month), metadata extraction (GPT-4o → Claude Haiku, saved $1,900/month), content summarization (GPT-4o → Gemini Flash, saved $1,400/month), and FAQ answering (GPT-4o → GPT-4o-mini, saved $1,100/month).
Prompt Compatibility: The Hidden Cost of Multi-Model
The biggest engineering challenge of multi-model architectures is prompt compatibility. Each model has different strengths, different system prompt behaviors, different tool-calling formats, and different sensitivity to prompt structure. A prompt optimized for GPT-4o may produce suboptimal results on Claude, and vice versa. This is not a theoretical concern — we measured it. Our customer support prompt scored 4.3 on GPT-4o but only 3.7 on Claude without adaptation, because Claude interpreted certain instruction phrasings differently.
We manage this with model-specific prompt adapters that translate between our canonical prompt format and each provider’s preferred format:
// prompt-adapter.ts — model-specific prompt transformations
interface PromptAdapter {
  adaptSystemPrompt(prompt: string): string;
  adaptToolDefinitions(tools: ToolDef[]): any;
  parseToolCalls(response: any): ToolCall[];
  adaptResponseFormat(format: ResponseFormat): any;
}

class OpenAIAdapter implements PromptAdapter {
  adaptSystemPrompt(prompt: string): string {
    return prompt; // OpenAI handles system prompts natively
  }
  adaptToolDefinitions(tools: ToolDef[]): any {
    return tools.map(t => ({
      type: 'function',
      function: {
        name: t.name,
        description: t.description,
        parameters: t.parameters,
        strict: true
      }
    }));
  }
  adaptResponseFormat(format: ResponseFormat): any {
    if (format.type === 'json_schema') {
      return {
        type: 'json_schema',
        json_schema: { name: format.name, schema: format.schema, strict: true }
      };
    }
    return { type: 'json_object' };
  }
  // parseToolCalls omitted for brevity
}

class AnthropicAdapter implements PromptAdapter {
  adaptSystemPrompt(prompt: string): string {
    // Claude benefits from more explicit constraint language
    return `${prompt}\n\nCritical: Follow the instructions precisely. Do not add information that was not requested. Do not omit requested information.`;
  }
  adaptToolDefinitions(tools: ToolDef[]): any {
    return tools.map(t => ({
      name: t.name,
      description: t.description,
      input_schema: t.parameters
    }));
  }
  adaptResponseFormat(format: ResponseFormat): any {
    // Anthropic uses the tool_use pattern for guaranteed structured output
    return {
      tool_choice: { type: 'tool', name: format.name },
      tools: [{
        name: format.name,
        description: `Generate output matching this exact structure: ${format.description}`,
        input_schema: format.schema
      }]
    };
  }
  // parseToolCalls omitted for brevity
}

class GeminiAdapter implements PromptAdapter {
  adaptSystemPrompt(prompt: string): string {
    // Gemini responds well to structured system prompts with clear sections
    return `## Instructions\n${prompt}\n\n## Output Requirements\nFollow the instructions exactly as specified.`;
  }
  adaptToolDefinitions(tools: ToolDef[]): any {
    return tools.map(t => ({
      name: t.name,
      description: t.description,
      parameters: { type: 'object', properties: t.parameters.properties, required: t.parameters.required }
    }));
  }
  adaptResponseFormat(format: ResponseFormat): any {
    return { responseMimeType: 'application/json', responseSchema: format.schema };
  }
  // parseToolCalls omitted for brevity
}
The adapter layer adds roughly 200 lines of code per provider, but it is essential. Without it, switching between models requires rewriting prompts — which defeats the purpose of dynamic routing. With adapters, the application code works with a provider-agnostic interface, and the adapter handles the translation. When we add a new model or provider, we write one adapter and all existing prompts work with it immediately.
The adapter approach also centralizes provider-specific workarounds. When Anthropic changed their tool-calling response format in a minor API update, we fixed it in one place (the AnthropicAdapter’s parseToolCalls method) rather than hunting through every prompt in the codebase.
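The dispatch from the provider-agnostic call site into the right adapter can be as simple as a registry lookup. This is a stripped-down sketch (a one-method interface rather than the full `PromptAdapter`, and `ADAPTERS`/`buildSystemPrompt` are hypothetical names) just to show the shape:

```typescript
// Simplified registry sketch: the call site never mentions a provider's
// prompt conventions, only the canonical prompt. Names are illustrative.
interface MiniAdapter {
  adaptSystemPrompt(prompt: string): string;
}

const ADAPTERS: Record<string, MiniAdapter> = {
  openai: { adaptSystemPrompt: p => p }, // pass-through
  anthropic: {
    // Claude variant appends explicit constraint language
    adaptSystemPrompt: p => `${p}\n\nCritical: Follow the instructions precisely.`
  },
  google: {
    // Gemini variant wraps the prompt in a structured section
    adaptSystemPrompt: p => `## Instructions\n${p}`
  }
};

function buildSystemPrompt(provider: string, canonicalPrompt: string): string {
  const adapter = ADAPTERS[provider];
  if (!adapter) throw new Error(`No adapter registered for provider: ${provider}`);
  return adapter.adaptSystemPrompt(canonicalPrompt);
}
```

Adding a provider then means adding one registry entry; no call site changes.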
Evaluation: Continuous Quality Monitoring Across Models
When you route different requests to different models, you need per-model quality monitoring. A quality drop on one model might be masked by stable quality on others if you only monitor aggregate scores. We run our LLM-as-judge evaluation with model-level breakdowns and alert on per-model regressions:
// evaluation-dashboard metrics — July 2025
{
  "period": "2025-07-01 to 2025-07-25",
  "totalRequests": 70800,
  "totalCost": 4720.00,
  "models": {
    "gpt-4o": {
      "requests": 12400,
      "percentOfTraffic": 17.5,
      "avgQuality": 4.3,
      "avgCostUsd": 0.042,
      "p95LatencyMs": 2800,
      "fallbackRate": 0.3
    },
    "gpt-4o-mini": {
      "requests": 34200,
      "percentOfTraffic": 48.3,
      "avgQuality": 3.9,
      "avgCostUsd": 0.003,
      "p95LatencyMs": 1200,
      "fallbackRate": 0.1
    },
    "claude-3-5-sonnet": {
      "requests": 8900,
      "percentOfTraffic": 12.6,
      "avgQuality": 4.4,
      "avgCostUsd": 0.048,
      "p95LatencyMs": 3100,
      "fallbackRate": 0.5
    },
    "claude-3-5-haiku": {
      "requests": 4500,
      "percentOfTraffic": 6.4,
      "avgQuality": 3.8,
      "avgCostUsd": 0.009,
      "p95LatencyMs": 1400,
      "fallbackRate": 0.2
    },
    "gemini-1.5-flash": {
      "requests": 10800,
      "percentOfTraffic": 15.3,
      "avgQuality": 3.7,
      "avgCostUsd": 0.001,
      "p95LatencyMs": 900,
      "fallbackRate": 0.2
    }
  }
}
This dashboard surfaces three actionable signals: if a model’s average quality drops below its historical baseline by more than 0.3 points, we investigate whether to re-route its traffic to a different model; if a model’s fallback rate spikes above 2%, the provider is having reliability issues and we may temporarily remove it from primary routing; if a model’s cost per request rises unexpectedly, the provider may have changed pricing (this happened with one provider’s API update that changed how system prompt tokens were counted).
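The first two of those signals reduce to simple threshold checks against the dashboard data. A sketch, using the thresholds from the text (0.3-point quality drop, 2% fallback rate); the function and type names are illustrative:

```typescript
// Sketch of the per-model regression checks described above.
// Thresholds come from the text; names are illustrative.
interface ModelStats {
  avgQuality: number;  // 1-5 scale from the eval suite
  fallbackRate: number; // percent of requests that fell back
}

function regressionAlerts(
  current: ModelStats,
  baselineQuality: number // model's historical average quality
): string[] {
  const alerts: string[] = [];
  // Quality dropped more than 0.3 points below the historical baseline
  if (baselineQuality - current.avgQuality > 0.3) {
    alerts.push('quality_regression');
  }
  // Fallback rate above 2% suggests a provider reliability problem
  if (current.fallbackRate > 2) {
    alerts.push('provider_reliability');
  }
  return alerts;
}
```

In practice this runs per model on each dashboard refresh, and any non-empty result pages the on-call engineer.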
We also track a “routing efficiency” metric: the percentage of requests where the router’s model selection matched what the quality evaluation suggests was the best model for that specific request. Currently at 82%, which means 18% of requests could have been served by a better model. The gap is primarily in the medium-complexity range where the classifier is least confident. We are experimenting with a “try cheap first, escalate if quality is low” pattern for these ambiguous cases — run the request on the cheap model, evaluate the response quality with a fast heuristic (not a full LLM judge, just a structure/length/keyword check), and escalate to a more capable model if the heuristic flags potential issues.
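The escalation pattern under experiment can be sketched as below. The heuristic checks and model names are illustrative placeholders, not the production heuristic:

```typescript
// Sketch of "try cheap first, escalate if quality looks low".
// passesFastHeuristic and the model choices are illustrative assumptions.
async function tryCheapFirst(
  request: string,
  call: (model: string, req: string) => Promise<string>
): Promise<{ model: string; response: string }> {
  const cheap = 'gpt-4o-mini';
  const capable = 'gpt-4o';
  const response = await call(cheap, request);
  if (passesFastHeuristic(response)) {
    return { model: cheap, response };
  }
  // Heuristic flagged the cheap response — escalate to the capable model
  const escalated = await call(capable, request);
  return { model: capable, response: escalated };
}

// Fast structural check: no LLM judge, just length/keyword signals
function passesFastHeuristic(response: string): boolean {
  if (response.length < 20) return false; // suspiciously short
  if (/i (cannot|can't) help/i.test(response)) return false; // refusal
  return true;
}
```

The trade-off is latency: flagged requests pay for two model calls in sequence, so this only makes sense where the heuristic's flag rate stays low.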
Multi-model architectures are more complex than single-model setups. The routing layer, fallback chains, prompt adapters, and per-model monitoring add real engineering overhead — roughly 400 hours of initial development and 8 hours/month of ongoing maintenance for our setup. But the benefits — 58% cost reduction, 99.95% uptime, task-specific quality optimization — justify that overhead for any team spending more than $5,000/month on AI API costs. Start with two models (a capable model and a cheap model) and a simple rules-based router. Add complexity — more models, a classifier, prompt adapters — as your usage patterns become clear and the ROI on routing sophistication materializes. The router does not need to be perfect on day one. It needs to be measurable so you can improve it over time.