Structured Output from LLMs: Reliable JSON Every Time
Getting an LLM to return valid JSON sounds trivial. Ask for JSON, get JSON. In practice, LLMs produce invalid JSON at a rate between 2% and 15% depending on the model, prompt complexity, and output schema. For a feature that makes 10,000 API calls per day, even a 2% failure rate means 200 broken responses daily. At scale, this becomes the most common source of production errors in LLM-powered applications.
We have spent months at Harbor Software building systems that reliably extract structured data from LLMs. Here is everything we have learned, from simple tricks that eliminate most failures to advanced techniques for complex schemas that push reliability above 99.8%.
The Baseline: Why LLMs Fail at JSON
LLMs are next-token predictors trained on natural language. JSON is a structured format with rigid syntax rules: matching braces, quoted keys, proper comma placement, no trailing commas, properly escaped special characters. The model does not “understand” JSON syntax the way a parser does. It has learned statistical patterns about what tokens follow other tokens in JSON-like text. This means it can produce JSON that looks right but has subtle syntax errors, especially in long or deeply nested outputs.
Common failure modes we have catalogued across thousands of production failures:
- Trailing commas – The model adds a comma after the last item in an array or object. Valid JavaScript, invalid JSON. This accounts for roughly 30% of our parse failures.
- Unescaped characters – Newlines, tabs, or quotes inside string values without proper escaping. Particularly common when the model is extracting text that itself contains quotes. About 25% of failures.
- Markdown wrapping – The model wraps JSON in ```json ... ``` code blocks because it was trained on markdown-formatted examples. About 20% of failures.
- Commentary – The model adds explanatory text before or after the JSON, like “Here is the JSON output:” or “I hope this helps!”. About 15% of failures.
- Truncation – For long outputs, the model hits the max token limit and produces incomplete JSON with missing closing braces. About 5% of failures.
- Type confusion – Numbers rendered as strings, booleans as “true”/“false” strings, null as the literal string “null”. About 5% of failures.
Understanding these failure modes is critical because each requires a different mitigation strategy. You cannot solve them all with a single approach.
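As a rough illustration of how three of these failure modes (markdown wrapping, commentary, trailing commas) can be handled before parsing, here is a minimal, dependency-free sketch. The function names are hypothetical, and the approach is a heuristic: it assumes the first `{` or `[` in the response starts the payload, and it does not attempt to repair unescaped characters or truncated output.

```typescript
/** Returns the first balanced {...} or [...] span in the text, or null. */
function extractJsonSpan(text: string): string | null {
  const start = text.search(/[{\[]/);
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{" || ch === "[") depth++;
    else if (ch === "}" || ch === "]") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // never balanced: most likely truncated output
}

/** Drops commas that directly precede a closing brace or bracket, outside strings. */
function stripTrailingCommas(json: string): string {
  let out = "";
  let inString = false;
  let escaped = false;
  for (let i = 0; i < json.length; i++) {
    const ch = json.charAt(i);
    if (inString) {
      out += ch;
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    if (ch === ",") {
      // Look ahead past whitespace for a closing brace or bracket.
      let j = i + 1;
      while (j < json.length && /\s/.test(json.charAt(j))) j++;
      const next = json.charAt(j);
      if (next === "}" || next === "]") continue; // skip this comma
    }
    out += ch;
  }
  return out;
}

/** Returns the parsed value, or null when no parseable payload is found. */
function parseLenient(raw: string): unknown {
  const span = extractJsonSpan(raw);
  if (span === null) return null;
  try {
    return JSON.parse(stripTrailingCommas(span));
  } catch {
    return null;
  }
}
```

A cleanup pass like this is best treated as a fallback behind the API-level techniques, since it only papers over symptoms.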
Level 1: API-Level JSON Mode
The simplest and most reliable approach is to use your provider’s built-in JSON mode. OpenAI offers response_format: { type: "json_object" } which constrains the model’s token generation to produce valid JSON. This eliminates markdown wrapping, commentary, and most syntax errors at the model level.
const response = await openai.chat.completions.create({
  model: 'gpt-4-1106-preview',
  response_format: { type: 'json_object' },
  messages: [
    {
      role: 'system',
      content: 'You extract structured data from text. Always respond with valid JSON.'
    },
    {
      role: 'user',
      content: `Extract the following fields from this customer review:\n\n"${review}"\n\nReturn JSON with these exact keys: sentiment (one of: positive, negative, neutral), topics (array of strings, max 5), rating_mentioned (number 1-5 or null if not mentioned)`
    }
  ]
});
const data = JSON.parse(response.choices[0].message.content);
Limitations you will encounter: JSON mode guarantees syntactically valid JSON (unless the response is cut off at the token limit, in which case it can still be incomplete) but does not guarantee your specific schema. The model might return {"result": "positive"} instead of {"sentiment": "positive"}. It might omit fields you expected. It might add fields you did not ask for. You still need schema validation on top of JSON mode. JSON mode is necessary but not sufficient.
Another subtlety: you must mention “JSON” in your system or user message when using JSON mode with OpenAI. If you do not, the API returns an error. This is documented but easy to miss.
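To make the gap between "valid JSON" and "the JSON you asked for" concrete, here is a small sketch that compares the keys of a parsed response against the keys named in the prompt. `checkKeys` and `EXPECTED_KEYS` are hypothetical names for illustration; the check covers key names only, not value types, which is what the schema validation discussed later is for.

```typescript
// Keys the prompt asked for in the JSON mode example.
const EXPECTED_KEYS = ["sentiment", "topics", "rating_mentioned"];

interface KeyReport {
  missing: string[];    // expected keys the model left out
  unexpected: string[]; // keys the model invented
}

function checkKeys(parsed: unknown, expected: string[]): KeyReport {
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    // Valid JSON, but not an object at all (e.g. a bare array or string).
    return { missing: expected.slice(), unexpected: [] };
  }
  const actual = Object.keys(parsed);
  return {
    missing: expected.filter((k) => actual.indexOf(k) === -1),
    unexpected: actual.filter((k) => expected.indexOf(k) === -1),
  };
}

// The response below parses fine, yet none of the requested keys are present.
const report = checkKeys(JSON.parse('{"result": "positive"}'), EXPECTED_KEYS);
```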
Level 2: Function Calling for Schema Enforcement
OpenAI’s function calling feature lets you define a JSON Schema for the expected output. The model generates the function’s arguments as JSON shaped by that schema. This is the most reliable method available today for structured output from GPT models, and it addresses both syntax and schema compliance.
const response = await openai.chat.completions.create({
  model: 'gpt-4-1106-preview',
  tools: [{
    type: 'function',
    function: {
      name: 'extract_review_data',
      description: 'Extract structured data from a customer review',
      parameters: {
        type: 'object',
        properties: {
          sentiment: {
            type: 'string',
            enum: ['positive', 'negative', 'neutral'],
            description: 'Overall sentiment of the review'
          },
          topics: {
            type: 'array',
            items: { type: 'string' },
            description: 'Key topics mentioned in the review (max 5)'
          },
          rating_mentioned: {
            type: ['number', 'null'],
            description: 'Numeric rating if explicitly mentioned, otherwise null'
          },
          key_quote: {
            type: 'string',
            description: 'The most representative sentence from the review'
          }
        },
        required: ['sentiment', 'topics', 'rating_mentioned', 'key_quote']
      }
    }
  }],
  tool_choice: { type: 'function', function: { name: 'extract_review_data' } },
  messages: [
    { role: 'user', content: `Analyze this review: "${review}"` }
  ]
});
const args = JSON.parse(
  response.choices[0].message.tool_calls[0].function.arguments
);
Function calling gives you JSON arguments that follow your schema (required fields, correct types, enum constraints) far more consistently than prompting alone, with no markdown wrapping or commentary. It is not a hard guarantee: OpenAI’s own documentation notes that the model does not always generate valid JSON and may produce parameters not defined in your schema, so validate the arguments before using them. In our production systems, function calling produces parseable, schema-valid output 99.7% of the time. Separately, output that passes schema validation can still carry semantically wrong values (e.g., “neutral” sentiment for an obviously positive review). Those are quality issues, not structural issues.
The tool_choice parameter with { type: 'function', function: { name: 'extract_review_data' } } forces the model to call this specific function rather than choosing to respond with plain text. Without this, the model sometimes decides to respond conversationally instead of calling the function, especially for ambiguous inputs.
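Reading `tool_calls[0]` directly throws if the model did not call the function or the output was cut off. The sketch below shows one defensive way to read the arguments. The `ToolCall` and `ChatChoice` interfaces are minimal structural types written for this example (they describe only the response fields used here), and `readToolArguments` is a hypothetical helper, not part of the OpenAI SDK.

```typescript
// Minimal structural types for the parts of a chat completion choice read below.
interface ToolCall {
  function: { name: string; arguments: string };
}
interface ChatChoice {
  finish_reason: string | null;
  message: { content: string | null; tool_calls?: ToolCall[] };
}

type ToolArgsResult =
  | { ok: true; args: unknown }
  | { ok: false; reason: "truncated" | "no_tool_call" | "wrong_function" | "invalid_json" };

function readToolArguments(choice: ChatChoice, expectedName: string): ToolArgsResult {
  // finish_reason "length" means generation stopped at the token limit.
  if (choice.finish_reason === "length") return { ok: false, reason: "truncated" };
  const call = choice.message.tool_calls ? choice.message.tool_calls[0] : undefined;
  if (!call) return { ok: false, reason: "no_tool_call" };
  if (call.function.name !== expectedName) return { ok: false, reason: "wrong_function" };
  try {
    return { ok: true, args: JSON.parse(call.function.arguments) };
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
}
```

Returning a tagged result instead of throwing makes each failure reason easy to count separately in your metrics.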
Level 3: Runtime Validation with Zod
Even with function calling, you need runtime validation. The model might produce valid JSON with unexpected values: empty arrays where you expected data, strings that are too long for your database column, numbers outside your expected range, or null values for fields you assumed would always be populated. Zod is the best library for this in TypeScript applications because it provides both compile-time types and runtime validation in a single definition.
import { z } from 'zod';
const ReviewSchema = z.object({
  sentiment: z.enum(['positive', 'negative', 'neutral']),
  topics: z.array(z.string().min(1).max(100)).min(1).max(10),
  rating_mentioned: z.number().min(1).max(5).nullable(),
  key_quote: z.string().min(10).max(500),
});
// Derive the static type from the schema so the two never drift apart.
type ReviewData = z.infer<typeof ReviewSchema>;
// `metrics` stands for your application's metrics client.
function parseReviewResponse(raw: string): ReviewData | null {
  try {
    const parsed = JSON.parse(raw);
    return ReviewSchema.parse(parsed);
  } catch (error) {
    if (error instanceof z.ZodError) {
      console.error('Schema validation failed:', error.issues);
      // Log specific field failures for monitoring
      error.issues.forEach(issue => {
        metrics.increment('llm.schema_error', {
          field: issue.path.join('.'),
          code: issue.code,
          message: issue.message
        });
      });
    } else {
      console.error('JSON parse failed:', error);
      metrics.increment('llm.json_parse_error');
    }
    return null;
  }
}
The monitoring aspect is critical. Track which fields fail validation most often. This tells you where your prompt needs improvement. If rating_mentioned frequently fails because the model returns a string like “4 out of 5” instead of the number 4, you need to add explicit type instructions to your prompt for that field. If topics frequently returns an empty array, your prompt does not make it clear that at least one topic is required.
We maintain a dashboard that shows validation failure rates by field, by model, and by prompt version. This dashboard is our primary feedback loop for prompt improvement. When a field’s failure rate exceeds 1%, we investigate and update the prompt.
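The per-field failure rate behind such a dashboard is simple to compute. The following is a minimal sketch under the assumption that each validated response is logged as a `ValidationEvent` (a hypothetical shape) listing the fields that failed. To break rates down by model or prompt version, filter the events before calling it.

```typescript
interface ValidationEvent {
  failedFields: string[]; // empty when the response passed validation
}

/** Fraction of responses in which each field failed validation. */
function fieldFailureRates(events: ValidationEvent[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const event of events) {
    // Count each field at most once per response.
    const unique = event.failedFields.filter((f, i, all) => all.indexOf(f) === i);
    for (const field of unique) counts[field] = (counts[field] || 0) + 1;
  }
  const rates: Record<string, number> = {};
  for (const field of Object.keys(counts)) rates[field] = counts[field] / events.length;
  return rates;
}

/** Fields whose failure rate exceeds the threshold (1% by default). */
function fieldsOverThreshold(events: ValidationEvent[], threshold = 0.01): string[] {
  const rates = fieldFailureRates(events);
  return Object.keys(rates).filter((field) => rates[field] > threshold).sort();
}
```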
Level 4: Retry and Self-Healing
When validation fails, you have two options: return an error or retry. For production applications, a retry strategy with the error message fed back to the model is surprisingly effective. The model is good at self-correcting when told exactly what went wrong.
// `llm`, `metrics`, and SYSTEM_PROMPT stand for your own client, metrics library, and prompt.
type Message = { role: 'system' | 'user' | 'assistant'; content: string };
async function extractWithRetry<T>(
  input: string,
  schema: z.ZodType<T>,
  maxRetries: number = 2
): Promise<{ data: T; attempts: number } | { error: string }> {
  let lastError = '';
  let lastRaw = '';
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const messages: Message[] = [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: input }
    ];
    // On retry, include the previous output and its validation error
    if (lastError) {
      messages.push({
        role: 'assistant',
        content: lastRaw
      });
      messages.push({
        role: 'user',
        content: `Your previous response had a validation error: ${lastError}. Please fix the specific issue and respond with corrected JSON. Do not change anything else.`
      });
    }
    const response = await llm.chat({
      messages,
      response_format: { type: 'json_object' }
    });
    lastRaw = response.content;
    try {
      const parsed = JSON.parse(response.content);
      const validated = schema.parse(parsed);
      metrics.histogram('llm.extraction_attempts', attempt);
      return { data: validated, attempts: attempt };
    } catch (error) {
      lastError = error instanceof z.ZodError
        ? error.issues.map(e => `${e.path.join('.')}: ${e.message}`).join('; ')
        : 'Invalid JSON syntax';
      metrics.increment('llm.extraction_retry', { attempt: String(attempt) });
    }
  }
  metrics.increment('llm.extraction_failed');
  return { error: `Failed after ${maxRetries + 1} attempts. Last error: ${lastError}` };
}
In our experience, the first retry succeeds 85% of the time when you feed the validation error back to the model. The model is good at fixing specific issues like “topics must have at least 1 item” or “rating_mentioned must be a number, not a string.” The second retry catches another 12%. Beyond two retries, success probability drops sharply and you are better off falling back to a different strategy or routing to a human review queue.
The cost of retries is modest. A retry adds one more LLM call. For a $0.06 call, the retry costs another $0.06. If 5% of requests need a retry, the total cost increase is 5% * $0.06 = $0.003 per request on average. That is well worth the reliability improvement.
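The same arithmetic extends to a second retry. As a sketch, assuming 5% of first attempts fail and the first retry fixes 85% of those (the figures quoted above), the expected number of calls per request can be computed like this. `expectedCalls` is a hypothetical helper written for this illustration.

```typescript
// Expected LLM calls per request when up to two retries are allowed.
// firstFailRate: fraction of first attempts that fail validation.
// retry1Success: fraction of those failures that the first retry fixes.
function expectedCalls(firstFailRate: number, retry1Success: number): number {
  const needFirstRetry = firstFailRate;
  const needSecondRetry = firstFailRate * (1 - retry1Success);
  return 1 + needFirstRetry + needSecondRetry;
}

const callCost = 0.06;                    // dollars per call, from the example above
const calls = expectedCalls(0.05, 0.85);  // 1 + 0.05 + 0.0075 = 1.0575 calls
const averageCost = callCost * calls;     // about $0.0635 per request
```

The second retry adds less than a tenth of a cent per request on average, because only the small share of requests that fail twice ever reach it.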
Level 5: Complex and Nested Schemas
Simple flat schemas are easy. The real challenge is complex, deeply nested schemas with conditional fields, arrays of heterogeneous objects, and cross-field validation rules. These are where LLMs struggle and where engineering effort pays off.
Flatten When Possible
Every level of nesting increases the error rate. If your schema has three levels of nesting, consider whether you can flatten it to two. Instead of {"customer": {"address": {"city": "..."}}}, try {"customer": {"address_city": "..."}}, or go fully flat with {"customer_address_city": "..."}. Yes, it is less elegant. It is also significantly more reliable. We measured a 15% reduction in extraction errors when we flattened a 3-level nested schema to 2 levels.
Split Complex Extractions
If you need to extract 20+ fields from a document, split the extraction into multiple focused calls. One call extracts contact information (5 fields). Another extracts financial data (5 fields). A third extracts metadata (5 fields). Each call is simpler, more reliable, and easier to debug when something goes wrong.
// Instead of one massive extraction:
const everything = await extract(doc, MassiveSchema); // 60% success rate
// Split into focused extractions:
const [contacts, financials, metadata] = await Promise.all([
  extract(doc, ContactSchema),   // 95% success rate
  extract(doc, FinancialSchema), // 93% success rate
  extract(doc, MetadataSchema)   // 97% success rate
]);
// Combined success: 0.95 * 0.93 * 0.97 = 85.7% (vs 60%)
const combined = { ...contacts, ...financials, ...metadata };
This approach has a higher total token cost (roughly 1.5-2x) but significantly higher reliability. It also parallelizes well with Promise.all, reducing wall-clock latency compared to a single long extraction. And when one part fails, you still have the other parts, enabling partial success rather than total failure. This holds only if extract reports failure as a value (such as null) rather than throwing, because Promise.all rejects as soon as any one call rejects.
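If your extraction function throws on failure, Promise.allSettled is one way to keep the parts that succeeded. The sketch below merges settled results and records which parts failed. `Settled`, `PartialResult`, and `mergeSettled` are hypothetical names; `Settled` is written to match the shape of the elements that Promise.allSettled resolves with.

```typescript
// Structurally matches the elements returned by Promise.allSettled.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

interface PartialResult {
  data: Record<string, unknown>; // merged fields from every part that succeeded
  failedParts: string[];         // names of the parts that need a retry or review
}

function mergeSettled(
  names: string[],
  results: Settled<Record<string, unknown>>[]
): PartialResult {
  const merged: PartialResult = { data: {}, failedParts: [] };
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      merged.data = { ...merged.data, ...result.value };
    } else {
      merged.failedParts.push(names[i]);
    }
  });
  return merged;
}
```

In use, you would pass it the output of `await Promise.allSettled([...])` over the three extract calls, along with the names `["contacts", "financials", "metadata"]` in the same order, and then retry only the entries listed in `failedParts`.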
Provide Examples in the Prompt
For complex schemas, a single example in the prompt is worth a thousand words of description. Show the model exactly what a correct output looks like for a representative input:
const systemPrompt = `You extract invoice data from text.
Example input: "Invoice #1234 from Acme Corp, dated Jan 15 2023. Items: Widget A x5 at $10.00 each, Widget B x2 at $25.00 each. Subtotal: $100.00. Tax (8%): $8.00. Total: $108.00. Payment due: Feb 15 2023."
Example output:
{
  "invoice_number": "1234",
  "vendor": "Acme Corp",
  "date": "2023-01-15",
  "due_date": "2023-02-15",
  "line_items": [
    { "description": "Widget A", "quantity": 5, "unit_price": 10.00, "total": 50.00 },
    { "description": "Widget B", "quantity": 2, "unit_price": 25.00, "total": 50.00 }
  ],
  "subtotal": 100.00,
  "tax_rate": 0.08,
  "tax_amount": 8.00,
  "total": 108.00
}
Extract invoice data from the following text. Match the exact schema shown above. Use null for any field that cannot be determined from the text.`;
The example serves multiple purposes: it shows the exact field names, the exact data types, the date format, how to handle array items, and what a complete output looks like. Models follow examples more reliably than they follow descriptions.
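One way to keep such an example from drifting out of sync with your schema is to generate the example output from a real object instead of hand-writing it in the prompt. This is a minimal sketch with hypothetical names (`FewShotExample`, `buildExtractionPrompt`), and the prompt wording is taken from the example above.

```typescript
interface FewShotExample<T> {
  input: string;
  output: T;
}

function buildExtractionPrompt<T>(task: string, example: FewShotExample<T>): string {
  return [
    task,
    `Example input: ${JSON.stringify(example.input)}`,
    "Example output:",
    // Serializing a real object means the example cannot contain a
    // trailing comma or an unquoted key.
    JSON.stringify(example.output, null, 2),
    "Match the exact schema shown above. Use null for any field that cannot be determined from the text.",
  ].join("\n");
}
```

If `output` is typed with the same type you infer from your validation schema, the compiler flags the example whenever a field is renamed or removed.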
Provider-Specific Strategies
Different providers handle structured output differently, and your strategy should adapt to each:
- OpenAI: Use function calling for best results. JSON mode as fallback. Both work well and are the most reliable options available.
- Anthropic Claude: No native JSON mode or function calling (as of this writing). Use XML tags in your prompt to delineate the expected output area: <json_output>...</json_output>. Claude respects these boundaries consistently. Parse the content between the tags. Include the output anchoring technique (end your prompt with <json_output> so the model continues inside the tags).
- Open-source models: Use constrained decoding libraries like outlines or jsonformer that modify the token sampling to only allow valid JSON tokens. This guarantees syntactically valid JSON at the inference level, which is the only truly reliable approach for smaller models that struggle with JSON consistency.
# Python: Using outlines for guaranteed valid JSON from open-source models
import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.1")
schema = '''{
  "type": "object",
  "properties": {
    "sentiment": { "type": "string", "enum": ["positive", "negative", "neutral"] },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
    "topics": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["sentiment", "confidence", "topics"]
}'''
generator = outlines.generate.json(model, schema)
result = generator("Analyze sentiment: 'This product is amazing and fast!'")
# result is guaranteed to be valid JSON matching the schema
# No parsing errors possible - the decoding is constrained at the token level
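For the XML-tag approach described in the provider list above, the parsing step might look like the following sketch. `extractTaggedJson` is a hypothetical helper. It assumes the tag name `json_output` from the text and handles both cases: the model emitting the opening tag itself, and output anchoring, where the opening tag sits at the end of the prompt and the response begins directly with the payload.

```typescript
/** Parses the JSON found inside <tag>...</tag>. Throws if the payload is not valid JSON. */
function extractTaggedJson(response: string, tag = "json_output"): unknown {
  const open = `<${tag}>`;
  const close = `</${tag}>`;
  const openAt = response.indexOf(open);
  // With output anchoring the opening tag lives in the prompt, so the
  // response may start directly with the JSON payload.
  const start = openAt === -1 ? 0 : openAt + open.length;
  const end = response.indexOf(close, start);
  // No closing tag: take the remainder and let JSON.parse decide.
  const payload = end === -1 ? response.slice(start) : response.slice(start, end);
  return JSON.parse(payload.trim());
}
```

The result is still unvalidated data, so it should go through the same schema validation as any other model output.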
Monitoring and Alerting
In production, you need dashboards that answer these questions in real time:
- What percentage of LLM responses are valid JSON? (Target: >99% with JSON mode)
- What percentage pass schema validation? (Target: >98%)
- Which specific fields fail validation most often, and what are the actual vs expected values?
- What is the retry rate? (Target: <5%)
- What is the total failure rate after all retries? (Target: <0.5%)
- Is the failure rate increasing over time? (Could indicate a model update or drift)
Set alerts on these metrics. A sudden spike in JSON parse failures usually indicates a model update on the provider side (they update model weights without notice), a prompt regression from a recent code change, or a new category of user input that your prompt does not handle. Catching these within minutes rather than hours makes the difference between a minor incident and an extended outage that affects all users.
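As a sketch of how the targets above could drive an alert, the function below checks one time window of counts against them. `WindowCounts` and `breachedTargets` are hypothetical names, and the thresholds are the targets listed in the questions above. Trend detection (the last question) would need a comparison across windows, which this sketch leaves out.

```typescript
interface WindowCounts {
  total: number;       // requests in the window
  parsedOk: number;    // responses that were valid JSON
  schemaOk: number;    // responses that also passed schema validation
  retried: number;     // requests that needed at least one retry
  failedFinal: number; // requests that still failed after all retries
}

/** Names of the targets that this window misses. */
function breachedTargets(c: WindowCounts): string[] {
  if (c.total === 0) return [];
  const breaches: string[] = [];
  if (c.parsedOk / c.total <= 0.99) breaches.push("valid_json_rate");
  if (c.schemaOk / c.total <= 0.98) breaches.push("schema_pass_rate");
  if (c.retried / c.total >= 0.05) breaches.push("retry_rate");
  if (c.failedFinal / c.total >= 0.005) breaches.push("final_failure_rate");
  return breaches;
}
```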
Conclusion
Reliable structured output from LLMs is a solved problem if you layer the right techniques. Use API-level JSON mode and function calling as your first line of defense. Add Zod validation for runtime type safety and semantic constraints. Implement retry with error feedback for self-healing. Monitor everything continuously. Split complex schemas into focused extractions for higher per-call reliability.
The combination of function calling + Zod validation + one retry gives us 99.8% success rates on structured extraction tasks in production. That remaining 0.2% falls through to a human review queue or a fallback default value, depending on the use case.
One additional technique worth mentioning: for applications where even 0.2% failure is too high, implement a dual-extraction pattern. Run the same extraction through two different models (GPT-4 and Claude, or GPT-4 and GPT-3.5-turbo) and compare the results. If they agree, you have high confidence in the output. If they disagree, flag the extraction for human review. This consensus approach pushes reliability above 99.95% at the cost of 2x the API calls, which is worth it for high-value extractions like financial document processing or medical record analysis.
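The comparison step of the dual-extraction pattern can be sketched as follows. The names `canonical`, `Consensus`, and `reconcile` are hypothetical. The sketch treats two results as agreeing only when they are exactly equal after normalizing object key order, which is an assumption that suits categorical and numeric fields. Free-text fields such as a quoted sentence will rarely match exactly, so you may want to compare only the fields that should be deterministic.

```typescript
// Serializes with sorted object keys so key order does not affect the comparison.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (typeof value === "object" && value !== null) {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

type Consensus<T> =
  | { status: "agreed"; data: T }
  | { status: "needs_review"; candidates: [T, T] };

/** Accepts the extraction when both models agree, otherwise flags it for review. */
function reconcile<T>(a: T, b: T): Consensus<T> {
  return canonical(a) === canonical(b)
    ? { status: "agreed", data: a }
    : { status: "needs_review", candidates: [a, b] };
}
```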
The key insight across all these techniques is that reliability comes from layers, not from any single approach. API-level constraints catch syntax errors. Schema validation catches type errors. Retry catches transient failures. Monitoring catches systematic degradation. Each layer handles a different failure mode, and together they produce a system that is reliable enough for production use. Perfect is not achievable with probabilistic models, but 99.8% (or 99.95% with dual extraction) is close enough that the remaining failures are statistical noise rather than a systematic problem that users will notice.