Testing 288 LLM Outputs Reveals Consistent JSON Parsing Failures Across All Providers
Key Takeaways
- Markdown fences wrapping JSON are the single most common failure mode across all LLM providers, occurring consistently even with explicit instructions to avoid them
- Multiple failure modes often compound in the same response, breaking isolated fixes and requiring comprehensive, tested error recovery strategies
- Even sophisticated models exhibit language-specific output issues, including Python syntax (True/False/None) mixed into JSON, trailing commas, comments, and unescaped characters
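The most frequent failure mode, a markdown fence wrapping otherwise-valid JSON, can be handled with a small pre-parse step. A minimal sketch, assuming a plain-text model response (the helper name and regex are illustrative, not from the original research):

```python
import re

def strip_markdown_fence(text: str) -> str:
    """Unwrap a ```json ... ``` (or bare ```) fence around a model response.

    Returns the inner content if a fence is found, otherwise the
    stripped input unchanged.
    """
    match = re.search(r"```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
    return match.group(1) if match else text.strip()
```

Applied before `json.loads`, this turns the single most common failure into a non-event, though it does nothing for the other failure modes listed above.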
Summary
An independent researcher tested structured output from 288 model calls across every major LLM provider to identify failure modes in JSON generation. Using OpenRouter to evaluate models including GPT-4o, Claude, Gemini, Llama 3, Mistral, Command R, DeepSeek, and Qwen, the researcher found remarkably consistent patterns of broken output across all providers—patterns that remain largely unaddressed despite JSON mode support in some models.
The most common failure modes identified include markdown fences wrapping JSON output (the single most frequent issue across 288 calls), trailing commas borrowed from JavaScript syntax, language-specific boolean/null representations (Python's True/False/None instead of JSON's true/false/null), JSON-incompatible comments, and unescaped quotes within string values. The research reveals a critical insight: while individual failures are manageable, the real production problem emerges when multiple failures compound in the same response. Compounded failures break naive regex-based fix-all patterns and cascade into downstream parsing failures.
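To illustrate how compounded failures resist a single fix-all pattern, here is a hedged sketch that applies repairs in sequence: fence stripping, Python-literal normalization, then comment and trailing-comma removal. The function name and regexes are illustrative assumptions, not the researcher's code, and each regex can still misfire inside string values, which is exactly why the article calls for tested recovery strategies:

```python
import json
import re

def repair_json(raw: str) -> dict:
    """Repair a compounded-failure response by layering fixes in order.

    A single regex cannot handle a fence, Python literals, a comment,
    and trailing commas all at once; sequencing the repairs can.
    """
    text = raw.strip()
    # 1. Strip a wrapping markdown fence, if present.
    fence = re.search(r"```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # 2. Normalize Python literals (word boundaries avoid partial matches,
    #    but this can still corrupt string values containing these words).
    for py, js in (("True", "true"), ("False", "false"), ("None", "null")):
        text = re.sub(rf"\b{py}\b", js, text)
    # 3. Drop // comments, then trailing commas before } or ].
    text = re.sub(r"//[^\n]*", "", text)
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```

A response exhibiting three failure modes at once, such as `'```json\n{"ok": True, // done\n"items": [1, 2,],\n}\n```'`, parses cleanly after the sequence, while any one of the fixes alone would still fail.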
These findings highlight a fundamental gap in production LLM integrations: model outputs often "almost" conform to specifications, but that "almost" creates cascading failures. JSON mode helps where available but isn't universal across models or exposed by all providers. The research suggests that robust production systems require defensive validation and error recovery strategies rather than expecting perfect structured output from models.
- JSON mode doesn't universally solve the problem: it isn't available on all models and isn't exposed by all providers, forcing production systems to implement defensive parsing
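The defensive-parsing stance above can be sketched as a wrapper that validates structure and falls back instead of crashing downstream. This is a minimal illustration under assumed names (`parse_with_fallback`, `required_keys`, `default` are hypothetical), not a prescribed implementation:

```python
import json

def parse_with_fallback(raw: str, required_keys: set, default: dict) -> dict:
    """Parse model output defensively: never let a malformed response
    propagate past this boundary.

    Returns the parsed dict only if it is valid JSON, is an object,
    and contains every required key; otherwise returns the fallback.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return default  # invalid JSON: fall back (or trigger a repair/retry pass)
    if not isinstance(data, dict) or not required_keys <= data.keys():
        return default  # parsed, but missing expected fields
    return data
```

In practice the fallback branch is where a repair pass or a re-prompt would slot in; the point is that downstream code only ever sees a validated shape.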
Editorial Opinion
This research addresses one of the least glamorous but most critical challenges in production LLM systems. The consistent failure patterns across all providers suggest this isn't a problem with any single model, but a fundamental mismatch between how LLMs generate text and how structured data should work. The findings make a compelling case that teams need to move beyond hoping for perfect output and instead invest in defensive validation and recovery strategies—treating structured output failures as inevitable rather than exceptional.


