Empty System Prompts Outperform Detailed Instructions in Claude Benchmark Study
Key Takeaways
- Empty system prompts (no CLAUDE.md) achieved the highest code quality score across 1,188 benchmark runs, outperforming both readable and compressed instruction sets
- Compressed instructions hurt performance for Claude Haiku and Sonnet models, with Sonnet showing a 2.81-point decrease on large compressed profiles versus readable versions
- The quality difference between best and worst profiles was only 0.6 points on a 100-point scale, suggesting instruction detail has minimal impact on Claude output quality
Summary
A comprehensive benchmark study of Claude AI models has revealed surprising results about system prompt optimization. Jonathan Chilcher, a Senior SRE at TechLoom, conducted 1,188 test runs across three Claude models (Haiku 4.5, Sonnet 4.6, and Opus 4.6) using five different CLAUDE.md profile configurations ranging from empty to extensively detailed instructions. The study tested 12 standardized coding tasks across bug fixes, code generation, refactoring, and instruction-following categories, with scoring based on test pass rates, code quality metrics, and LLM evaluation.
The results contradicted conventional wisdom about AI system prompts: the empty profile with no instructions achieved the highest overall quality score of 91.8 out of 100, while the most detailed compressed profile scored 90.6, leaving little to separate any configuration across the entire spectrum. More significantly, the study directly tested Chilcher's own previous advice about compressing CLAUDE.md files by removing markdown formatting. Compressed instructions consistently underperformed readable versions for Haiku and Sonnet models, with Sonnet showing a 2.81-point quality decrease on large compressed profiles. Only Opus showed a marginal improvement from compression, of less than one point.
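The compression advice under test amounts to stripping markdown formatting from a CLAUDE.md file to save tokens. A minimal sketch of what such a transformation might look like is below; the function name and the exact set of rewrites are assumptions for illustration, not the transformation used in the study.

```python
import re

def compress_claude_md(text: str) -> str:
    """Illustrative CLAUDE.md compression: remove markdown formatting and
    collapse blank lines so the same instructions consume fewer tokens.
    Hypothetical sketch; not the study's actual preprocessing."""
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.M)   # heading markers
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)         # bold emphasis
    text = re.sub(r"^\s*[-*]\s+", "", text, flags=re.M)  # list bullets
    text = re.sub(r"\n{2,}", "\n", text)                 # blank lines
    return text.strip()

# Example: a small instruction file loses its formatting but keeps its content.
readable = "# Rules\n\n- **Always** run tests\n- Keep diffs small\n"
compressed = compress_claude_md(readable)
```

The study's finding was that this kind of formatting removal tended to hurt Haiku and Sonnet output quality, so the token savings may not be worth it.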
The findings suggest that detailed system instructions may provide minimal quality benefit while consuming valuable context-window space. The extremely narrow quality spread of 0.6 points across all configurations indicates that Claude models perform remarkably consistently regardless of instruction detail level. Chilcher has open-sourced the benchmarking tool, claude-benchmark, enabling other developers to run standardized, reproducible A/B tests of their own prompt configurations using a methodology that combines automated testing, linting, complexity analysis, and LLM-based code evaluation.
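A scoring pipeline of this shape can be sketched as a weighted composite over per-run metrics, averaged per CLAUDE.md profile. The metric names, dataclass, and weights below are assumptions for illustration; the article does not disclose claude-benchmark's actual weighting.

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Raw metrics from one benchmark run, each normalized to a 0-100 scale."""
    test_pass_rate: float    # percentage of the task's tests that pass
    lint_score: float        # e.g. 100 minus a penalty per lint finding
    complexity_score: float  # e.g. scaled inverse of cyclomatic complexity
    llm_eval_score: float    # rubric score assigned by an LLM judge

# Hypothetical weights; chosen here only to show the composite structure.
WEIGHTS = {
    "test_pass_rate": 0.5,
    "lint_score": 0.15,
    "complexity_score": 0.15,
    "llm_eval_score": 0.2,
}

def quality_score(m: RunMetrics) -> float:
    """Weighted composite quality score on a 100-point scale."""
    return (WEIGHTS["test_pass_rate"] * m.test_pass_rate
            + WEIGHTS["lint_score"] * m.lint_score
            + WEIGHTS["complexity_score"] * m.complexity_score
            + WEIGHTS["llm_eval_score"] * m.llm_eval_score)

def compare_profiles(runs_by_profile: dict[str, list[RunMetrics]]) -> dict[str, float]:
    """Mean composite score per system-prompt profile, for A/B comparison."""
    return {
        profile: sum(quality_score(r) for r in runs) / len(runs)
        for profile, runs in runs_by_profile.items()
    }
```

Averaging a composite like this over many runs per profile (1,188 in the study) is what makes small differences, such as a 0.6-point spread, statistically meaningful rather than noise.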
Editorial Opinion
This research represents a valuable contribution to the growing body of empirical prompt engineering science, moving beyond anecdotal advice to data-driven conclusions. The finding that empty prompts match or exceed detailed instructions challenges assumptions about AI system design and suggests that frontier models like Claude may have sufficiently strong base capabilities that extensive prompting becomes counterproductive. The relatively small quality variance across all configurations—just 0.6 points—is perhaps the study's most important insight, indicating that developers might be over-optimizing system prompts when their effort would be better spent elsewhere. The open-sourcing of the benchmark tool is equally important, providing the community with reproducible methodology to test prompt engineering claims.