Inside Claude Code's Prompt Caching: How Anthropic Cut Costs 80% Through Architectural Constraints
Key Takeaways
- Prompt caching reduces Claude Code session costs by 80-90% (from $50-100 to $10-19) by storing KV-cache computations and charging $0.50 instead of $5 per million tokens
- The system is prefix-based and fragile: any change to earlier parts of the prompt invalidates all subsequent cached computations
- Anthropic's Claude Code team treats cache hit rates as critical infrastructure, declaring severity events (SEVs) when they drop
Summary
Engineer Abhishek Ray has published detailed experiments revealing how prompt caching works in Claude Code; the technique is the architectural foundation that makes Anthropic's coding assistant economically viable. Through four practical experiments with the Anthropic API, Ray demonstrates how prompt caching reduces costs from $50-100 per extended coding session to just $10-19 by storing and reusing computed intermediate states. The technique, called prefix caching, stores the Key-Value (KV) cache from the transformer's attention mechanism, allowing the model to skip reprocessing unchanged portions of the conversation history. The system is fragile, however: any change to the prompt prefix, such as adding an MCP tool, inserting a timestamp, or switching models, can invalidate the entire cache and quintuple costs for that interaction.
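The prefix-matching behavior described above can be illustrated with a toy sketch. This is not Anthropic's implementation; the `PrefixCache` class and its identity-hash keying are hypothetical, chosen only to show why any byte-level change to an earlier part of the prompt (such as reordering tools) misses the cache entirely:

```python
import hashlib


class PrefixCache:
    """Toy prefix cache: maps a hash of the prompt prefix to a stored KV state.

    Any byte-level change to the prefix produces a different key, so the
    cached state for every token after the change is lost, mirroring the
    invalidation behavior the article describes.
    """

    def __init__(self):
        self._store = {}

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def lookup(self, prefix: str):
        # Returns the stored state on an exact prefix match, else None.
        return self._store.get(self._key(prefix))

    def save(self, prefix: str, kv_state) -> None:
        self._store[self._key(prefix)] = kv_state


cache = PrefixCache()
system_prompt = "You are a coding assistant. Tools: [read_file, write_file]"
cache.save(system_prompt, kv_state="<expensive KV computation>")

assert cache.lookup(system_prompt) is not None   # exact prefix: cache hit
reordered = "You are a coding assistant. Tools: [write_file, read_file]"
assert cache.lookup(reordered) is None           # reordered tools: cache miss
```

Real implementations cache at the KV-tensor level rather than hashing strings, but the all-or-nothing sensitivity to the prefix is the same.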
The research reveals that Claude Code's engineering team treats prompt caching as a critical architectural constraint, even declaring SEVs (severity events) when cache hit rates drop. With a 90% cache hit rate, cached token reads cost just $0.50 per million tokens compared to $5 for uncached processing on Opus. This cost structure is what enables Claude Code Pro's $20/month subscription model to remain profitable. Ray explains the technical mechanism: during transformer attention, each token produces Query, Key, and Value vectors, and the KV cache stores these intermediate computations for already-processed tokens, eliminating redundant computation as conversation history grows.
The experiments highlight a fundamental tradeoff in LLM API design: while prefix caching enables dramatic cost savings, it creates brittleness where seemingly minor changes—like reordering tools or adding timestamps—can trigger expensive cache invalidations. This reveals how modern AI products are increasingly built around cost optimization constraints rather than purely on capability improvements, with architectural decisions shaped by the economics of token processing at scale.
- The technique works because transformer decoding is autoregressive: each token attends only to earlier tokens, so the KV entries for an unchanged prefix remain valid and prefix caching is mathematically sound
- Minor implementation choices such as timestamp placement or tool ordering can quintuple costs by breaking the prefix match
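The cost figures above can be checked with simple blended-rate arithmetic. The function below is an illustrative sketch using the article's Opus numbers ($0.50 per million cached tokens, $5 per million uncached); the function name and parameters are hypothetical:

```python
def session_cost_usd(total_tokens_m, cache_hit_rate,
                     cached_rate=0.50, uncached_rate=5.00):
    """Blended input-token cost in USD for a session.

    total_tokens_m: total input tokens processed, in millions.
    cache_hit_rate: fraction of tokens served as cached reads.
    Rates are per million tokens, per the article's Opus figures.
    """
    cached = total_tokens_m * cache_hit_rate * cached_rate
    uncached = total_tokens_m * (1 - cache_hit_rate) * uncached_rate
    return cached + uncached


# 10M input tokens at a 90% hit rate vs. the same session uncached:
with_cache = session_cost_usd(10, 0.90)   # ~$9.50
no_cache = session_cost_usd(10, 0.0)      # $50.00
```

At a 90% hit rate the blended rate is roughly $0.95 per million tokens, which is how a $50 session drops into the $10-19 range the article reports, and why losing the cache for even one interaction is so expensive.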
Editorial Opinion
This research exposes a fascinating tension in production AI systems: the gap between theoretical capability and economic viability. Anthropic has essentially built Claude Code around a caching hack that requires extraordinary engineering discipline to maintain—one misplaced timestamp can blow up costs 5x. It's a reminder that the current generation of AI products isn't constrained by what models can do, but by what providers can afford to run at scale. The fact that cache hit rates warrant SEV declarations tells you everything about where the real engineering challenges lie in 2026.