BotBeat

Anthropic · RESEARCH · 2026-03-05

Inside Claude Code's Prompt Caching: How Anthropic Cut Costs 80% Through Architectural Constraints

Key Takeaways

  • Prompt caching reduces Claude Code session costs by 80-90% (from $50-100 to $10-19) by storing KV cache computations and charging $0.50 instead of $5 per million tokens for cached reads
  • The system is prefix-based and fragile: any change to earlier parts of the prompt invalidates all subsequent cached computations
  • Anthropic's Claude Code team treats cache hit rates as critical infrastructure, declaring severity events when they drop
Source: Hacker News, via https://www.claudecodecamp.com/p/how-prompt-caching-actually-works-in-claude-code

Summary

Engineer Abhishek Ray has published detailed experiments revealing how prompt caching works in Claude Code, the architectural foundation that makes Anthropic's coding assistant economically viable. Through four practical experiments with the Anthropic API, Ray demonstrates how prompt caching reduces costs from $50-100 per extended coding session to just $10-19 by storing and reusing computed intermediate states. The technique, called prefix caching, stores the Key-Value cache from transformer attention mechanisms, allowing the model to skip reprocessing unchanged portions of conversation history. However, the system is fragile: any change to the prompt prefix—adding an MCP tool, inserting a timestamp, or switching models—can invalidate the entire cache and quintuple costs for that interaction.
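The prefix sensitivity described above can be illustrated with a toy cache keyed on exact token prefixes. This is a simplified sketch, not Anthropic's serving implementation: the dictionary of prefix hashes stands in for the stored KV state.

```python
import hashlib

class ToyPrefixCache:
    """Caches 'computed state' for exact token prefixes, a stand-in
    for the KV cache a real serving stack would store."""
    def __init__(self):
        self.store = {}  # prefix hash -> number of tokens covered

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return (cached_tokens, uncached_tokens) for this request,
        then record every prefix of it for next time."""
        cached = 0
        for i in range(len(tokens), 0, -1):  # longest matching prefix wins
            if self._key(tokens[:i]) in self.store:
                cached = i
                break
        for i in range(1, len(tokens) + 1):
            self.store[self._key(tokens[:i])] = i
        return cached, len(tokens) - cached

cache = ToyPrefixCache()
system = ["system:", "you", "are", "a", "coding", "agent"]
turn1 = system + ["user:", "fix", "the", "bug"]
print(cache.process(turn1))        # (0, 10): first call, nothing cached
turn2 = turn1 + ["assistant:", "done", "user:", "thanks"]
print(cache.process(turn2))        # (10, 4): history reused, only new tokens paid
stamped = ["2026-03-05"] + turn2   # a timestamp prepended: prefix changed
print(cache.process(stamped))      # (0, 15): every cached prefix misses
```

The last call is the failure mode the article describes: one token inserted early in the prompt means no stored prefix matches, so the entire history is reprocessed at the uncached rate.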

The research reveals that Claude Code's engineering team treats prompt caching as a critical architectural constraint, even declaring SEVs (severity events) when cache hit rates drop. With a 90% cache hit rate, cached token reads cost just $0.50 per million tokens compared to $5 for uncached processing on Opus. This cost structure is what enables Claude Code Pro's $20/month subscription model to remain profitable. Ray explains the technical mechanism: during transformer attention, each token produces Query, Key, and Value vectors, and the KV cache stores these intermediate computations for already-processed tokens, eliminating redundant computation as conversation history grows.
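The KV-cache mechanism Ray describes can be sketched with a toy one-dimensional causal attention layer. All numbers are made up for illustration; real models use high-dimensional vectors and many attention heads, but the equivalence shown here is the same.

```python
import math

def causal_attention(queries, keys, values):
    """Causal attention over 1-D 'vectors' (plain floats):
    token i attends only to keys/values 0..i."""
    out = []
    for i, q in enumerate(queries):
        scores = [q * k for k in keys[: i + 1]]
        m = max(scores)  # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        out.append(sum(w * v for w, v in zip(weights, values[: i + 1]))
                   / sum(weights))
    return out

# Full recompute over a 5-token prompt (arbitrary example values).
q = [0.1, 0.4, -0.2, 0.3, 0.7]
k = [0.5, -0.1, 0.2, 0.9, 0.3]
v = [1.0, 2.0, 3.0, 4.0, 5.0]
full = causal_attention(q, k, v)

# Incremental decode with a KV cache: K/V for the first 4 tokens are
# already stored, so only the new token's attention is computed.
k_cache, v_cache = k[:4], v[:4]
k_cache.append(k[4])
v_cache.append(v[4])
scores = [q[4] * kk for kk in k_cache]
m = max(scores)
weights = [math.exp(s - m) for s in scores]
incremental = sum(w * vv for w, vv in zip(weights, v_cache)) / sum(weights)

assert abs(incremental - full[-1]) < 1e-12  # identical result, less compute
```

Because token 5's output depends only on tokens 1-5, reusing the stored keys and values reproduces the full recomputation exactly; this is why prefix caching is lossless as long as the prefix is byte-identical.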

The experiments highlight a fundamental tradeoff in LLM API design: while prefix caching enables dramatic cost savings, it creates brittleness where seemingly minor changes—like reordering tools or adding timestamps—can trigger expensive cache invalidations. This reveals how modern AI products are increasingly built around cost optimization constraints rather than purely on capability improvements, with architectural decisions shaped by the economics of token processing at scale.
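In Anthropic's Messages API, callers opt into this behavior with `cache_control` markers on prompt blocks. The payload below is a hedged sketch: the `cache_control: {"type": "ephemeral"}` placement follows Anthropic's documented prompt-caching API, but the prompt text, tool definition, and model id are hypothetical, and the request is only constructed here, never sent.

```python
# Stable content goes first; the cache_control marker on the last
# stable block tells the API where the cacheable prefix ends.
STABLE_SYSTEM_PROMPT = "You are a coding agent."  # hypothetical prompt
TOOL_DEFINITIONS = [{
    "name": "read_file",  # hypothetical tool
    "description": "Read a file from the workspace",
    "input_schema": {"type": "object", "properties": {}},
}]

payload = {
    "model": "claude-opus-4-20250514",  # hypothetical model id
    "max_tokens": 1024,
    "tools": TOOL_DEFINITIONS,
    "system": [
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to and including this block becomes the
            # cacheable prefix. A timestamp inserted above this point,
            # or a reordered tool in TOOL_DEFINITIONS, changes the
            # prefix bytes and forces a full recompute.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Fix the failing test."}],
}

# Volatile content (timestamps, per-request context) belongs after the
# cached prefix, in the messages, so the prefix never changes.
```

This is the design discipline the article alludes to: everything that varies per request must sit below the cache breakpoint.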

  • The technique works because transformer decoding is causal and autoregressive: each token's computation depends only on the tokens before it, so cached prefix states remain exactly valid
  • Minor implementation choices like timestamp placement or tool ordering can 5x costs by breaking cache coherence
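The pricing figures quoted in the article make that cost cliff easy to quantify. The arithmetic below is illustrative, using the article's $5 (uncached) and $0.50 (cached read) per-million-token Opus rates and an assumed session that re-reads 20M tokens of context in total:

```python
UNCACHED_PER_MTOK = 5.00  # Opus input rate cited in the article, $/MTok
CACHED_PER_MTOK = 0.50    # cached-read rate cited in the article, $/MTok

def input_cost(total_tokens, hit_rate):
    """Dollar cost of processing input tokens at a given cache hit rate."""
    cached = total_tokens * hit_rate
    uncached = total_tokens - cached
    return (cached * CACHED_PER_MTOK + uncached * UNCACHED_PER_MTOK) / 1e6

session_tokens = 20_000_000  # assumed cumulative context re-reads
print(input_cost(session_tokens, 0.90))  # 19.0 -> $19 at a 90% hit rate
print(input_cost(session_tokens, 0.0))   # 100.0 -> $100 with the cache broken
```

At the article's 90% hit rate the session lands at $19, inside the quoted $10-19 band; with the cache fully invalidated the same session costs $100, a roughly 5x jump that matches the "quintuple costs" claim.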

Editorial Opinion

This research exposes a fascinating tension in production AI systems: the gap between theoretical capability and economic viability. Anthropic has essentially built Claude Code around a caching hack that requires extraordinary engineering discipline to maintain—one misplaced timestamp can blow up costs 5x. It's a reminder that the current generation of AI products isn't constrained by what models can do, but by what providers can afford to run at scale. The fact that cache hit rates warrant SEV declarations tells you everything about where the real engineering challenges lie in 2026.

Large Language Models (LLMs) · Machine Learning · MLOps & Infrastructure · Finance & Fintech · Market Trends

© 2026 BotBeat