BotBeat

Independent/Open Source · OPEN SOURCE · 2026-03-04

ContextCache Open-Source Tool Achieves 29x Speedup in LLM Tool Calling by Caching KV States

Key Takeaways

  • ContextCache achieves up to a 29.2x speedup in LLM tool calling by caching tool-schema KV states, reducing TTFT from 5.6 seconds to 193ms for 50 tools
  • The system skips 99% of prefill tokens while maintaining identical quality (TSA 0.850), potentially saving 62 million tokens daily at moderate scale
  • Includes a CPU-only orchestrator option using llama.cpp that routes queries in ~550ms without requiring GPU resources
Source: Hacker News — https://github.com/spranab/contextcache

Summary

Developer spranab has released ContextCache, an open-source middleware solution that dramatically accelerates tool-calling performance in Large Language Models by caching key-value (KV) states of tool schemas. The system addresses a fundamental inefficiency in current LLM architectures: every tool-calling request traditionally resends complete tool schemas through the prefill stage, reprocessing thousands of tokens even when tools remain unchanged. With 50 tools, this amounts to approximately 6,000 tokens being reprocessed on every single request for every user.
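The core idea can be sketched in a few lines. The toy cache below is a hypothetical illustration, not ContextCache's actual API: the expensive prefill pass over the tool-schema prefix is simulated by token counting, the compiled state is keyed on a hash of the schema text, and only the user query is processed on subsequent requests.

```python
import hashlib

# Minimal sketch of prefix KV caching (illustrative names, not the
# real ContextCache interface). The tool-schema prefix is "compiled"
# once and reused; only the user query hits prefill afterwards.

class PrefixCache:
    def __init__(self):
        self._cache = {}          # schema hash -> compiled prefix state
        self.prefill_tokens = 0   # tokens actually run through prefill

    def _compile(self, schema_text):
        # Stand-in for the real prefill pass that builds KV states.
        tokens = schema_text.split()
        self.prefill_tokens += len(tokens)
        return tokens

    def run(self, schema_text, user_query):
        key = hashlib.sha256(schema_text.encode()).hexdigest()
        if key not in self._cache:        # cold start: pay full prefill once
            self._cache[key] = self._compile(schema_text)
        query_tokens = user_query.split() # only the query is prefilled now
        self.prefill_tokens += len(query_tokens)
        return self._cache[key] + query_tokens

schemas = " ".join(f"tool_{i} desc arg" for i in range(50))  # 150 tokens here
cache = PrefixCache()
for q in ["check weather", "send email", "book flight"]:
    cache.run(schemas, q)
# Without caching: 3 * (150 + 2) = 456 prefill tokens; with it: 150 + 6.
print(cache.prefill_tokens)
```

The same asymmetry drives the real numbers: the schema prefix dominates the request, so amortizing it once leaves only the short query on the hot path.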

ContextCache compiles tool schemas into a KV cache once and reuses it across all requests, allowing only user queries to pass through prefill. Testing on Qwen3-8B running on an RTX 3090 Ti demonstrated impressive results: with 50 tools, time-to-first-token (TTFT) dropped from 5,625ms to 193ms—a 29.2x speedup—while skipping 99% of prefill tokens with zero quality degradation. The system maintains a Tool Schema Accuracy (TSA) of 0.850, matching full prefill performance exactly. At scale, this translates to saving 62.1 million tokens per day at 10,000 requests.
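A back-of-the-envelope check using the rounded figures above lands close to the reported numbers (the small gap versus the headline 29.2x and 62.1M presumably comes from unrounded measurements):

```python
# Rounded figures from the benchmark writeup.
ttft_full_ms, ttft_cached_ms = 5625, 193
speedup = ttft_full_ms / ttft_cached_ms   # ~29.1x from rounded inputs

tokens_per_request = 6000                 # ~schema tokens resent for 50 tools
requests_per_day = 10_000
skip_rate = 0.99                          # fraction of prefill tokens skipped
tokens_saved = tokens_per_request * requests_per_day * skip_rate

print(f"{speedup:.1f}x speedup, {tokens_saved / 1e6:.1f}M tokens/day saved")
```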

The project includes two deployment options: a route-only mode (~500ms) for tool detection without GPU requirements, and a full pipeline (~3 seconds) that handles routing, parameter extraction, execution, and response synthesis. The route-only orchestrator uses llama.cpp with Qwen3.5-2B on CPU, making it accessible for resource-constrained environments. ContextCache is compatible with any LLM backend, including Ollama, Claude, OpenAI, xAI, DeepSeek, Groq, and self-hosted solutions. The project is released under the CC BY 4.0 license with an accompanying research paper and is available on GitHub.
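The split between the two modes can be illustrated with a toy dispatcher. Everything here is a hypothetical stand-in (including the keyword-scoring router, which substitutes for the small llama.cpp model), not ContextCache's real interface:

```python
# Illustrative sketch of the two deployment modes; names are hypothetical.

def route_only(query, tools):
    # Cheap CPU pass: pick the most relevant tool, nothing more.
    # (The article uses llama.cpp with a small model for this step;
    # keyword scoring stands in for it here.)
    scores = {t["name"]: sum(w in query for w in t["keywords"]) for t in tools}
    return max(scores, key=scores.get)

def full_pipeline(query, tools):
    # Route, then extract parameters, execute, and synthesize a response.
    name = route_only(query, tools)
    tool = next(t for t in tools if t["name"] == name)
    args = {k: query for k in tool.get("params", [])}  # stub extraction
    result = tool["fn"](**args)
    return f"{name} -> {result}"

tools = [
    {"name": "weather", "keywords": ["weather", "forecast"], "params": ["q"],
     "fn": lambda q: "sunny"},
    {"name": "email", "keywords": ["email", "send"], "params": ["q"],
     "fn": lambda q: "sent"},
]
print(route_only("what is the weather tomorrow", tools))  # weather
print(full_pipeline("send an email to Bob", tools))       # email -> sent
```

The design point is that routing alone is cheap enough to run without a GPU, so the heavier extraction and execution stages can be deferred or offloaded.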


Editorial Opinion

ContextCache tackles a genuinely wasteful aspect of current LLM architectures that most developers have simply accepted as inevitable overhead. The 29x speedup isn't from algorithmic cleverness or model compression—it's from recognizing that reprocessing static tool definitions thousands of times per day is fundamentally unnecessary. What makes this particularly valuable is the dual-mode approach: organizations can start with the CPU-only router for immediate benefits, then graduate to the full GPU-accelerated pipeline as needs grow, making advanced LLM optimization accessible beyond well-resourced ML teams.

Large Language Models (LLMs) · AI Agents · Machine Learning · MLOps & Infrastructure · Open Source

