BotBeat

Independent/Open Source · OPEN SOURCE · 2026-03-04

ContextCache Open-Source Tool Achieves 29x Speedup in LLM Tool Calling by Caching KV States

Key Takeaways

  • ContextCache achieves up to a 29.2x speedup in LLM tool calling by caching tool-schema KV states, reducing TTFT from 5.6 seconds to 193ms for 50 tools
  • The system skips 99% of prefill tokens while maintaining identical quality (TSA 0.850), potentially saving 62 million tokens daily at moderate scale
  • Includes a CPU-only orchestrator option using llama.cpp that routes queries in ~550ms without requiring GPU resources
Source: Hacker News — https://github.com/spranab/contextcache

Summary

Developer spranab has released ContextCache, an open-source middleware solution that dramatically accelerates tool-calling performance in Large Language Models by caching key-value (KV) states of tool schemas. The system addresses a fundamental inefficiency in current LLM architectures: every tool-calling request traditionally resends complete tool schemas through the prefill stage, reprocessing thousands of tokens even when tools remain unchanged. With 50 tools, this amounts to approximately 6,000 tokens being reprocessed on every single request for every user.
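The core idea can be sketched in a few lines. The toy cache below is a hypothetical illustration, not ContextCache's actual API: the expensive prefill pass over the tool-schema prefix is simulated by token counting, the compiled state is keyed on a hash of the schema text, and only the user query is processed on subsequent requests.

```python
import hashlib

# Minimal sketch of prefix KV caching (illustrative names, not the
# real ContextCache interface). The tool-schema prefix is "compiled"
# once and reused; only the user query hits prefill afterwards.

class PrefixCache:
    def __init__(self):
        self._cache = {}          # schema hash -> compiled prefix state
        self.prefill_tokens = 0   # tokens actually run through prefill

    def _compile(self, schema_text):
        # Stand-in for the real prefill pass that builds KV states.
        tokens = schema_text.split()
        self.prefill_tokens += len(tokens)
        return tokens

    def run(self, schema_text, user_query):
        key = hashlib.sha256(schema_text.encode()).hexdigest()
        if key not in self._cache:        # cold start: pay full prefill once
            self._cache[key] = self._compile(schema_text)
        query_tokens = user_query.split() # only the query is prefilled now
        self.prefill_tokens += len(query_tokens)
        return self._cache[key] + query_tokens

schemas = " ".join(f"tool_{i} desc arg" for i in range(50))  # 150 tokens here
cache = PrefixCache()
for q in ["check weather", "send email", "book flight"]:
    cache.run(schemas, q)
# Without caching: 3 * (150 + 2) = 456 prefill tokens; with it: 150 + 6.
print(cache.prefill_tokens)
```

The same asymmetry drives the real numbers: the schema prefix dominates the request, so amortizing it once leaves only the short query on the hot path.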

ContextCache compiles tool schemas into a KV cache once and reuses it across all requests, allowing only user queries to pass through prefill. Testing on Qwen3-8B running on an RTX 3090 Ti demonstrated impressive results: with 50 tools, time-to-first-token (TTFT) dropped from 5,625ms to 193ms—a 29.2x speedup—while skipping 99% of prefill tokens with zero quality degradation. The system maintains a Tool Schema Accuracy (TSA) of 0.850, matching full prefill performance exactly. At scale, this translates to saving 62.1 million tokens per day at 10,000 requests.
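A back-of-the-envelope check using the rounded figures above lands close to the reported numbers (the small gap versus the headline 29.2x and 62.1M presumably comes from unrounded measurements):

```python
# Rounded figures from the benchmark writeup.
ttft_full_ms, ttft_cached_ms = 5625, 193
speedup = ttft_full_ms / ttft_cached_ms   # ~29.1x from rounded inputs

tokens_per_request = 6000                 # ~schema tokens resent for 50 tools
requests_per_day = 10_000
skip_rate = 0.99                          # fraction of prefill tokens skipped
tokens_saved = tokens_per_request * requests_per_day * skip_rate

print(f"{speedup:.1f}x speedup, {tokens_saved / 1e6:.1f}M tokens/day saved")
```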

The project includes two deployment options: a route-only mode (~500ms) for tool detection without GPU requirements, and a full pipeline (~3 seconds) that handles routing, parameter extraction, execution, and response synthesis. The route-only orchestrator uses llama.cpp with Qwen3.5-2B on CPU, making it accessible for resource-constrained environments. ContextCache is compatible with any LLM backend, including Ollama, Claude, OpenAI, xAI, DeepSeek, Groq, and self-hosted solutions. The project is released under the CC BY 4.0 license with an accompanying research paper and is available on GitHub.
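The split between the two modes can be illustrated with a toy dispatcher. Everything here is a hypothetical stand-in (including the keyword-scoring router, which substitutes for the small llama.cpp model), not ContextCache's real interface:

```python
# Illustrative sketch of the two deployment modes; names are hypothetical.

def route_only(query, tools):
    # Cheap CPU pass: pick the most relevant tool, nothing more.
    # (The article uses llama.cpp with a small model for this step;
    # keyword scoring stands in for it here.)
    scores = {t["name"]: sum(w in query for w in t["keywords"]) for t in tools}
    return max(scores, key=scores.get)

def full_pipeline(query, tools):
    # Route, then extract parameters, execute, and synthesize a response.
    name = route_only(query, tools)
    tool = next(t for t in tools if t["name"] == name)
    args = {k: query for k in tool.get("params", [])}  # stub extraction
    result = tool["fn"](**args)
    return f"{name} -> {result}"

tools = [
    {"name": "weather", "keywords": ["weather", "forecast"], "params": ["q"],
     "fn": lambda q: "sunny"},
    {"name": "email", "keywords": ["email", "send"], "params": ["q"],
     "fn": lambda q: "sent"},
]
print(route_only("what is the weather tomorrow", tools))  # weather
print(full_pipeline("send an email to Bob", tools))       # email -> sent
```

The design point is that routing alone is cheap enough to run without a GPU, so the heavier extraction and execution stages can be deferred or offloaded.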


Editorial Opinion

ContextCache tackles a genuinely wasteful aspect of current LLM architectures that most developers have simply accepted as inevitable overhead. The 29x speedup isn't from algorithmic cleverness or model compression—it's from recognizing that reprocessing static tool definitions thousands of times per day is fundamentally unnecessary. What makes this particularly valuable is the dual-mode approach: organizations can start with the CPU-only router for immediate benefits, then graduate to the full GPU-accelerated pipeline as needs grow, making advanced LLM optimization accessible beyond well-resourced ML teams.

Large Language Models (LLMs) · AI Agents · Machine Learning · MLOps & Infrastructure · Open Source

