GitHub Copilot Cuts Token Costs with Advanced Caching and Deferred Tool Loading
Key Takeaways
- ▸Prompt caching cuts the cost of cached input tokens by up to 10×, and tool search defers expensive schema definitions until a tool is actually needed
- ▸Usage-based billing makes token efficiency critical—each token saved directly reduces customer costs and extends available context for longer agentic sessions
- ▸Improvements validated through production A/B testing, maintaining or improving task success rates while reducing token consumption
Summary
GitHub has announced improvements to token efficiency in GitHub Copilot's agentic harness. The work is driven by the platform's shift to usage-based billing, under which every token directly affects both customer costs and agent capability. It targets two challenges. The first is prompt caching, which reuses expensive model state computations across turns. The second is tool-definition overhead, since agents must carry definitions for potentially hundreds of available tools. To reduce that overhead, GitHub introduced a tool search mechanism that defers loading parameter schemas until they are needed, which keeps the reusable prompt prefix leaner and more cache-friendly. GitHub also extended OpenAI's prompt caching window so that cached model state is retained longer across sessions.
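The deferred-loading idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GitHub's implementation: the `Tool` and `ToolRegistry` classes, the `tool_search` function, and its keyword-matching strategy are all hypothetical. The cacheable prompt prefix carries only each tool's name and one-line description. Full parameter schemas are returned only when the agent searches for a tool.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    description: str
    parameters: dict  # full JSON Schema: the expensive part worth deferring


@dataclass
class ToolRegistry:
    """Hypothetical registry for illustration; not a GitHub Copilot API."""

    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def prefix_stubs(self) -> str:
        """Compact tool list for the cacheable prompt prefix.

        Only names and one-line descriptions are emitted. They are sorted by name
        so the text is identical on every turn and the prefix stays cache-friendly.
        """
        ordered = sorted(self.tools.values(), key=lambda t: t.name)
        return "\n".join(f"- {t.name}: {t.description}" for t in ordered)

    def tool_search(self, query: str, limit: int = 3) -> list:
        """Return full definitions, schemas included, for tools matching the query.

        Naive keyword overlap stands in for whatever ranking a real system uses.
        """
        words = set(query.lower().split())
        scored = []
        for t in self.tools.values():
            text = f"{t.name} {t.description}".lower().replace("_", " ")
            score = sum(1 for w in words if w in text)
            if score > 0:
                scored.append((score, t.name, t))
        # Highest score first; ties broken by name for deterministic output.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for _, _, t in scored[:limit]
        ]


if __name__ == "__main__":
    registry = ToolRegistry()
    registry.register(Tool(
        "read_file", "Read a file from the workspace",
        {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]},
    ))
    registry.register(Tool(
        "run_tests", "Run the project's test suite",
        {"type": "object", "properties": {"pattern": {"type": "string"}}},
    ))
    print(registry.prefix_stubs())  # goes into the stable, cached prefix
    print(json.dumps(registry.tool_search("run tests"), indent=2))  # loaded on demand
```

Sorting the stubs matters as much as shrinking them. Provider prompt caches generally match on an exact prefix, so reordering the tool list between turns would invalidate the cached state even if no tool changed.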
The optimizations apply to both the OpenAI and Anthropic models powering Copilot. Production A/B tests and offline task suites confirm that token usage drops while task success rates hold or improve. Rather than chasing a single breakthrough, GitHub's approach reflects continuous harness-level tuning, a necessary counter to the trend of newer model generations consuming more tokens per task. As agents take on longer and more autonomous coding tasks, these efficiency gains translate directly into lower latency and more free context window for complex work.
- Optimizations span both OpenAI and Anthropic models. They include extended prompt caching windows and persistent WebSocket connections that avoid the overhead of repeated HTTP requests
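A back-of-the-envelope model shows why caching and a leaner prefix compound. This sketch assumes cached input tokens are billed at one tenth of the uncached rate, matching the "up to 10×" figure. Real discounts, cache lifetimes, and cache-write surcharges vary by provider and model. The function names and all token counts are hypothetical placeholders, and costs are expressed in units of one uncached input token.

```python
def turn_input_units(prefix_tokens: int, new_tokens: int,
                     cache_hit: bool, cached_multiplier: float = 0.1) -> float:
    """Billed input for one turn, in units of one uncached input token.

    prefix_tokens: reusable prompt prefix (system prompt, tool list) that can be cached
    new_tokens: tokens added this turn, always billed at the full rate
    cached_multiplier: price of a cached token relative to an uncached one;
        0.1 models the "up to 10x" discount and is an assumption, not a quoted price
    """
    prefix_rate = cached_multiplier if cache_hit else 1.0
    return prefix_tokens * prefix_rate + new_tokens


def session_input_units(prefix_tokens: int, new_tokens_per_turn: int, turns: int,
                        cached_multiplier: float = 0.1) -> float:
    """Total billed input over a session with a fixed prefix.

    Simplifications: the first turn is a cache miss and every later turn hits,
    the prefix does not grow, and any cache-write surcharge is ignored.
    """
    return sum(
        turn_input_units(prefix_tokens, new_tokens_per_turn,
                         cache_hit=turn > 0, cached_multiplier=cached_multiplier)
        for turn in range(turns)
    )


if __name__ == "__main__":
    # Placeholder sizes: a prefix with every tool schema inlined vs. name-only stubs.
    for label, prefix in [("full schemas", 40_000), ("deferred schemas", 8_000)]:
        cached = session_input_units(prefix, 2_000, turns=10)
        uncached = session_input_units(prefix, 2_000, turns=10, cached_multiplier=1.0)
        print(f"{label}: {cached:,.0f} units with caching, {uncached:,.0f} without")
```

With these made-up numbers, the script prints 96,000 vs. 420,000 units for the full-schema prefix and 35,200 vs. 100,000 for the deferred one. Caching and a smaller prefix each help, and together they reduce billed input by roughly 12× relative to an uncached, full-schema prompt.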


