Nyquest v3.1.1: Open-Source Token Compression Proxy Achieves 15–75% LLM Cost Savings
Key Takeaways
- Nyquest v3.1.1 achieves 15–75% token reduction through 350+ rules plus semantic LLM compression, enabling significant cost savings across major LLM providers
- The semantic stage compresses system prompts by 56% and conversation history by 75%, with total latency of 200–350ms on GPU hardware
- Production benchmarks show 1,408 req/s throughput, minimal memory footprint (71.4 MB), and compatibility with OpenAI-compatible endpoints via SSE streaming
Summary
Version 3.1.1 of Nyquest, an open-source token compression proxy written in Rust, brings significant improvements to the tool's semantic compression capabilities. Nyquest reduces LLM token usage by 15–75% through a combination of 350+ compiled regex rules and local semantic condensation using Qwen 2.5 1.5B, while preserving semantic meaning. The proxy operates as drop-in middleware compatible with major LLM providers, including Anthropic, OpenAI, Gemini, xAI, and OpenRouter.
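To illustrate the idea behind the rule-based stage, here is a minimal Python sketch. Nyquest's actual engine is 350+ compiled regexes in Rust and its rule set is not reproduced in this article; the three rules below are hypothetical stand-ins that only demonstrate the general technique of shrinking prompts without changing their meaning.

```python
import re

# Hypothetical rules illustrating rule-based prompt compression.
# Each pair is (precompiled pattern, replacement); the real rule set differs.
RULES = [
    (re.compile(r"[ \t]+"), " "),             # collapse runs of spaces/tabs
    (re.compile(r"\n{3,}"), "\n\n"),          # cap consecutive blank lines
    (re.compile(r"\b[Pp]lease (?=\w)"), ""),  # drop politeness filler
]

def compress(prompt: str) -> str:
    """Apply each precompiled rule in order and return the shorter prompt."""
    for pattern, replacement in RULES:
        prompt = pattern.sub(replacement, prompt)
    return prompt.strip()

original = "Please   summarize the following\n\n\n\ntext,   please keep it short."
shrunk = compress(original)
print(shrunk)
print(f"saved {1 - len(shrunk) / len(original):.0%} of characters")
```

Because the rules are pure textual rewrites, this stage adds negligible latency; the semantic LLM stage described above handles the deeper condensation that regexes cannot express.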
The latest version introduces a local semantic LLM stage that compresses system prompts by 56% and conversation history by 75% on top of the existing rule-based engine. Production benchmarks demonstrate 1,408 req/s concurrent throughput with minimal resource overhead (71.4 MB RSS), and a one-shot installer enables deployment across three hardware tiers. Real-world testing shows cost savings ranging from $4.60 to $276 monthly for common LLM models at scale (100M tokens/month), with aggregate compression of 26.9% on natural prompts and up to 76.1% on individual requests.
- Three-tier hardware support spans rule-only compression on the lowest tier through GPU-accelerated semantic compression on the highest, making the tool accessible across different infrastructure profiles
- At scale (100M tokens/month), monthly savings reach $276 for Claude Opus and $80.70 for Grok 3 at compression level 1.0
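The savings figures above follow from simple arithmetic: tokens removed by compression multiplied by the per-token price. The sketch below uses the article's 100M tokens/month volume and 26.9% aggregate compression rate; the $10 per million tokens price is a hypothetical placeholder, not a quoted rate for any specific model.

```python
# Back-of-the-envelope model for monthly savings from token compression.
def monthly_savings(tokens_per_month: float,
                    compression_rate: float,
                    usd_per_million_tokens: float) -> float:
    """Dollars saved = tokens removed by compression * per-token price."""
    tokens_saved = tokens_per_month * compression_rate
    return tokens_saved / 1_000_000 * usd_per_million_tokens

# 100M tokens/month and 26.9% aggregate compression (from the article),
# at an assumed $10 per 1M tokens.
saved = monthly_savings(100_000_000, 0.269, 10.0)
print(f"${saved:.2f}/month")  # $269.00/month
```

Since savings scale linearly with both volume and price, the larger per-model figures (e.g. $276/month for Claude Opus) simply reflect higher per-token pricing at the same traffic level.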
Editorial Opinion
Nyquest addresses a critical pain point in LLM cost management—the tension between token efficiency and semantic fidelity. By combining lightweight rule-based compression with local semantic refinement via a small language model, the project demonstrates a pragmatic approach to reducing API costs without relying on external dependencies or sacrificing quality. The impressive production metrics (1.4k req/s, minimal memory overhead) and broad provider compatibility make this a potentially valuable tool for organizations with significant LLM token consumption.