Silent-Bench Exposes Critical Silent Failures in LLM API Gateways: 47.96% Error Rates vs. 1.89% on Direct APIs
Key Takeaways
- Commercial LLM API gateways show error rates up to 46 percentage points higher than direct API calls, indicating silent failures in the routing layer rather than in the upstream models
- Silent-Bench provides cryptographically attested forensic auditing using Merkle trees and Ed25519 signatures, allowing independent verification of API behavior without trusting the auditor
- Detected failures include response format errors (47.96% vs. 1.89%), token-billing inflation (~55%), and silent behavior changes across model deployments
Summary
A new cryptographically audited research framework called Silent-Bench has revealed that commercial LLM API gateways produce silent failures (requests that appear successful at the HTTP layer but return semantically broken content) at rates dramatically higher than direct API calls to the upstream models. In a case study of one unnamed gateway (Proxy-A), the error rate reached 47.96% for certain parameter configurations on one model, compared with just 1.89% when the identical request was sent directly to the upstream provider's API, a gap of roughly 46 percentage points.
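The distinction behind these numbers, transport-level success versus semantic success, can be sketched with a minimal validator. The response schema here (valid JSON carrying a non-empty "content" field) is an illustrative assumption, not the benchmark's actual check:

```python
import json

def is_silent_failure(status_code: int, body: str) -> bool:
    """Flag transport-level success paired with semantically broken content.

    A response counts as a silent failure when HTTP reports 200 but the
    payload is not the well-formed completion the client asked for
    (here: valid JSON with a non-empty "content" field -- an assumed schema).
    """
    if status_code != 200:
        return False  # a loud failure, not a silent one
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True   # e.g. truncated or corrupted JSON
    return not payload.get("content")

# A 200 carrying truncated JSON slips past HTTP-level monitoring:
print(is_silent_failure(200, '{"content": "Hello'))    # True
print(is_silent_failure(200, '{"content": "Hello"}'))  # False
print(is_silent_failure(500, ""))                      # False
```

HTTP-level dashboards only see the status code, which is why failures of this shape stay invisible until the payload itself is validated.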
The research, conducted by independent researcher Wesam H. Al-Sabban and published with cryptographic attestation, introduces a methodology for detecting such failures and verifying them in a way that withstands vendor pushback. The framework combines parameter-space sweeps, invisibility scans for hidden behaviors such as token-billing inflation, and Merkle-tree hashing with Ed25519 signatures, so any third party can verify the findings without trusting the auditor. Case studies document failures in gateway routing layers, token-billing inflation of roughly 55% in one deployment, and cross-model effect-isolation techniques.
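The attestation scheme as described (Merkle-tree hashing plus Ed25519 signatures) can be sketched in a few lines of Python using hashlib and the third-party `cryptography` package. The transcript format and the pairing/duplication rules below are illustrative assumptions, not Silent-Bench's actual implementation:

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash leaves pairwise, level by level, until a single root remains."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# The auditor signs one root computed over the raw request/response transcripts.
transcripts = [b'{"req": 1, "status": 200}', b'{"req": 2, "status": 200}']
signing_key = Ed25519PrivateKey.generate()
root = merkle_root(transcripts)
signature = signing_key.sign(root)

# Any third party holding the public key and the transcripts can re-derive
# the root and check the signature without trusting the auditor; changing
# a single transcript byte changes the root and breaks verification.
signing_key.public_key().verify(signature, merkle_root(transcripts))
```

One signature thus commits to the entire transcript set, which is what makes after-the-fact tampering by any party, auditor included, detectable.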
The author has committed to a 90-day coordinated disclosure window, with vendor names anonymized until August 10, 2026. The framework and source code will be released publicly on GitHub by May 26, 2026, under the Apache-2.0 license, with full reproduction commands and verification protocols. The research also documents methodological learnings, including the "small-sample artifact pattern," where effect sizes estimated on fewer than 10 samples per condition are systematically inflated.
Editorial Opinion
Silent-Bench addresses a critical blind spot in LLM deployment infrastructure: the assumption that HTTP-level success guarantees semantic correctness. By combining causal ablation, cryptographic proof, and methodological rigor (including documented retractions), Al-Sabban sets a high standard for infrastructure auditing in an era of API-dependent AI systems. This research is essential reading for any organization routing production traffic through third-party LLM gateways, and once the framework is public, inviting such independent audits should become a baseline expectation for gateway providers.


