Silent-Bench Exposes Critical Silent Failures in LLM API Gateways: 47.96% Error Rates vs. 1.89% on Direct APIs
Key Takeaways
- Commercial LLM API gateways show error rates up to 46 percentage points higher than direct API calls, indicating silent failures in the routing layer rather than in the upstream models
- Silent-Bench provides cryptographically attested forensic auditing using Merkle trees and Ed25519 signatures, allowing independent verification of API behavior without trusting the auditor
- Detected failures include response format errors (47.96% vs. 1.89%), token-billing inflation (~55%), and silent behavior changes across model deployments
Summary
A new cryptographically audited research framework called Silent-Bench has revealed that commercial LLM API gateways produce silent failures (requests that appear successful at the HTTP layer but return semantically broken content) at rates dramatically higher than direct API calls to the upstream models. In a case study of one unnamed gateway (Proxy-A), the error rate reached 47.96% for certain parameter configurations on one model, compared with just 1.89% when the identical request was sent directly to the upstream provider's API, a gap of roughly 46 percentage points.
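The distinction behind these numbers, transport-level success versus semantic success, can be sketched with a minimal validator. The response schema here (valid JSON carrying a non-empty "content" field) is an illustrative assumption, not the benchmark's actual check:

```python
import json

def is_silent_failure(status_code: int, body: str) -> bool:
    """Flag transport-level success paired with semantically broken content.

    A response counts as a silent failure when HTTP reports 200 but the
    payload is not the well-formed completion the client asked for
    (here: valid JSON with a non-empty "content" field -- an assumed schema).
    """
    if status_code != 200:
        return False  # a loud failure, not a silent one
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True   # e.g. truncated or corrupted JSON
    return not payload.get("content")

# A 200 carrying truncated JSON slips past HTTP-level monitoring:
print(is_silent_failure(200, '{"content": "Hello'))    # True
print(is_silent_failure(200, '{"content": "Hello"}'))  # False
print(is_silent_failure(500, ""))                      # False
```

HTTP-level dashboards only see the status code, which is why failures of this shape stay invisible until the payload itself is validated.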
The research, conducted by independent researcher Wesam H. Al-Sabban and published with cryptographic attestation, introduces a methodology for detecting such failures and verifying them in a way that withstands vendor pushback. The framework combines parameter-space sweeps, invisibility scans for hidden behaviors such as token-billing inflation, and Merkle-tree hashing with Ed25519 signatures, so any third party can verify the findings without trusting the auditor. Case studies document failures in gateway routing layers, token-billing inflation of roughly 55% in one deployment, and cross-model effect-isolation techniques.
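The attestation scheme as described (Merkle-tree hashing plus Ed25519 signatures) can be sketched in a few lines of Python using hashlib and the third-party `cryptography` package. The transcript format and the pairing/duplication rules below are illustrative assumptions, not Silent-Bench's actual implementation:

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash leaves pairwise, level by level, until a single root remains."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# The auditor signs one root computed over the raw request/response transcripts.
transcripts = [b'{"req": 1, "status": 200}', b'{"req": 2, "status": 200}']
signing_key = Ed25519PrivateKey.generate()
root = merkle_root(transcripts)
signature = signing_key.sign(root)

# Any third party holding the public key and the transcripts can re-derive
# the root and check the signature without trusting the auditor; changing
# a single transcript byte changes the root and breaks verification.
signing_key.public_key().verify(signature, merkle_root(transcripts))
```

One signature thus commits to the entire transcript set, which is what makes after-the-fact tampering by any party, auditor included, detectable.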
The author has committed to a 90-day coordinated disclosure window, with vendor names anonymized until August 10, 2026. The framework and source code will be released publicly on GitHub by May 26, 2026, under the Apache-2.0 license, with full reproduction commands and verification protocols. The research also documents methodological learnings, including the "small-sample artifact pattern," where effect sizes estimated on fewer than 10 samples per condition are systematically inflated.
Editorial Opinion
Silent-Bench addresses a critical blind spot in LLM deployment infrastructure: the assumption that HTTP-level success guarantees semantic correctness. By combining causal ablation, cryptographic proof, and methodological rigor (including documented retractions), Al-Sabban sets a high standard for infrastructure auditing in an era of API-dependent AI systems. This research is essential reading for any organization routing production traffic through third-party LLM gateways, and once the framework is public, inviting such independent audits should become a baseline expectation for gateway providers.


