BotBeat

ON1
PRODUCT LAUNCH · 2026-05-29

ON1 Launches G116 V8: Revolutionary Virtual Chip ISA Achieves 38μs AI Memory Retrieval

Key Takeaways

  • G116 V8 introduces latency-separated tiers (Fetch/Compute/Search) that expose bottlenecks usually hidden inside a single query latency, a gap that matters for real-time LLM inference
  • Reports latencies of 0.1–2.0 μs across the Fetch and Compute tiers, with the per-tier breakdown showing developers where optimization effort should go
  • A public test endpoint is live, so the latency claims can be checked independently
Source: Hacker News, https://github.com/ON1-Hao/ON1

Summary

ON1 has announced G116 V8, a quantum-inspired virtual chip ISA aimed at AI memory retrieval for large language models and real-time retrieval-augmented generation (RAG) systems. Where conventional vector databases report a single opaque query latency, G116 V8 splits vector retrieval into three separately measured tiers: Fetch (0.1–0.5 μs), Compute (0.4–2.0 μs), and Search (3–10 ms). The breakdown lets developers see which stage of an AI inference pipeline dominates and optimize that stage directly.
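To put the quoted ranges in proportion, here is a minimal arithmetic sketch. It assumes the three tiers run one after another and that their latencies simply add, which the announcement does not state; the `TIERS` table and `search_share` helper are illustrative names, not part of ON1's code.

```python
# Tier latency ranges in seconds, copied from the figures quoted above.
TIERS = {
    "fetch": (0.1e-6, 0.5e-6),
    "compute": (0.4e-6, 2.0e-6),
    "search": (3e-3, 10e-3),
}


def search_share(fetch_s, compute_s, search_s):
    """Fraction of end-to-end latency spent in the Search tier.

    Assumption: the tiers are sequential, so total latency is their sum.
    """
    total = fetch_s + compute_s + search_s
    return search_s / total


# Least favourable case for Search's share:
# slowest Fetch and Compute, fastest Search.
lowest_share = search_share(
    TIERS["fetch"][1], TIERS["compute"][1], TIERS["search"][0]
)
print(f"Search share of total latency: at least {lowest_share:.2%}")
```

Under that assumption, Search accounts for more than 99.9% of the total even in the case least favourable to it, so tuning Fetch or Compute would barely move end-to-end latency.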

The system combines mmap-based zero-copy memory mapping, NumPy/BLAS vector transformations, and brute-force (exhaustive) nearest-neighbor search, with FAISS and HNSW indexing planned for future releases. Built for real-time LLM grounding and compatible with llama.cpp, G116 V8 exposes per-tier latency that, according to ON1, systems such as FAISS, Milvus, and pgvector do not report. This addresses a gap in production AI systems, where memory and compute bottlenecks are typically hidden inside a single query time.
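The following sketch shows what a tiered measurement of this kind might look like using only the building blocks named above (memory mapping, NumPy, exhaustive search). It is not ON1's implementation: the function names, the file layout, the corpus size, and the way the work is split into "fetch", "compute", and "search" are all assumptions made for illustration, and timings taken from Python will not match the figures ON1 reports.

```python
import os
import tempfile
import time

import numpy as np

DIM = 64            # hypothetical embedding dimension
N_VECTORS = 10_000  # hypothetical corpus size


def build_store(path, n=N_VECTORS, dim=DIM, seed=0):
    """Write n unit-length float32 vectors to a flat binary file."""
    rng = np.random.default_rng(seed)
    vectors = rng.standard_normal((n, dim)).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors.tofile(path)
    return vectors


def timed_query(path, query, n=N_VECTORS, dim=DIM, k=5):
    """Return the top-k ids and a per-tier timing dict (seconds)."""
    timings = {}

    # Fetch: memory-map the file; no bulk copy is made, pages load lazily.
    t0 = time.perf_counter()
    store = np.memmap(path, dtype=np.float32, mode="r", shape=(n, dim))
    timings["fetch"] = time.perf_counter() - t0

    # Compute: transform the query (here, unit-normalise it).
    t0 = time.perf_counter()
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    timings["compute"] = time.perf_counter() - t0

    # Search: exhaustive cosine similarity against every stored vector.
    t0 = time.perf_counter()
    scores = store @ q
    top = np.argsort(scores)[::-1][:k]
    timings["search"] = time.perf_counter() - t0

    return top, timings


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.f32")
    vectors = build_store(path)
    top, timings = timed_query(path, vectors[42])

print("nearest ids:", top.tolist())
for tier, seconds in timings.items():
    print(f"{tier:>7}: {seconds * 1e6:.1f} us")
```

Because the mapping is lazy, the cost of actually reading pages from disk shows up in the search step of this sketch rather than in the fetch step. How a system draws those boundaries affects the per-tier numbers it reports.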

ON1 has made the technology immediately accessible via a live public test endpoint, allowing developers to verify the latency decomposition in real-world scenarios. The roadmap includes GPU acceleration and advanced indexing to further optimize the Search tier, positioning G116 V8 as infrastructure for the next generation of latency-critical AI applications.

  • GPU acceleration and FAISS/HNSW indexing on the roadmap to address the Search-layer bottleneck (currently 3–10 ms on CPU)

Editorial Opinion

G116 V8 tackles a real problem in production AI systems: the black-box latency of vector retrieval. While the 'quantum-inspired' framing is largely marketing, the core idea of transparent latency decomposition is genuinely valuable for engineers optimizing LLM pipelines. The challenge: the headline 38μs figure and the microsecond-scale Fetch/Compute tiers are impressive, but the 3–10 ms Search tier is at least three orders of magnitude slower and dominates end-to-end latency. If ON1 delivers on its GPU acceleration and indexing promises, this could become essential infrastructure for real-time AI systems. Worth monitoring.

Machine Learning · MLOps & Infrastructure · AI Hardware
