ON1 Launches G116 V8: Revolutionary Virtual Chip ISA Achieves 38 μs AI Memory Retrieval
Key Takeaways
- G116 V8 introduces latency-separated tiers (Fetch/Compute/Search) that expose previously hidden bottlenecks in AI memory retrieval, a critical gap for real-time LLM inference
- Reports microsecond-scale latency on the Fetch and Compute tiers (0.1–2.0 μs combined range), with a per-tier breakdown that shows developers where optimization effort should go
- A public test endpoint is live, so developers can check the reported latency decomposition for themselves
Summary
ON1 has announced G116 V8, a quantum-inspired virtual chip ISA designed to speed up AI memory retrieval for large language models and real-time retrieval-augmented generation (RAG) systems. Unlike conventional vector databases, which report a single opaque query latency, G116 V8 decomposes vector retrieval into three observable tiers: Fetch (0.1–0.5 μs), Compute (0.4–2.0 μs), and Search (3–10 ms). The breakdown lets developers see which stage dominates their AI inference pipeline and optimize it directly.
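To put the quoted tier ranges in proportion, the short Python calculation below estimates how much of a request is spent in the Search tier. It is a back-of-the-envelope sketch that assumes the three tiers run one after another and that their latencies simply add; the announcement does not say this, and the function name `search_share` is illustrative only.

```python
# Tier latency ranges quoted in the announcement, in microseconds.
TIERS_US = {
    "fetch": (0.1, 0.5),
    "compute": (0.4, 2.0),
    "search": (3_000.0, 10_000.0),  # 3-10 ms expressed in microseconds
}


def search_share(fetch_us: float, compute_us: float, search_us: float) -> float:
    """Fraction of total latency spent in Search, assuming the tiers add up."""
    total = fetch_us + compute_us + search_us
    return search_us / total


# Least favourable case for Search dominance:
# slowest quoted Fetch and Compute, fastest quoted Search.
worst_case = search_share(
    TIERS_US["fetch"][1], TIERS_US["compute"][1], TIERS_US["search"][0]
)
print(f"Search share of total latency: {worst_case:.2%}")
```

Under that additive assumption, Search accounts for more than 99.9% of the total even in the case least favourable to it.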
The system uses mmap-based zero-copy memory mapping, NumPy/BLAS vector transformations, and brute-force nearest-neighbor search, with FAISS and HNSW indexing planned for future releases. Built for real-time LLM grounding with llama.cpp compatibility, G116 V8 offers per-tier latency visibility that, according to ON1, systems such as FAISS, Milvus, and pgvector do not provide. This decomposition addresses a gap in production AI systems, where memory and compute bottlenecks are typically hidden inside a single query time.
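The general idea of timing the three stages separately can be sketched in a few lines of Python. This is not ON1's implementation or API: the function names, vector sizes, and file layout are assumptions made for illustration, and Python call overhead means the timings will not match the quoted figures. The sketch memory-maps a file of vectors with `numpy.memmap`, normalizes a query with NumPy, and runs an exact brute-force search.

```python
import os
import tempfile
import time

import numpy as np

DIM, N = 64, 10_000  # illustrative sizes, not taken from the announcement


def build_store(path, n=N, dim=DIM, seed=0):
    """Write n unit-length float32 vectors to a flat binary file."""
    rng = np.random.default_rng(seed)
    vecs = rng.standard_normal((n, dim)).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    vecs.tofile(path)


def timed(fn):
    """Run fn() and return (result, elapsed microseconds)."""
    t0 = time.perf_counter_ns()
    out = fn()
    return out, (time.perf_counter_ns() - t0) / 1_000.0


def retrieve(path, query_id, k=5, n=N, dim=DIM):
    """Return the top-k neighbour ids of a stored vector plus per-stage timings."""
    store = np.memmap(path, dtype=np.float32, mode="r", shape=(n, dim))

    # Fetch: take a view into the mapped file. Pages are loaded lazily,
    # so part of the I/O cost may show up in the later stages.
    q, fetch_us = timed(lambda: store[query_id])

    # Compute: normalize the query vector.
    qn, compute_us = timed(lambda: q / np.linalg.norm(q))

    # Search: exact brute-force scan, one dot product per stored vector.
    def search():
        scores = store @ qn
        top = np.argpartition(scores, -k)[-k:]
        return top[np.argsort(scores[top])[::-1]]  # best match first

    ids, search_us = timed(search)
    return ids, {"fetch_us": fetch_us, "compute_us": compute_us, "search_us": search_us}


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vectors.f32")
    build_store(demo_path)
    demo_ids, demo_timings = retrieve(demo_path, query_id=42)
    print(demo_ids, demo_timings)
```

Because the scan touches every stored vector, the Search stage grows linearly with the size of the store, which is the cost that approximate indexes such as HNSW are designed to reduce.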
ON1 has made the technology immediately accessible via a live public test endpoint, allowing developers to verify the latency decomposition in real-world scenarios. The roadmap includes GPU acceleration and advanced indexing to further optimize the Search tier, positioning G116 V8 as infrastructure for the next generation of latency-critical AI applications.
- GPU acceleration and FAISS/HNSW indexing on the roadmap to address the Search-layer bottleneck (currently 3–10 ms on CPU)
Editorial Opinion
G116 V8 tackles a real problem in production AI systems: the black-box latency of vector retrieval. While the 'quantum-inspired' framing is largely marketing, the core idea of transparent latency decomposition is genuinely valuable for engineers optimizing LLM pipelines. The challenge: the headline 38 μs figure and the 0.1–2.0 μs Fetch and Compute latencies are impressive, but the 3–10 ms Search tier is at least a thousand times slower and dominates end-to-end latency. If ON1 delivers on its GPU acceleration and indexing promises, this could become essential infrastructure for real-time AI systems. Worth monitoring.


