BotBeat

Una
PRODUCT LAUNCH · 2026-03-10

RunAnywhere Launches MetalRT, Achieving 1.67x Faster LLM Inference on Apple Silicon Than llama.cpp

Key Takeaways

  • MetalRT achieves 1.67x faster LLM decoding than llama.cpp and 1.19x faster than Apple MLX by eliminating framework overhead and using custom Metal GPU shaders with ahead-of-time compilation
  • RCLI, an open-source voice pipeline, delivers sub-200ms end-to-end latency for complete STT+LLM+TTS voice AI applications, enabling responsive on-device voice interfaces without cloud APIs
  • The proprietary inference engine addresses the latency compounding problem in multimodal pipelines by optimizing all three modalities natively on a single GPU, directly targeting infrastructure gaps for shipping on-device AI products
Source: Hacker News (https://github.com/RunanywhereAI/rcli)

Summary

RunAnywhere, a YC W26 startup founded by Sanchit Monga and Shubham, has unveiled MetalRT, a proprietary GPU inference engine optimized for Apple Silicon that significantly outperforms existing solutions across multiple AI modalities. The engine delivers 1.67x faster LLM decoding than llama.cpp and 1.19x faster than Apple's MLX framework, with benchmarks showing 658 tokens/second for Qwen3-0.6B models and sub-200ms end-to-end voice latency. RunAnywhere has also open-sourced RCLI, an MIT-licensed voice AI pipeline that brings complete speech-to-text, LLM, and text-to-speech capabilities to macOS with no cloud dependencies.
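For scale, the quoted throughput translates into per-token latency with simple arithmetic (the figures below are derived from the 658 tokens/second number above; the 50-token reply length is an illustrative assumption, not from the source):

```python
# Per-token decode latency implied by the quoted throughput figure.
tokens_per_second = 658  # Qwen3-0.6B decode rate quoted in the summary
ms_per_token = 1000 / tokens_per_second
print(f"{ms_per_token:.2f} ms/token")  # → 1.52 ms/token

# Decode time for a hypothetical 50-token spoken reply (illustrative):
reply_ms = 50 * ms_per_token
print(f"{reply_ms:.0f} ms")  # → 76 ms
```

At roughly 1.5 ms per token, LLM decoding leaves most of a 200 ms voice-latency budget for the STT and TTS stages.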

MetalRT's performance gains stem from its hardware-native approach: the engine skips abstraction layers present in other frameworks and uses custom Metal compute shaders compiled ahead of time, with all memory pre-allocated during initialization to eliminate runtime allocations. The technology excels particularly in voice workloads, delivering 4.6x faster speech-to-text (101ms for 70 seconds of audio) and 2.8x faster text-to-speech synthesis than existing alternatives. RCLI gives end users a fully featured macOS assistant including 43 voice-controlled actions, local RAG over documents with ~4ms query latency, and support for 20+ swappable models, all running locally on M1-M4 Apple Silicon chips.

  • RunAnywhere open-sourced RCLI under MIT license while keeping MetalRT proprietary, creating an accessible platform for developers while maintaining commercial differentiation

Editorial Opinion

MetalRT represents an important step forward in making on-device AI genuinely practical for consumer applications. By tackling the unglamorous but critical infrastructure problem of reducing latency in multimodal pipelines, RunAnywhere addresses a real bottleneck that has pushed many projects back to cloud APIs. The open-source release of RCLI with full voice capabilities removes barriers to experimentation and deployment, though the proprietary nature of MetalRT itself creates questions about long-term ecosystem openness and developer lock-in.

Generative AI · Speech & Audio · AI Hardware · Open Source
