RunAnywhere Launches MetalRT, Achieving 1.67x Faster LLM Inference on Apple Silicon Than llama.cpp
Key Takeaways
- MetalRT achieves 1.67x faster LLM decoding than llama.cpp and 1.19x faster than Apple MLX by eliminating framework overhead and using custom Metal GPU shaders with ahead-of-time compilation
- The open-source RCLI voice pipeline delivers sub-200ms end-to-end latency for complete STT + LLM + TTS voice AI applications, enabling responsive on-device voice interfaces without cloud APIs
- The proprietary inference engine solves the latency-compounding problem in multimodal pipelines by optimizing all three modalities natively on a single GPU, directly addressing an infrastructure gap for teams shipping on-device AI products
Summary
RunAnywhere, a YC W26 startup founded by Sanchit Monga and Shubham, has unveiled MetalRT, a proprietary GPU inference engine for Apple Silicon that significantly outperforms existing solutions across multiple AI modalities. The engine decodes LLM output 1.67x faster than llama.cpp and 1.19x faster than Apple's MLX framework, with benchmarks showing 658 tokens/second on the Qwen3-0.6B model and sub-200ms end-to-end voice latency. RunAnywhere has also open-sourced RCLI, an MIT-licensed voice AI pipeline that brings complete speech-to-text, LLM, and text-to-speech capabilities to macOS with no cloud dependencies.
MetalRT's performance gains stem from its hardware-native approach: the engine skips the abstraction layers present in other frameworks and uses custom Metal compute shaders compiled ahead of time, with all memory pre-allocated during initialization to eliminate runtime allocations. The technology excels particularly in voice workloads, delivering 4.6x faster speech-to-text (101ms to transcribe 70 seconds of audio) and 2.8x faster text-to-speech synthesis than comparable alternatives. RCLI gives end users a fully featured macOS assistant with 43 voice-controlled actions, local RAG over documents with ~4ms query latency, and support for 20+ swappable models, all running locally on M1-M4 Apple Silicon chips.
RunAnywhere open-sourced RCLI under the MIT license while keeping MetalRT proprietary, giving developers an accessible platform while preserving its commercial differentiation.
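For readers unfamiliar with how ahead-of-time shader compilation and pre-allocation translate into code, the Swift/Metal sketch below illustrates the general pattern described above. It is not RunAnywhere's implementation: the kernel name `decodeStep`, the buffer layout, and the threadgroup sizes are hypothetical placeholders.

```swift
import Metal

// Illustrative sketch of the two techniques described above: loading a
// precompiled compute pipeline up front, and reserving all GPU buffers at
// initialization so the per-token decode loop performs no allocations.
final class PreallocatedKernel {
    private let queue: MTLCommandQueue
    private let pipeline: MTLComputePipelineState
    private let weights: MTLBuffer
    private let activations: MTLBuffer

    init?(device: MTLDevice, weightBytes: Int, activationBytes: Int) {
        guard let queue = device.makeCommandQueue(),
              // With ahead-of-time compilation the shader ships as a
              // precompiled .metallib, so loading the library does not
              // invoke the shader compiler at runtime.
              let library = device.makeDefaultLibrary(),
              let fn = library.makeFunction(name: "decodeStep"),  // hypothetical kernel
              let pipeline = try? device.makeComputePipelineState(function: fn),
              // All GPU memory is reserved once here; the hot path below
              // never calls makeBuffer again.
              let weights = device.makeBuffer(length: weightBytes,
                                              options: .storageModeShared),
              let activations = device.makeBuffer(length: activationBytes,
                                                  options: .storageModeShared)
        else { return nil }
        self.queue = queue
        self.pipeline = pipeline
        self.weights = weights
        self.activations = activations
    }

    // Hot path: only command encoding and dispatch -- no compilation,
    // no allocation.
    func step(threadgroups: Int) {
        guard let cmd = queue.makeCommandBuffer(),
              let enc = cmd.makeComputeCommandEncoder() else { return }
        enc.setComputePipelineState(pipeline)
        enc.setBuffer(weights, offset: 0, index: 0)
        enc.setBuffer(activations, offset: 0, index: 1)
        enc.dispatchThreadgroups(
            MTLSize(width: threadgroups, height: 1, depth: 1),
            threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
        enc.endEncoding()
        cmd.commit()
    }
}
```

The point of the pattern is that everything expensive (shader compilation, pipeline-state creation, buffer allocation) happens once at startup, so each decode step is reduced to cheap command encoding; this is the standard way Metal applications avoid per-frame or per-token stalls.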
Editorial Opinion
MetalRT represents an important step toward making on-device AI genuinely practical for consumer applications. By tackling the unglamorous but critical infrastructure problem of latency in multimodal pipelines, RunAnywhere addresses a real bottleneck that has pushed many projects back to cloud APIs. The open-source release of RCLI with full voice capabilities lowers the barrier to experimentation and deployment, though the proprietary nature of MetalRT itself raises questions about long-term ecosystem openness and developer lock-in.