RunAnywhere Launches MetalRT, Achieving 1.67x Faster LLM Inference on Apple Silicon Than llama.cpp
Key Takeaways
- MetalRT achieves 1.67x faster LLM decoding than llama.cpp and 1.19x faster than Apple MLX by eliminating framework overhead and using custom Metal GPU shaders with ahead-of-time compilation
- The open-source RCLI voice pipeline delivers sub-200ms end-to-end latency for complete STT + LLM + TTS voice AI applications, enabling responsive on-device voice interfaces without cloud APIs
- The proprietary inference engine solves the latency-compounding problem in multimodal pipelines by optimizing all three modalities natively on a single GPU, directly addressing an infrastructure gap for teams shipping on-device AI products
Summary
RunAnywhere, a YC W26 startup founded by Sanchit Monga and Shubham, has unveiled MetalRT, a proprietary GPU inference engine for Apple Silicon that significantly outperforms existing solutions across multiple AI modalities. The engine decodes LLM output 1.67x faster than llama.cpp and 1.19x faster than Apple's MLX framework, with benchmarks showing 658 tokens/second on the Qwen3-0.6B model and sub-200ms end-to-end voice latency. RunAnywhere has also open-sourced RCLI, an MIT-licensed voice AI pipeline that brings complete speech-to-text, LLM, and text-to-speech capabilities to macOS with no cloud dependencies.
MetalRT's performance gains stem from its hardware-native approach: the engine skips the abstraction layers present in other frameworks and uses custom Metal compute shaders compiled ahead of time, with all memory pre-allocated during initialization to eliminate runtime allocations. The technology excels particularly in voice workloads, delivering 4.6x faster speech-to-text (101ms to transcribe 70 seconds of audio) and 2.8x faster text-to-speech synthesis than comparable alternatives. RCLI gives end users a fully featured macOS assistant with 43 voice-controlled actions, local RAG over documents with ~4ms query latency, and support for 20+ swappable models, all running locally on M1-M4 Apple Silicon chips.
RunAnywhere open-sourced RCLI under the MIT license while keeping MetalRT proprietary, giving developers an accessible platform while preserving its commercial differentiation.
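For readers unfamiliar with how ahead-of-time shader compilation and pre-allocation translate into code, the Swift/Metal sketch below illustrates the general pattern described above. It is not RunAnywhere's implementation: the kernel name `decodeStep`, the buffer layout, and the threadgroup sizes are hypothetical placeholders.

```swift
import Metal

// Illustrative sketch of the two techniques described above: loading a
// precompiled compute pipeline up front, and reserving all GPU buffers at
// initialization so the per-token decode loop performs no allocations.
final class PreallocatedKernel {
    private let queue: MTLCommandQueue
    private let pipeline: MTLComputePipelineState
    private let weights: MTLBuffer
    private let activations: MTLBuffer

    init?(device: MTLDevice, weightBytes: Int, activationBytes: Int) {
        guard let queue = device.makeCommandQueue(),
              // With ahead-of-time compilation the shader ships as a
              // precompiled .metallib, so loading the library does not
              // invoke the shader compiler at runtime.
              let library = device.makeDefaultLibrary(),
              let fn = library.makeFunction(name: "decodeStep"),  // hypothetical kernel
              let pipeline = try? device.makeComputePipelineState(function: fn),
              // All GPU memory is reserved once here; the hot path below
              // never calls makeBuffer again.
              let weights = device.makeBuffer(length: weightBytes,
                                              options: .storageModeShared),
              let activations = device.makeBuffer(length: activationBytes,
                                                  options: .storageModeShared)
        else { return nil }
        self.queue = queue
        self.pipeline = pipeline
        self.weights = weights
        self.activations = activations
    }

    // Hot path: only command encoding and dispatch -- no compilation,
    // no allocation.
    func step(threadgroups: Int) {
        guard let cmd = queue.makeCommandBuffer(),
              let enc = cmd.makeComputeCommandEncoder() else { return }
        enc.setComputePipelineState(pipeline)
        enc.setBuffer(weights, offset: 0, index: 0)
        enc.setBuffer(activations, offset: 0, index: 1)
        enc.dispatchThreadgroups(
            MTLSize(width: threadgroups, height: 1, depth: 1),
            threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
        enc.endEncoding()
        cmd.commit()
    }
}
```

The point of the pattern is that everything expensive (shader compilation, pipeline-state creation, buffer allocation) happens once at startup, so each decode step is reduced to cheap command encoding; this is the standard way Metal applications avoid per-frame or per-token stalls.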
Editorial Opinion
MetalRT represents an important step toward making on-device AI genuinely practical for consumer applications. By tackling the unglamorous but critical infrastructure problem of latency in multimodal pipelines, RunAnywhere addresses a real bottleneck that has pushed many projects back to cloud APIs. The open-source release of RCLI with full voice capabilities lowers the barrier to experimentation and deployment, though the proprietary nature of MetalRT itself raises questions about long-term ecosystem openness and developer lock-in.