BotBeat

Una · PRODUCT LAUNCH · 2026-03-07

RunAnywhere's MetalRT Achieves 658 Tokens/Second on Apple Silicon, Outperforming MLX by 19%

Key Takeaways

  • MetalRT achieved 658 tokens/second decode speed on an Apple M4 Max, outperforming Apple's MLX by 19% and llama.cpp by an average of 1.67x
  • The engine won decode speed benchmarks on 3 of 4 tested models, with time-to-first-token as low as 6.6ms on smaller models
  • MetalRT is optimized for on-device, privacy-first AI applications including chat, coding assistants, agent workflows, and voice pipelines
Source: Hacker News (https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-engine-apple-silicon)

Summary

RunAnywhere has released MetalRT, a new LLM inference engine optimized for Apple Silicon that claims to be the fastest decode engine available for the platform. In comprehensive benchmarks conducted on an M4 Max chip with 64GB of unified memory, MetalRT achieved a peak decode speed of 658 tokens per second on the Qwen3-0.6B model, delivering a 19% performance advantage over Apple's own MLX framework and averaging 1.67x faster performance than llama.cpp across multiple models.
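A quick sanity check of the headline figures: the article reports MetalRT's absolute throughput and its relative advantage over MLX, so the implied MLX throughput on the same model can be derived (the MLX number below is derived from those two figures, not reported directly by the source).

```python
metalrt_tps = 658.0   # reported peak decode speed, Qwen3-0.6B on M4 Max
mlx_advantage = 0.19  # "19% faster than MLX" per the benchmark

# Implied MLX throughput on the same model (derived, not reported):
implied_mlx_tps = metalrt_tps / (1 + mlx_advantage)
print(round(implied_mlx_tps))  # → 553
```

The 1.67x llama.cpp figure is an average across models, so a per-model llama.cpp number cannot be derived the same way.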

The company tested MetalRT against four competing engines (uzu, mlx-lm, llama.cpp, and Ollama) across four language models (Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, and LFM2.5-1.2B), all using 4-bit quantization. MetalRT won on decode speed for three of the four models, with speedups of 1.35-2.14x over llama.cpp and 1.41-2.40x over Ollama. The engine also achieved a 6.6ms time-to-first-token on Qwen3-0.6B, making it well suited to interactive chat and other real-time use cases.
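The two metrics in these benchmarks, time-to-first-token (TTFT) and decode tokens/second, are typically measured around a streaming generation loop. A minimal sketch, assuming a hypothetical `generate` callable that stands in for any engine's streaming API and yields one token at a time:

```python
import time

def measure_decode(generate, prompt, max_tokens=256):
    """Measure TTFT (ms) and steady-state decode speed (tokens/sec).

    `generate` is a hypothetical stand-in for an engine's streaming
    API; it must yield one token at a time for the given prompt.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate(prompt, max_tokens):
        count += 1
        if first is None:
            first = time.perf_counter()  # first token arrived
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000.0
    # Exclude the first token so the rate reflects decode only,
    # not prompt prefill.
    decode_tps = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft_ms, decode_tps
```

Separating prefill from decode this way matters: TTFT is dominated by prompt processing, while tokens/second reflects the sustained decode loop that headline figures like 658 tok/s describe.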

RunAnywhere positions MetalRT as purpose-built for privacy-first, on-device AI applications including chat apps, coding assistants, agent workflows, and voice pipelines. The company emphasizes that the performance gains come from low-level Metal API optimization while maintaining identical output quality to other engines, as the underlying models remain unchanged. By enabling cloud-competitive speeds entirely on-device, MetalRT addresses a growing demand for local AI inference that doesn't compromise on performance.

  • Benchmarks used identical model files where possible, ensuring fair comparisons across engines while maintaining identical output quality

Editorial Opinion

MetalRT's performance claims are impressive, particularly that a third-party engine outpaces Apple's own optimized MLX framework on Apple's hardware. The 658 tok/s figure, while eye-catching, applies only to the smallest 0.6B-parameter model; the more relevant 4B-model result of 186 tok/s is still strong but less sensational. What's most significant here is the growing ecosystem of highly optimized local inference engines, which collectively push the boundaries of on-device AI and make privacy-preserving applications increasingly viable. As with all performance benchmarks, however, real-world results will vary with use case, thermal constraints, and sustained workloads beyond these burst tests.

Large Language Models (LLMs) · MLOps & Infrastructure · AI Hardware · Product Launch · Open Source

© 2026 BotBeat