BotBeat

Una · PRODUCT LAUNCH · 2026-03-07

RunAnywhere's MetalRT Achieves 658 Tokens/Second on Apple Silicon, Outperforming MLX by 19%

Key Takeaways

  • MetalRT achieved 658 tokens/second decode speed on an Apple M4 Max, outperforming Apple's MLX by 19% and llama.cpp by an average of 1.67x
  • The engine won decode speed benchmarks on 3 of 4 tested models, with time-to-first-token as low as 6.6ms on smaller models
  • MetalRT is optimized for on-device, privacy-first AI applications including chat, coding assistants, agent workflows, and voice pipelines
Source: Hacker News (https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-engine-apple-silicon)

Summary

RunAnywhere has released MetalRT, a new LLM inference engine optimized for Apple Silicon that claims to be the fastest decode engine available for the platform. In comprehensive benchmarks conducted on an M4 Max chip with 64GB of unified memory, MetalRT achieved a peak decode speed of 658 tokens per second on the Qwen3-0.6B model, delivering a 19% performance advantage over Apple's own MLX framework and averaging 1.67x faster performance than llama.cpp across multiple models.
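A quick sanity check of the headline figures: the article reports MetalRT's absolute throughput and its relative advantage over MLX, so the implied MLX throughput on the same model can be derived (the MLX number below is derived from those two figures, not reported directly by the source).

```python
metalrt_tps = 658.0   # reported peak decode speed, Qwen3-0.6B on M4 Max
mlx_advantage = 0.19  # "19% faster than MLX" per the benchmark

# Implied MLX throughput on the same model (derived, not reported):
implied_mlx_tps = metalrt_tps / (1 + mlx_advantage)
print(round(implied_mlx_tps))  # → 553
```

The 1.67x llama.cpp figure is an average across models, so a per-model llama.cpp number cannot be derived the same way.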

The company tested MetalRT against four competing engines (uzu, mlx-lm, llama.cpp, and Ollama) across four language models (Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, and LFM2.5-1.2B), all using 4-bit quantization. MetalRT won on decode speed for three of the four models, with speedups of 1.35-2.14x over llama.cpp and 1.41-2.40x over Ollama. The engine also achieved a 6.6ms time-to-first-token on Qwen3-0.6B, making it well suited to interactive chat and other real-time use cases.
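The two metrics in these benchmarks, time-to-first-token (TTFT) and decode tokens/second, are typically measured around a streaming generation loop. A minimal sketch, assuming a hypothetical `generate` callable that stands in for any engine's streaming API and yields one token at a time:

```python
import time

def measure_decode(generate, prompt, max_tokens=256):
    """Measure TTFT (ms) and steady-state decode speed (tokens/sec).

    `generate` is a hypothetical stand-in for an engine's streaming
    API; it must yield one token at a time for the given prompt.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate(prompt, max_tokens):
        count += 1
        if first is None:
            first = time.perf_counter()  # first token arrived
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000.0
    # Exclude the first token so the rate reflects decode only,
    # not prompt prefill.
    decode_tps = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft_ms, decode_tps
```

Separating prefill from decode this way matters: TTFT is dominated by prompt processing, while tokens/second reflects the sustained decode loop that headline figures like 658 tok/s describe.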

RunAnywhere positions MetalRT as purpose-built for privacy-first, on-device AI applications including chat apps, coding assistants, agent workflows, and voice pipelines. The company emphasizes that the performance gains come from low-level Metal API optimization while maintaining identical output quality to other engines, as the underlying models remain unchanged. By enabling cloud-competitive speeds entirely on-device, MetalRT addresses a growing demand for local AI inference that doesn't compromise on performance.

  • Benchmarks used identical model files where possible, ensuring fair comparisons across engines while maintaining identical output quality

Editorial Opinion

MetalRT's performance claims are impressive, particularly that a third-party engine outpaces Apple's own optimized MLX framework on Apple's hardware. The 658 tok/s figure, while eye-catching, applies only to the smallest 0.6B-parameter model; the more relevant 4B-model result of 186 tok/s is still strong but less sensational. What's most significant here is the growing ecosystem of highly optimized local inference engines, which collectively push the boundaries of on-device AI and make privacy-preserving applications increasingly viable. As with all performance benchmarks, however, real-world results will vary with use case, thermal constraints, and sustained workloads beyond these burst tests.

Large Language Models (LLMs) · MLOps & Infrastructure · AI Hardware · Product Launch · Open Source

© 2026 BotBeat