BotBeat
Independent/Open Source · OPEN SOURCE · 2026-02-26

vLLM-MLX Brings High-Speed LLM Inference to Apple Silicon with 65 Tokens Per Second

Key Takeaways

  • vLLM-MLX enables local LLM inference on Apple Silicon at 65 tok/s on M3 Ultra, with peak speeds exceeding 400 tok/s
  • The server provides OpenAI- and Anthropic-compatible APIs, supporting tool calling, multimodal models, and continuous batching
  • Persistent prompt caching delivers 10-15x speedups in multi-turn conversations by avoiding redundant token processing
Source: Hacker News (https://github.com/raullenchai/vllm-mlx)

Summary

A new open-source project called vLLM-MLX enables fast large language model inference on Apple Silicon devices, achieving speeds of up to 65 tokens per second on M3 Ultra hardware. Built on Apple's MLX framework, the tool provides an OpenAI- and Anthropic-compatible server that runs entirely on Mac computers, supporting models such as Llama, Qwen-VL, and LLaVA with features including continuous batching, tool calling, and multimodal capabilities.
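Because the server speaks the OpenAI chat-completions wire format, existing clients only need their base URL pointed at the local machine. A minimal sketch of the request shape (the port, endpoint path, and model name below are assumptions for illustration, not taken from the project's documentation):

```python
import json

# Assumed local endpoint; vLLM-MLX advertises an OpenAI-compatible API,
# so an OpenAI-style client would target it by overriding the base URL.
BASE_URL = "http://localhost:8000/v1/chat/completions"  # port is a guess

def chat_request(model: str, messages: list, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {"model": model, "messages": messages, "stream": stream}

payload = chat_request(
    "llama-3.1-8b",  # illustrative model name
    [{"role": "user", "content": "Summarize MLX in one sentence."}],
)
body = json.dumps(payload)  # POST this to BASE_URL with any HTTP client
```

Since the wire format is unchanged, the same payload works against a cloud endpoint or the local server; only the URL differs.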

The project, maintained by developer raullen as a fork of waybarrios/vllm-mlx, adds 37 commits of production-grade enhancements aimed specifically at coding agents. Key improvements include robust tool-calling support in both streaming and non-streaming modes, reasoning separation that cleanly splits the model's reasoning traces from its final content, and persistent prompt caching that delivers 10-15x speedups in multi-turn conversations by saving more than 20,000 tokens of prefill on cache hits.
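The prefill savings come from the fact that successive turns of a conversation share a long common prefix (system prompt plus prior turns) that has already been processed. A toy sketch of the bookkeeping, using text-level prefix matching purely for illustration (the real server caches computed KV state, not strings):

```python
import hashlib

class PromptCache:
    """Toy prefix cache: if a new prompt starts with a previously
    processed prefix, only the suffix needs fresh prefill."""

    def __init__(self):
        self._cached = {}  # sha256(prefix) -> prefix length

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    def store(self, prompt: str) -> None:
        """Record that this full prompt has been prefilled."""
        self._cached[self._key(prompt)] = len(prompt)

    def reusable_prefix(self, prompt: str) -> int:
        """Return the longest cached prefix length of `prompt` (0 if none)."""
        best = 0
        for length in self._cached.values():
            if length <= len(prompt) and self._key(prompt[:length]) in self._cached:
                best = max(best, length)
        return best

cache = PromptCache()
history = "system: be terse\nuser: turn one\nassistant: reply one\n"
cache.store(history)
# The next turn appends to the history, so the entire history's
# prefill can be skipped; only the new turn is processed.
saved = cache.reusable_prefix(history + "user: turn two\n")
```

Here `saved` equals the full length of `history`, which is the portion of the prompt the server would not need to re-prefill; at 20,000+ cached tokens this is where the reported 10-15x multi-turn speedup comes from.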

The implementation supports the Model Context Protocol (MCP) for tool integration and works with a range of AI coding assistants. With reported speeds exceeding 400 tokens per second in optimal configurations and native MLX backend support, vLLM-MLX marks a significant step toward making powerful LLM inference accessible on consumer Apple hardware without cloud services or external GPUs.
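Tool calling over an OpenAI-style API works by advertising JSON-schema function declarations in a `tools` array; the model then replies with structured `tool_calls` instead of free text. A hedged sketch of the request side (the `read_file` tool name and its schema are invented for illustration; in an MCP setup, the MCP server would supply the declarations):

```python
def tool_call_request(model: str, user_msg: str, tools: list) -> dict:
    """Build an OpenAI-style chat request that advertises callable tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }

# Invented example tool declaration in OpenAI's function-calling schema.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

req = tool_call_request("llama-3.1-8b", "Open README.md", [read_file_tool])
```

Supporting this shape in both streaming and non-streaming modes is what lets coding agents, which issue many tool calls per task, run against the local server unmodified.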

  • The project is open source and specifically optimized for coding agents with reasoning separation and MCP tool integration

Tags: Large Language Models (LLMs) · MLOps & Infrastructure · AI Hardware · Creative Industries · Open Source
