vLLM-MLX Brings High-Speed LLM Inference to Apple Silicon with 65 Tokens Per Second
Key Takeaways
- vLLM-MLX enables local LLM inference on Apple Silicon at 65 tok/s on M3 Ultra, with peak speeds exceeding 400 tok/s
- The server provides OpenAI- and Anthropic-compatible APIs, supporting tool calling, multimodal models, and continuous batching
- Persistent prompt caching delivers 10-15x speedups in multi-turn conversations by avoiding redundant token processing
Summary
A new open-source project called vLLM-MLX is enabling fast large language model inference on Apple Silicon devices, achieving 65 tokens per second on M3 Ultra hardware. Built on the MLX framework, the project provides an OpenAI- and Anthropic-compatible server that runs entirely on Mac computers, supporting models like Llama, Qwen-VL, and LLaVA with features including continuous batching, tool calling, and multimodal capabilities.
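Because the server speaks the OpenAI API, any standard chat-completions request works against it. The sketch below builds such a request body; the base URL, port, and model name are illustrative assumptions, not values documented by the project.

```python
import json

# Assumed local endpoint for a vLLM-MLX server; the actual host/port
# depend on how the server is launched.
BASE_URL = "http://localhost:8000/v1"

# A standard OpenAI chat-completions payload. The model name is a
# hypothetical MLX-quantized checkpoint used purely for illustration.
payload = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize MLX in one sentence."},
    ],
    "stream": True,  # request token-by-token streaming
}

# POST this body to f"{BASE_URL}/chat/completions" with any HTTP client,
# or point the official openai SDK at BASE_URL via its base_url argument.
print(json.dumps(payload, indent=2))
```

Since the API surface is unchanged, existing OpenAI-based tooling should work by swapping the base URL alone.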
The project, maintained by developer raullen as a fork of waybarrios/vllm-mlx, adds 37 commits of production-grade enhancements aimed specifically at coding agents. Key improvements include robust tool calling in both streaming and non-streaming modes, reasoning separation that cleanly isolates a model's reasoning traces from its final output, and persistent prompt caching that delivers 10-15x speedups in multi-turn conversations by saving over 20,000 tokens of prefill on cache hits.
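Tool calling on an OpenAI-compatible server follows the familiar function-schema convention, and the prompt-caching win comes from the fact that each agent turn re-sends the same growing message prefix. The sketch below builds such a request; the tool schema and model name are hypothetical examples, not part of the project's documented API.

```python
import json

# Illustrative function schema of the kind a coding agent might register.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",  # hypothetical tool name
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

# On each turn the agent re-sends the full conversation plus tool schemas.
# That repeated prefix is exactly what persistent prompt caching skips on
# a hit, avoiding thousands of tokens of redundant prefill.
request = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",  # assumed model id
    "messages": [{"role": "user", "content": "Open README.md"}],
    "tools": tools,
}
print(json.dumps(request)[:60])
```

A response that decides to call the tool would carry a `tool_calls` entry in the assistant message, which the agent executes locally and feeds back as a `tool`-role message.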
The implementation supports the Model Context Protocol (MCP) for tool integration and works with various AI coding assistants. With reported peak speeds of over 400 tokens per second in optimal configurations and native MLX backend support, vLLM-MLX represents a significant advancement in making powerful LLM inference accessible on consumer Apple hardware without requiring cloud services or external GPUs.
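For agents that speak Anthropic's Messages API instead, the server's Anthropic-compatible endpoint accepts the same request shape that API defines. The sketch below constructs such a body; the endpoint path, model name, and port are assumptions based on the Messages API convention, not values confirmed by the project.

```python
import json

# Anthropic Messages API-style request body. The model id is a
# hypothetical MLX checkpoint chosen for illustration.
request = {
    "model": "mlx-community/Qwen2.5-7B-Instruct-4bit",
    "max_tokens": 512,  # the Messages API requires an explicit cap
    "messages": [
        {"role": "user", "content": "Write a unit test for parse()."}
    ],
}

# Conventionally this would be POSTed to the server's /v1/messages path
# (assumed here to be http://localhost:8000/v1/messages) with an
# "anthropic-version" header, mirroring the upstream Messages API.
print(json.dumps(request))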



