vLLM-MLX Brings High-Speed LLM Inference to Apple Silicon with 65 Tokens Per Second
Key Takeaways
- vLLM-MLX enables local LLM inference on Apple Silicon at 65 tok/s on M3 Ultra, with peak speeds exceeding 400 tok/s
- The server provides OpenAI- and Anthropic-compatible APIs, supporting tool calling, multimodal models, and continuous batching
- Persistent prompt caching delivers 10-15x speedups in multi-turn conversations by avoiding redundant token processing
Summary
A new open-source project called vLLM-MLX is enabling fast large language model inference on Apple Silicon devices, achieving 65 tokens per second on M3 Ultra hardware. Built on the MLX framework, the project provides an OpenAI- and Anthropic-compatible server that runs entirely on Mac computers, supporting models like Llama, Qwen-VL, and LLaVA with features including continuous batching, tool calling, and multimodal capabilities.
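Because the server speaks the OpenAI API, any standard chat-completions request works against it. The sketch below builds such a request body; the base URL, port, and model name are illustrative assumptions, not values documented by the project.

```python
import json

# Assumed local endpoint for a vLLM-MLX server; the actual host/port
# depend on how the server is launched.
BASE_URL = "http://localhost:8000/v1"

# A standard OpenAI chat-completions payload. The model name is a
# hypothetical MLX-quantized checkpoint used purely for illustration.
payload = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize MLX in one sentence."},
    ],
    "stream": True,  # request token-by-token streaming
}

# POST this body to f"{BASE_URL}/chat/completions" with any HTTP client,
# or point the official openai SDK at BASE_URL via its base_url argument.
print(json.dumps(payload, indent=2))
```

Since the API surface is unchanged, existing OpenAI-based tooling should work by swapping the base URL alone.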
The project, maintained by developer raullen as a fork of waybarrios/vllm-mlx, adds 37 commits of production-grade enhancements aimed specifically at coding agents. Key improvements include robust tool calling in both streaming and non-streaming modes, reasoning separation that cleanly isolates a model's reasoning traces from its final output, and persistent prompt caching that delivers 10-15x speedups in multi-turn conversations by saving over 20,000 tokens of prefill on cache hits.
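Tool calling on an OpenAI-compatible server follows the familiar function-schema convention, and the prompt-caching win comes from the fact that each agent turn re-sends the same growing message prefix. The sketch below builds such a request; the tool schema and model name are hypothetical examples, not part of the project's documented API.

```python
import json

# Illustrative function schema of the kind a coding agent might register.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",  # hypothetical tool name
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

# On each turn the agent re-sends the full conversation plus tool schemas.
# That repeated prefix is exactly what persistent prompt caching skips on
# a hit, avoiding thousands of tokens of redundant prefill.
request = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",  # assumed model id
    "messages": [{"role": "user", "content": "Open README.md"}],
    "tools": tools,
}
print(json.dumps(request)[:60])
```

A response that decides to call the tool would carry a `tool_calls` entry in the assistant message, which the agent executes locally and feeds back as a `tool`-role message.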
The implementation supports the Model Context Protocol (MCP) for tool integration and works with various AI coding assistants. With reported peak speeds of over 400 tokens per second in optimal configurations and native MLX backend support, vLLM-MLX represents a significant advancement in making powerful LLM inference accessible on consumer Apple hardware without requiring cloud services or external GPUs.
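For agents that speak Anthropic's Messages API instead, the server's Anthropic-compatible endpoint accepts the same request shape that API defines. The sketch below constructs such a body; the endpoint path, model name, and port are assumptions based on the Messages API convention, not values confirmed by the project.

```python
import json

# Anthropic Messages API-style request body. The model id is a
# hypothetical MLX checkpoint chosen for illustration.
request = {
    "model": "mlx-community/Qwen2.5-7B-Instruct-4bit",
    "max_tokens": 512,  # the Messages API requires an explicit cap
    "messages": [
        {"role": "user", "content": "Write a unit test for parse()."}
    ],
}

# Conventionally this would be POSTed to the server's /v1/messages path
# (assumed here to be http://localhost:8000/v1/messages) with an
# "anthropic-version" header, mirroring the upstream Messages API.
print(json.dumps(request))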



