BotBeat
...
← Back

> ▌

Independent/Open SourceIndependent/Open Source
OPEN SOURCEIndependent/Open Source2026-02-26

vLLM-MLX Brings High-Speed LLM Inference to Apple Silicon with 65 Tokens Per Second

Key Takeaways

  • ▸vLLM-MLX enables local LLM inference on Apple Silicon at 65 tok/s on M3 Ultra, with peak speeds exceeding 400 tok/s
  • ▸The server provides OpenAI and Anthropic-compatible APIs, supporting tool calling, multimodal models, and continuous batching
  • ▸Persistent prompt caching delivers 10-15x speedups in multi-turn conversations by avoiding redundant token processing
Source:
Hacker Newshttps://github.com/raullenchai/vllm-mlx↗

Summary

A new open-source project called vLLM-MLX is enabling fast large language model inference on Apple Silicon devices, achieving speeds of up to 65 tokens per second on M3 Ultra hardware. Built on the MLX framework, the tool provides an OpenAI and Anthropic-compatible server that runs entirely on Mac computers, supporting models like Llama, Qwen-VL, and LLaVA with features including continuous batching, tool calling, and multimodal capabilities.

The project, maintained by developer raullen as a fork of waybarrios/vllm-mlx, adds 37 commits with production-grade enhancements specifically designed for coding agents. Key improvements include robust tool calling support in both streaming and non-streaming modes, reasoning separation that cleanly isolates reasoning from content output, and persistent prompt caching that delivers 10-15x speedups in multi-turn conversations by saving over 20,000 tokens of prefill on cache hits.

The implementation supports the Model Context Protocol (MCP) for tool integration and works with various AI coding assistants. With reported speeds of up to 400+ tokens per second in optimal configurations and native MLX backend support, vLLM-MLX represents a significant advancement in making powerful LLM inference accessible on consumer Apple hardware without requiring cloud services or external GPUs.

  • The project is open source and specifically optimized for coding agents with reasoning separation and MCP tool integration
Large Language Models (LLMs)MLOps & InfrastructureAI HardwareCreative IndustriesOpen Source

More from Independent/Open Source

Independent/Open SourceIndependent/Open Source
PRODUCT LAUNCH

ArrowJS: A Lightweight UI Framework Purpose-Built for AI Agents

2026-03-24
Independent/Open SourceIndependent/Open Source
PRODUCT LAUNCH

SYNX Configuration Format Promises 67× Faster Parsing Than YAML for AI Pipelines

2026-03-07
Independent/Open SourceIndependent/Open Source
OPEN SOURCE

Squawk: Open-Source Tool Detects Behavioral Anti-Patterns in AI Coding Agents

2026-03-06

Comments

Suggested

Google / AlphabetGoogle / Alphabet
PRODUCT LAUNCH

Google DeepMind Launches Gemini 3.5 Flash: New Lightweight AI Model

2026-05-20
Executive Office of the President of the United States (Policy/Regulation)Executive Office of the President of the United States (Policy/Regulation)
RESEARCH

SID Achieves Search Breakthrough with SID-1, Outperforming GPT-5 at 1k+ QPS Using Reinforcement Learning

2026-05-20
OpenAIOpenAI
RESEARCH

OpenAI Model Solves 80-Year-Old Planar Unit Distance Problem, Disproving Long-Held Mathematical Assumption

2026-05-20
← Back to news
© 2026 BotBeat
AboutPrivacy PolicyTerms of ServiceContact Us