MLX-Serve: New Native LLM Runtime Brings Fast AI Inference to Mac
Key Takeaways
- MLX-Serve delivers native LLM inference for Apple Silicon, reaching 220 tokens/sec prefill and 37 tokens/sec decode
- The inference server is written in Zig and the macOS menu bar app in Swift, with no Python dependencies, targeting minimal overhead and maximum efficiency
- An OpenAI-compatible API ensures drop-in compatibility with existing client libraries and tools
Summary
MLX-Serve is a new open-source inference server built natively for Apple Silicon Macs, offering developers a lightweight alternative to cloud-based LLM deployment. Written entirely in Zig with direct bindings to MLX-C, the runtime eliminates Python dependencies while delivering impressive performance metrics: 220 tokens/second for prefill and 37 tokens/second for decode operations. The project includes both a command-line server with an OpenAI-compatible API and a native macOS menu bar application for easy model management and interaction.
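Because the API follows the OpenAI wire format, existing clients can point at the local server with no code changes beyond the base URL. The sketch below uses only the Python standard library; the port, endpoint path, and model identifier are assumptions following OpenAI conventions, not values documented by the project.

```python
import json
from urllib import request

# Assumed local endpoint: MLX-Serve exposes an OpenAI-compatible API, so the
# standard /v1/chat/completions path should apply. The port and the model
# identifier below are illustrative assumptions, not documented values.
BASE_URL = "http://localhost:8080/v1"

def build_chat_request(model: str, user_message: str) -> dict:
    """Construct a standard OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }

def chat(payload: dict) -> dict:
    """POST the payload to the local server and return the parsed response."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("mlx-community/Llama-3-8B-4bit", "Hello!")
# chat(payload)  # requires a running MLX-Serve instance
```

Because the format matches OpenAI's, the same payload works with any existing OpenAI SDK by overriding its base URL.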
The toolset supports a wide range of popular models including Meta's Llama series, Mistral AI models, Google's Gemma, and Alibaba's MoE variants, all in quantized MLX format. Users can download models directly from HuggingFace through the desktop application with resumable transfers. MLX-Serve distinguishes itself with built-in capabilities including real-time SSE streaming, tool calling, function execution, and a prompt-based skill system that allows users to extend agent capabilities by dropping markdown files—no coding required.
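OpenAI-compatible streaming implies the familiar server-sent-events framing: each `data:` line carries a JSON chunk with an incremental text delta, terminated by a `[DONE]` sentinel. A minimal client-side parser for that format, assuming MLX-Serve follows the standard chunk shape:

```python
import json

def parse_sse_chunk(line: str):
    """Extract the text delta from one OpenAI-style SSE line.

    Returns None for non-data lines and for the terminal [DONE] sentinel.
    """
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return None
    chunk = json.loads(data)
    # Streaming chunks carry incremental text under choices[0].delta.content.
    return chunk["choices"][0]["delta"].get("content")

# Example lines in the OpenAI streaming format (simplified for illustration):
lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(t for t in map(parse_sse_chunk, lines) if t)
```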
The project is MIT-licensed and ready to use immediately, with installation via pre-built releases or compilation from source. With zero Python dependencies and a complete inference stack built from scratch, MLX-Serve positions itself as a production-ready solution for running local AI applications on Mac hardware.
- Native macOS application with menu bar integration provides an accessible UI for model management and chat interactions
- Extensible agent framework allows users to add new capabilities through markdown-based skill definitions without code
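The article does not show the skill file format, so everything below is hypothetical: markdown-based skill systems typically pair frontmatter metadata with plain-prose instructions, and a skill dropped into MLX-Serve's skills directory might plausibly resemble the following (field names and layout are illustrative, not the project's actual schema):

```markdown
---
name: summarize-url        # hypothetical field names; the real schema may differ
description: Fetch a web page and summarize it
---
When the user provides a URL, fetch the page content and return a
three-sentence summary followed by a bulleted list of key facts.
```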
Editorial Opinion
MLX-Serve represents an important step toward making on-device AI more practical and accessible for developers. By removing Python runtime overhead and optimizing natively for Apple Silicon, the project demonstrates that capable local inference doesn't require a heavyweight Python stack or a cloud service. The inclusion of agentic capabilities through simple markdown skills could significantly lower the barrier to building sophisticated AI applications on personal hardware.


