Apple Releases MLX-OptIQ: Per-Layer Mixed-Precision Quantization for LLMs on Apple Silicon
Key Takeaways
- ▸Per-layer mixed-precision quantization achieves 3.1x average compression vs bf16 while maintaining model capability through selective bit allocation
- ▸16 production-ready models available on Hugging Face, optimized for Apple Silicon; Qwen3.6-27B reaches Capability Score 83.0 at 17.5 GB
- ▸Complete local inference pipeline including OptIQ Lab GUI, OpenAI/Anthropic API compatibility, vision model support, and speculative decoding
Summary
Apple has launched mlx-optiq, a Python toolkit that enables efficient quantization, fine-tuning, and deployment of large language models directly on Apple Silicon (M1-M5 chips). The tool uses per-layer sensitivity analysis via KL-divergence to apply mixed-precision quantization, keeping sensitive layers at higher precision while compressing robust layers to 4-bit, achieving 3.1x average compression compared to bf16. Users can run powerful LLMs locally on their Macs without GPU clusters or API keys, with OptIQ Lab providing a graphical workbench for model management and serving.
The toolkit ships with 16 pre-built quantized production models on Hugging Face, including Google's Gemma-4, NVIDIA's Nemotron 3, and Qwen models ranging from 1B to 35B parameters. Flagship models like Qwen3.6-27B achieve a Capability Score of 83.0 in just 17.5 GB, while Qwen3.5-9B fits in 6.6 GB and runs at 90 tokens/second completely offline. The toolkit integrates seamlessly with stock MLX tools and offers OpenAI and Anthropic API compatibility, allowing users to point tools like Claude Code to their local quantized models with full vision support.
- Offline operation with no cloud dependency; Qwen3.5-9B runs in 6.6 GB with 90 tokens/second and 64k context support
Editorial Opinion
MLX-OptIQ democratizes on-device AI inference for Mac users, making frontier-class models practical without cloud infrastructure. The per-layer sensitivity approach elegantly solves the compression-capability tradeoff, showing that smart quantization can preserve performance better than uniform bit allocation. This toolkit could establish Apple Silicon as a serious platform for private, low-latency LLM applications.



