Apple Releases MLX-OptIQ: Per-Layer Mixed-Precision Quantization for LLMs on Apple Silicon

Key Takeaways

▸Per-layer mixed-precision quantization achieves 3.1x average compression vs bf16 while maintaining model capability through selective bit allocation
▸16 production-ready models available on Hugging Face, optimized for Apple Silicon; Qwen3.6-27B reaches Capability Score 83.0 at 17.5 GB
▸Complete local inference pipeline including OptIQ Lab GUI, OpenAI/Anthropic API compatibility, vision model support, and speculative decoding

Source:

Hacker Newshttps://mlx-optiq.com/↗

Summary

Apple has launched mlx-optiq, a Python toolkit that enables efficient quantization, fine-tuning, and deployment of large language models directly on Apple Silicon (M1-M5 chips). The tool uses per-layer sensitivity analysis via KL-divergence to apply mixed-precision quantization, keeping sensitive layers at higher precision while compressing robust layers to 4-bit, achieving 3.1x average compression compared to bf16. Users can run powerful LLMs locally on their Macs without GPU clusters or API keys, with OptIQ Lab providing a graphical workbench for model management and serving.

The toolkit ships with 16 pre-built quantized production models on Hugging Face, including Google's Gemma-4, NVIDIA's Nemotron 3, and Qwen models ranging from 1B to 35B parameters. Flagship models like Qwen3.6-27B achieve a Capability Score of 83.0 in just 17.5 GB, while Qwen3.5-9B fits in 6.6 GB and runs at 90 tokens/second completely offline. The toolkit integrates seamlessly with stock MLX tools and offers OpenAI and Anthropic API compatibility, allowing users to point tools like Claude Code to their local quantized models with full vision support.

Offline operation with no cloud dependency; Qwen3.5-9B runs in 6.6 GB with 90 tokens/second and 64k context support

Editorial Opinion

MLX-OptIQ democratizes on-device AI inference for Mac users, making frontier-class models practical without cloud infrastructure. The per-layer sensitivity approach elegantly solves the compression-capability tradeoff, showing that smart quantization can preserve performance better than uniform bit allocation. This toolkit could establish Apple Silicon as a serious platform for private, low-latency LLM applications.

Apple

PRODUCT LAUNCH Apple2026-06-14

Apple Releases MLX-OptIQ: Per-Layer Mixed-Precision Quantization for LLMs on Apple Silicon

Key Takeaways

▸Per-layer mixed-precision quantization achieves 3.1x average compression vs bf16 while maintaining model capability through selective bit allocation
▸16 production-ready models available on Hugging Face, optimized for Apple Silicon; Qwen3.6-27B reaches Capability Score 83.0 at 17.5 GB
▸Complete local inference pipeline including OptIQ Lab GUI, OpenAI/Anthropic API compatibility, vision model support, and speculative decoding

Source:

Hacker Newshttps://mlx-optiq.com/↗

Summary

Offline operation with no cloud dependency; Qwen3.5-9B runs in 6.6 GB with 90 tokens/second and 64k context support

Editorial Opinion

MLX-OptIQ democratizes on-device AI inference for Mac users, making frontier-class models practical without cloud infrastructure. The per-layer sensitivity approach elegantly solves the compression-capability tradeoff, showing that smart quantization can preserve performance better than uniform bit allocation. This toolkit could establish Apple Silicon as a serious platform for private, low-latency LLM applications.

Apple Releases MLX-OptIQ: Per-Layer Mixed-Precision Quantization for LLMs on Apple Silicon

Key Takeaways

Summary

Editorial Opinion

More from Apple

Apple Surpasses Nvidia as Second $5 Trillion Company on Restrained AI Spending Strategy

Apple Reclaims Most Valuable Company Crown as Investors Reward Measured AI Investment Approach

Hardware-Level Solution Proposed for Regulatory AI Assistant Interoperability on Apple and Android

Comments

Suggested

How Moonshot AI Obtained Prohibited Nvidia Blackwell Chips for Kimi K3 Training

Claude Opus 5 Engages in Sophisticated Deception in Vending Machine Simulation

Self-Improving Agents Achieve Up to 16% Speed Gains on Major LLM Inference

Apple Releases MLX-OptIQ: Per-Layer Mixed-Precision Quantization for LLMs on Apple Silicon

Key Takeaways

Summary

Editorial Opinion

More from Apple

Apple Surpasses Nvidia as Second $5 Trillion Company on Restrained AI Spending Strategy

Apple Reclaims Most Valuable Company Crown as Investors Reward Measured AI Investment Approach

Hardware-Level Solution Proposed for Regulatory AI Assistant Interoperability on Apple and Android

Comments

Suggested

How Moonshot AI Obtained Prohibited Nvidia Blackwell Chips for Kimi K3 Training

Claude Opus 5 Engages in Sophisticated Deception in Vending Machine Simulation

Self-Improving Agents Achieve Up to 16% Speed Gains on Major LLM Inference