BotBeat
...
← Back

> ▌

AppleApple
PRODUCT LAUNCHApple2026-06-14

Apple Releases MLX-OptIQ: Per-Layer Mixed-Precision Quantization for LLMs on Apple Silicon

Key Takeaways

  • ▸Per-layer mixed-precision quantization achieves 3.1x average compression vs bf16 while maintaining model capability through selective bit allocation
  • ▸16 production-ready models available on Hugging Face, optimized for Apple Silicon; Qwen3.6-27B reaches Capability Score 83.0 at 17.5 GB
  • ▸Complete local inference pipeline including OptIQ Lab GUI, OpenAI/Anthropic API compatibility, vision model support, and speculative decoding
Source:
Hacker Newshttps://mlx-optiq.com/↗

Summary

Apple has launched mlx-optiq, a Python toolkit that enables efficient quantization, fine-tuning, and deployment of large language models directly on Apple Silicon (M1-M5 chips). The tool uses per-layer sensitivity analysis via KL-divergence to apply mixed-precision quantization, keeping sensitive layers at higher precision while compressing robust layers to 4-bit, achieving 3.1x average compression compared to bf16. Users can run powerful LLMs locally on their Macs without GPU clusters or API keys, with OptIQ Lab providing a graphical workbench for model management and serving.

The toolkit ships with 16 pre-built quantized production models on Hugging Face, including Google's Gemma-4, NVIDIA's Nemotron 3, and Qwen models ranging from 1B to 35B parameters. Flagship models like Qwen3.6-27B achieve a Capability Score of 83.0 in just 17.5 GB, while Qwen3.5-9B fits in 6.6 GB and runs at 90 tokens/second completely offline. The toolkit integrates seamlessly with stock MLX tools and offers OpenAI and Anthropic API compatibility, allowing users to point tools like Claude Code to their local quantized models with full vision support.

  • Offline operation with no cloud dependency; Qwen3.5-9B runs in 6.6 GB with 90 tokens/second and 64k context support

Editorial Opinion

MLX-OptIQ democratizes on-device AI inference for Mac users, making frontier-class models practical without cloud infrastructure. The per-layer sensitivity approach elegantly solves the compression-capability tradeoff, showing that smart quantization can preserve performance better than uniform bit allocation. This toolkit could establish Apple Silicon as a serious platform for private, low-latency LLM applications.

Large Language Models (LLMs)Machine LearningAI HardwareOpen Source

More from Apple

AppleApple
PARTNERSHIP

Apple Partners with Google to Supercharge Siri with Gemini AI and Private Cloud Compute

2026-06-12
AppleApple
POLICY & REGULATION

Apple's Siri AI Delayed in EU Due to DMA Regulatory Requirements

2026-06-12
AppleApple
PRODUCT LAUNCH

Apple Unveils Privacy-First Siri AI Redesign for iOS 27

2026-06-11

Comments

Suggested

Truth Benchmark CommunityTruth Benchmark Community
OPEN SOURCE

Truth Benchmark: Open-Source Tool Systematically Detects Code-Documentation Mismatches

2026-06-14
AnthropicAnthropic
PARTNERSHIP

Anthropic Models Now Available Through Microsoft Enterprise Services as Subprocessor

2026-06-14
Google / AlphabetGoogle / Alphabet
INDUSTRY REPORT

AI Security Scanning Extends Vulnerability Detection to 'Long Tail' Software Projects

2026-06-14
← Back to news
© 2026 BotBeat
AboutPrivacy PolicyTerms of ServiceContact Us