Meta Introduces MLX Delegate for GPU-Accelerated PyTorch Inference on Apple Silicon

Key Takeaways

▸3-6x higher throughput for generative AI workloads on Apple Silicon compared to existing ExecuTorch CPU/AOTI backends
▸Native integration with the PyTorch 2 export pipeline and TorchAO quantization keeps it compatible with the latest models and techniques
▸Supports around 90 ATen operations covering dense transformers, MoE models, and speech-to-text inference

Source:

PyTorch blog, via Hacker News: https://pytorch.org/blog/running-pytorch-models-on-apple-silicon-gpus-with-the-executorch-mlx-delegate/

Summary

Meta has released the ExecuTorch MLX Delegate, a new backend that runs PyTorch models with GPU acceleration on Apple Silicon Macs using Apple's MLX framework. The delegate plugs into the PyTorch 2 export stack and follows the standard ExecuTorch workflow: export the model with torch.export, lower it with MLXPartitioner, and run the resulting .pte file with the ExecuTorch runtime.
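The export, lower, and save steps of that workflow can be sketched as below. This is a minimal sketch, not the delegate's documented recipe: the helper name `lower_to_pte` is hypothetical, and the partitioner is passed in as an argument because this summary does not state the import path for MLXPartitioner. The calls to `torch.export.export` and `executorch.exir.to_edge_transform_and_lower` are the general ExecuTorch lowering entry points.

```python
def lower_to_pte(model, example_inputs, partitioner, out_path):
    """Export `model`, lower it with `partitioner`, and write a .pte file.

    `lower_to_pte` is a hypothetical helper written for this illustration.
    `example_inputs` must be a tuple of tensors matching the model's forward().
    """
    # Imported lazily so the sketch can be loaded without torch/executorch installed.
    import torch
    from executorch.exir import to_edge_transform_and_lower

    # Step 1: capture the model as an ExportedProgram.
    exported = torch.export.export(model.eval(), example_inputs)

    # Step 2: hand supported subgraphs to the backend chosen by the partitioner.
    lowered = to_edge_transform_and_lower(exported, partitioner=[partitioner])

    # Step 3: serialize to the .pte format consumed by the ExecuTorch runtime.
    program = lowered.to_executorch()
    with open(out_path, "wb") as f:
        f.write(program.buffer)
    return out_path
```

To target the MLX delegate, one would pass an MLXPartitioner instance as `partitioner`; check the ExecuTorch documentation for its import path and constructor arguments.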

By dispatching operations to MLX's optimized Metal GPU kernels, the MLX delegate achieves 3-6x higher throughput on generative AI workloads than existing ExecuTorch backends on macOS. It supports around 90 ATen operations that transformer inference relies on, including quantized matrix multiplication, multi-head attention, rotary position embeddings, mixture-of-experts routing, and recurrent state-space operations.
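To make the idea of a bounded operator set concrete, here is a toy illustration of how a partitioner might split a model into delegated and non-delegated segments. It is not ExecuTorch's partitioner: the `SUPPORTED` set is a hypothetical subset chosen for the example, `aten.my_custom_op` is an invented unsupported op, and a real graph is a DAG rather than the flat sequence assumed here.

```python
from itertools import groupby

# Hypothetical subset of supported ops; the delegate's real list is not reproduced here.
SUPPORTED = {"aten.linear", "aten.matmul", "aten.softmax", "aten.mul", "aten.add"}


def partition(ops):
    """Group a flat op sequence into contiguous (is_delegated, [ops]) segments."""
    return [
        (is_supported, list(group))
        for is_supported, group in groupby(ops, key=lambda op: op in SUPPORTED)
    ]


ops = ["aten.linear", "aten.softmax", "aten.my_custom_op", "aten.matmul", "aten.add"]
segments = partition(ops)

for is_delegated, segment in segments:
    target = "delegate" if is_delegated else "fallback"
    print(f"{target}: {segment}")
```

In this toy model each unsupported op splits the sequence into separate delegated segments, which is one reason broad operator coverage matters for a backend.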

The delegate runs a range of model architectures, including dense transformers (Llama, Qwen, Gemma), sparse Mixture-of-Experts models, and speech-to-text systems (Whisper, Voxtral, Parakeet) for both offline and real-time transcription. It supports BF16, FP16, and FP32 precision as well as 2/4/8-bit affine and NVFP4 quantization. Because ExecuTorch exposes one runtime API across its XNNPACK, CoreML, Vulkan, CUDA, and MLX backends, a single quantized model definition can target several of them.
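As a rough illustration of what b-bit affine quantization means, the sketch below maps one group of weights to integer codes using a scale and an offset derived from the group's minimum and maximum. This is the common min/max affine scheme, assumed here for illustration; it is not taken from the delegate or from TorchAO, and the function names are hypothetical.

```python
def quantize_affine(values, bits=4):
    """Quantize one group of floats to integer codes in [0, 2**bits - 1].

    Returns (codes, scale, offset) such that value ~= code * scale + offset.
    """
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    # Fall back to a scale of 1.0 when all values are equal (avoids dividing by zero).
    scale = (hi - lo) / levels or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_affine(codes, scale, offset):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + offset for c in codes]


weights = [-1.0, -0.62, -0.1, 0.33, 1.0]
codes, scale, offset = quantize_affine(weights, bits=4)
restored = dequantize_affine(codes, scale, offset)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With 4 bits each weight is stored as one of 16 codes, and the reconstruction error per weight is bounded by about half the scale. Fewer bits shrink the model further at the cost of a coarser grid.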

The unified ExecuTorch runtime API is available in both C++ and Python, which keeps applications portable across these hardware backends.

