Meta Launches MLX Delegate for ExecuTorch: GPU-Accelerated PyTorch on Apple Silicon
Key Takeaways
- 3-6x performance improvement: The MLX Delegate delivers higher throughput on generative AI workloads than existing ExecuTorch backends on macOS
- PyTorch 2 native integration: It builds directly on torch.export and TorchAO quantization tools, so new models and techniques are supported as they land in PyTorch
- Flexible quantization and cross-platform portability: Its precision options carry over to other ExecuTorch backends, so a single quantized model can be deployed across different hardware platforms
Summary
Meta has released the MLX Delegate, a new backend for ExecuTorch that enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs. The delegate integrates with PyTorch 2's export stack and uses Apple's MLX framework to run optimized Metal GPU kernels, achieving 3-6x higher throughput than existing ExecuTorch backends on macOS.
The MLX Delegate supports the operations that transformer inference depends on, including quantized matrix multiplication, multi-head attention, rotary position embeddings, and mixture-of-experts routing. It offers several precision and quantization options (BF16, FP16, FP32, 2/4/8-bit affine quantization, and NVIDIA's NVFP4), allowing developers to trade off performance against model size on memory-constrained Apple Silicon devices.
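To make "affine quantization" concrete, here is a minimal pure-Python sketch of the general technique: each group of weights is mapped to small integer codes plus a per-group scale and offset. It is a generic illustration, not the MLX or TorchAO kernel. The function names, the group size of 64, and the 16-bit scale and offset are assumptions chosen for the example.

```python
# Toy group-wise affine quantization (illustrative only; all names are
# hypothetical and this is not the MLX or TorchAO implementation).

def quantize_group(weights, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo  # lo serves as the per-group offset


def dequantize_group(codes, scale, offset):
    """Recover approximate floats: w ~ code * scale + offset."""
    return [c * scale + offset for c in codes]


def bits_per_weight(bits, group_size, meta_bits=16):
    """Effective storage per weight, counting one scale and one offset per group."""
    return bits + 2 * meta_bits / group_size


group = [-0.50, -0.20, 0.00, 0.10, 0.25, 0.40, 0.70, 1.00]
codes, scale, offset = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, offset)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(codes, scale, max_error)
print(bits_per_weight(4, 64))  # 4.5 under the assumptions above
```

The rounding error per weight is bounded by half the scale, which is why lower bit widths (fewer levels, larger scale) cost accuracy while shrinking the model. Under the assumed layout, 4-bit weights cost about 4.5 bits each once the per-group metadata is counted, compared with 16 bits for BF16 or FP16.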
The delegate has been validated across diverse model architectures, including dense transformers such as Llama, Qwen, and Gemma, sparse Mixture-of-Experts models, and speech-to-text systems such as Whisper and Voxtral. Because it plugs directly into the PyTorch 2 export ecosystem, developers can target multiple backends with a single quantized model, and the same portable runtime API works across MLX, XNNPACK, CoreML, Vulkan, and CUDA without application-level changes. The delegate is currently experimental and under active development.
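The idea of a delegate can be pictured with a toy partitioner: operations the backend supports are grouped into segments handed to it, and everything else falls back to a default path. The sketch below is a conceptual model only. It does not use the ExecuTorch API, and the op names, backend op sets, and `partition` helper are all hypothetical.

```python
# Toy model of delegate partitioning (hypothetical names throughout;
# this is not the ExecuTorch API).

# Assumed op sets for two imaginary backends.
GPU_BACKEND_OPS = {"linear", "attention", "rope", "softmax"}
CPU_BACKEND_OPS = {"linear", "softmax", "embedding"}


def partition(ops, supported):
    """Group consecutive ops into (target, [ops]) segments."""
    segments = []
    for op in ops:
        target = "delegate" if op in supported else "fallback"
        if segments and segments[-1][0] == target:
            # Extend the current segment while the target stays the same.
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments


# One exported graph, lowered twice against different backends.
graph = ["embedding", "rope", "attention", "linear", "topk", "linear"]
gpu_plan = partition(graph, GPU_BACKEND_OPS)
cpu_plan = partition(graph, CPU_BACKEND_OPS)

for name, plan in (("gpu", gpu_plan), ("cpu", cpu_plan)):
    print(name, plan)
```

The same graph yields a different plan per backend, but the caller's code does not change. That is the property the portable runtime API is described as providing.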
Editorial Opinion
This is a strategic move that strengthens PyTorch's ecosystem for on-device AI inference on Apple Silicon, a critical growth area as developers increasingly seek to run powerful models locally. By tightly integrating with PyTorch 2's export infrastructure rather than creating a standalone tool, Meta has positioned the MLX Delegate to automatically benefit from future PyTorch advancements. The 3-6x performance gains are significant enough to make Apple Silicon a viable platform for production inference workloads. While the experimental status warrants cautious adoption initially, this demonstrates Meta's commitment to supporting diverse hardware platforms through ExecuTorch.



