Researchers Successfully Train Transformer Models on Apple's M4 Neural Engine, Overcoming Hardware Inference Limitations
Key Takeaways
- ▸ Full transformer training (forward pass, backward pass, and gradient updates) is viable on Apple's M4 Neural Engine, despite the hardware being optimized for inference
- ▸ The ANE's static computation graph architecture requires creative workarounds: the CPU handles branching and indexing operations, weight-gradient updates, and Adam optimizer steps, while dynamic weights passed as input data eliminate recompilation overhead
- ▸ The final dynamic-weights pipeline achieved practical training performance by changing how weights are supplied to the kernel: instead of being embedded as compile-time constants, they are streamed in as input data
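The division of labor described above can be sketched in plain Python. This is an illustrative toy only: `ane_forward_backward` is a hypothetical stand-in for the accelerator's static-graph matmuls (simulated here with NumPy), not a real ANE API, while the Adam step runs as ordinary CPU code.

```python
import numpy as np

def ane_forward_backward(w, x, y):
    """Stand-in for the static-graph portion: dense matmuls suit the
    accelerator; here simulated with NumPy. Returns loss and gradient."""
    pred = x @ w                      # forward pass (ANE-friendly matmul)
    err = pred - y
    loss = float(np.mean(err ** 2))
    grad = x.T @ err / len(x)         # backward pass (also a matmul)
    return loss, grad

def cpu_adam_step(w, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam update on the CPU: element-wise, stateful bookkeeping that
    does not fit a branch-free static inference graph."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)         # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy regression problem to exercise the loop.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
true_w = rng.normal(size=(8, 1))
y = x @ true_w
w = np.zeros((8, 1)); m = np.zeros_like(w); v = np.zeros_like(w)
for t in range(1, 201):
    loss, grad = ane_forward_backward(w, x, y)  # "accelerator" side
    w, m, v = cpu_adam_step(w, grad, m, v, t)   # CPU side
```

The loop converges on the toy problem; the point is the boundary, with matmul-heavy forward/backward work on the accelerator side and the stateful optimizer on the CPU.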
Summary
Researchers have demonstrated full transformer model training—including forward pass, backward pass, and gradient computation—directly on Apple's M4 Neural Engine, hardware originally designed for inference only. The breakthrough involved training a 109M-parameter model from scratch and scaling up to Qwen3-0.6B (596M parameters) with grouped-query attention. The work required three iterative pipeline refinements to overcome fundamental constraints, culminating in a dynamic weights approach that eliminated costly kernel recompilation overhead and achieved practical training efficiency on edge hardware.
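Grouped-query attention, used by Qwen3-0.6B, lets several query heads share one key/value head, shrinking the KV tensors relative to standard multi-head attention. A minimal NumPy sketch of the idea (toy shapes, not the model's actual configuration):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)               # broadcast KV heads to groups
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # 2 shared KV heads
v = rng.normal(size=(2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
```

Here 8 query heads share 2 KV heads, so the KV tensors are a quarter the size they would be with one KV head per query head.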
The research maps the ANE's capabilities and limitations for training workloads. The hardware executes only static computation graphs, with no branching or runtime indexing, so weight-gradient updates and Adam optimizer steps must be offloaded to the CPU. Initial attempts that embedded weights as compile-time constants required recompilation after every weight update, causing memory leaks and process crashes after roughly 119 compiles. The final pipeline instead passes weights as input data through IOSurface spatial dimensions, so the kernel is compiled once and weights are updated dynamically, eliminating the recompilation bottleneck.
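The difference between the two pipelines can be modeled with a toy "compiler" in pure Python. Both `compile_with_constant_weights` and `compile_with_dynamic_weights` are made-up stand-ins for the real kernel-compilation step, and the IOSurface packing is abstracted away to a plain second argument:

```python
import numpy as np

COMPILE_COUNT = 0

def compile_with_constant_weights(w):
    """Early pipeline: weights baked in as compile-time constants.
    Every weight update invalidates the kernel and forces a recompile
    (the expensive step that leaked memory in the real system)."""
    global COMPILE_COUNT
    COMPILE_COUNT += 1
    w_frozen = w.copy()
    return lambda x: x @ w_frozen

def compile_with_dynamic_weights():
    """Final pipeline: weights arrive as a second input tensor (in the
    real system, packed into IOSurface spatial dimensions). Compile once."""
    global COMPILE_COUNT
    COMPILE_COUNT += 1
    return lambda x, w: x @ w

# Constant-weight pipeline: one compile per training step.
w = np.eye(2)
for _ in range(3):
    kernel = compile_with_constant_weights(w)
    _ = kernel(np.ones((1, 2)))
    w = w + 0.1                       # weight update -> kernel is stale
static_compiles = COMPILE_COUNT       # one compile per step

# Dynamic-weight pipeline: a single compile serves every step.
COMPILE_COUNT = 0
kernel = compile_with_dynamic_weights()
w = np.eye(2)
for _ in range(3):
    _ = kernel(np.ones((1, 2)), w)    # fresh weights, same kernel
    w = w + 0.1
dynamic_compiles = COMPILE_COUNT      # one compile total
```

Over a full training run the constant-weight approach compiles once per optimizer step, while the dynamic-weight approach compiles exactly once, which is why it removed the crash-inducing recompilation overhead.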
More broadly, the work demonstrates the potential of edge AI training, achieved by reverse-engineering Apple's proprietary neural hardware through low-level API access and iterative optimization.
Editorial Opinion
This technical achievement is a significant validation of edge-based AI training, though it also highlights the fundamental tension between inference-optimized hardware and the flexibility that training workloads demand. While the M4 ANE's 19 TFLOPS FP16 performance and 6.6 TFLOPS/W efficiency are impressive, the fact that three major architectural revisions were needed shows that supporting learning on inference hardware requires substantial creative engineering. The research opens intriguing possibilities for on-device training and fine-tuning, though practical applications will likely remain limited to smaller models and lighter training scenarios compared to server-grade hardware.