BotBeat
RESEARCH | Apple | 2026-03-22

Researchers Successfully Train Transformer Models on Apple's M4 Neural Engine, Overcoming Hardware Inference Limitations

Key Takeaways

  • Full transformer training (forward pass, backward pass, and gradient updates) is viable on Apple's M4 Neural Engine (ANE), even though the hardware is optimized for inference
  • The ANE executes static computation graphs, which forces workarounds: the CPU handles branching and indexing operations, weight gradients, and Adam updates, while weights passed as input data eliminate recompilation overhead
  • The final dynamic-weights pipeline reached practical training performance by changing how weights are supplied to the kernel: they are streamed in as input data instead of being embedded as compile-time constants
Source: Hacker News, https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-c8b

Summary

Researchers have demonstrated full transformer model training, including the forward pass, backward pass, and gradient computation, directly on Apple's M4 Neural Engine (ANE), hardware originally designed for inference only. They trained a 109M-parameter model from scratch and scaled up to Qwen3-0.6B (596M parameters), which uses grouped-query attention. The work went through three pipeline iterations to overcome the hardware's constraints, culminating in a dynamic-weights approach that eliminates costly kernel recompilation and makes training practical on edge hardware.
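For readers unfamiliar with grouped-query attention, the idea is that several query heads share one key/value head. The sketch below shows only the head-to-group mapping, using the common contiguous grouping. The function names and the head counts (8 query heads, 2 key/value heads) are illustrative assumptions, not the configuration of Qwen3-0.6B or code from the research.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return the key/value head that a given query head reads from.

    Assumes contiguous grouping: consecutive query heads share a KV head.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def head_groups(num_query_heads, num_kv_heads):
    """Map each KV head index to the list of query heads that share it."""
    groups = {kv: [] for kv in range(num_kv_heads)}
    for q in range(num_query_heads):
        groups[kv_head_for(q, num_query_heads, num_kv_heads)].append(q)
    return groups


if __name__ == "__main__":
    # Hypothetical head counts chosen for readability.
    print(head_groups(8, 2))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

With equal head counts this reduces to standard multi-head attention, and with a single key/value head it becomes multi-query attention.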

The research reveals the ANE's capabilities and limitations for training workloads. The hardware executes static computation graphs without branching or runtime indexing, so weight gradients and Adam optimizer steps are offloaded to the CPU. Initial attempts embedded weights as compile-time constants, which forced a recompilation after every weight update and led to memory leaks and process crashes after about 119 compiles. The final pipeline passes weights as input data through IOSurface spatial dimensions, so the kernel is compiled once and the weights can change between steps without recompiling.
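The difference between the two weight-handling strategies can be sketched with a toy model in plain Python. Nothing here touches the ANE or any Apple API: `ToyCompiler` is a hypothetical stand-in that only counts compilations, the one-neuron model and the learning rate are made up for illustration, and `adam_step` is a textbook Adam update standing in for the CPU-side optimizer step.

```python
import math


class ToyCompiler:
    """Stand-in for an expensive graph compiler; it only counts invocations."""

    def __init__(self):
        self.compiles = 0

    def compile_constant(self, weights):
        """Bake a snapshot of the weights into the kernel (constant approach)."""
        self.compiles += 1
        frozen = tuple(weights)
        return lambda x: sum(w * xi for w, xi in zip(frozen, x))

    def compile_dynamic(self):
        """Compile once; the weights arrive as an ordinary input."""
        self.compiles += 1
        return lambda x, weights: sum(w * xi for w, xi in zip(weights, x))


def adam_step(weights, grads, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on the host; m and v are updated in place."""
    updated = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - b2 ** t)  # bias-corrected second moment
        updated.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


def train(steps, dynamic):
    """Fit y = w . x to a single target; return (compile count, weights)."""
    compiler = ToyCompiler()
    x, target = [1.0, 2.0], 3.0
    weights, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    kernel = compiler.compile_dynamic() if dynamic else None
    for t in range(1, steps + 1):
        if dynamic:
            y = kernel(x, weights)  # same kernel, new weights as data
        else:
            y = compiler.compile_constant(weights)(x)  # recompile every step
        grads = [(y - target) * xi for xi in x]  # d/dw of 0.5 * (y - target)^2
        weights = adam_step(weights, grads, m, v, t)
    return compiler.compiles, weights


if __name__ == "__main__":
    print(train(20, dynamic=False)[0])  # 20 compiles, one per step
    print(train(20, dynamic=True)[0])   # 1 compile for the whole run
```

Both variants produce identical weights. Only the number of compilations differs, which is the cost the dynamic-weights pipeline is described as removing.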

  • The research demonstrates the potential of edge AI and shows that Apple's proprietary neural hardware can be reverse-engineered through low-level API access and iterative optimization

Editorial Opinion

This work is a meaningful validation of on-device AI training, though it also highlights the tension between inference-optimized hardware and the flexibility that training workloads require. The M4 ANE's 19 TFLOPS of FP16 throughput and 6.6 TFLOPS/W efficiency are impressive, but the fact that the researchers needed three major revisions of their pipeline to make training work shows how much engineering inference hardware demands before it can support learning tasks. The research opens intriguing possibilities for on-device training and fine-tuning, though practical applications will likely remain limited to smaller models and lighter training scenarios than server-grade hardware can handle.
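As a rough sanity check on those two figures, dividing throughput by efficiency gives an implied power draw. This is only back-of-the-envelope arithmetic, and it assumes both numbers describe the same operating point, which the article does not state. The function name is hypothetical.

```python
def implied_power_watts(tflops, tflops_per_watt):
    """Power implied by a throughput figure and an efficiency figure."""
    return tflops / tflops_per_watt


if __name__ == "__main__":
    # 19 TFLOPS and 6.6 TFLOPS/W are the figures quoted above.
    print(round(implied_power_watts(19.0, 6.6), 2))  # 2.88
```

Under that assumption, the quoted figures correspond to roughly 3 W.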

Tags: Machine Learning, Deep Learning, AI Hardware
