BotBeat

Apple · RESEARCH · 2026-03-22

Researchers Successfully Train Transformer Models on Apple's M4 Neural Engine, Overcoming Hardware Inference Limitations

Key Takeaways

  • Full transformer training (forward pass, backward pass, gradient updates) is viable on Apple's M4 Neural Engine despite the hardware being optimized for inference
  • The ANE's static computation graph architecture requires creative workarounds: the CPU handles branching/indexing operations, weight gradient updates, and Adam optimizer steps, while weights passed as data eliminate recompilation overhead
  • The final dynamic-weights pipeline achieved practical training performance by fundamentally changing how weights are supplied to the kernel: treating them as streaming input data rather than embedding them as compile-time constants
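The CPU/ANE division of labor named in the takeaways can be sketched in plain NumPy. This is not the researchers' code (the real kernels run on the ANE through low-level APIs); it is a minimal illustration, with all function names hypothetical, of a static compute kernel doing the matmul-heavy forward/backward work while the CPU applies the stateful Adam update the static graph cannot express:

```python
import numpy as np

def ane_like_forward_backward(w, x, y):
    """Stand-in for a static-graph kernel: given weights as input
    data, return loss and weight gradient for a linear layer.
    (Illustrative only; on the M4 this work would run on the ANE.)"""
    pred = x @ w                      # forward pass
    err = pred - y
    loss = 0.5 * float(np.mean(err ** 2))
    grad = x.T @ err / len(x)         # backward pass
    return loss, grad

def adam_step_cpu(w, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam optimizer update, kept on the CPU because it needs
    per-step state and indexing that a static graph cannot express."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Tiny synthetic regression problem to drive the loop.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
true_w = rng.normal(size=(8, 1))
y = x @ true_w

w = np.zeros((8, 1))
m, v = np.zeros_like(w), np.zeros_like(w)
losses = []
for t in range(1, 201):
    loss, grad = ane_like_forward_backward(w, x, y)  # "accelerator" step
    losses.append(loss)
    w, m, v = adam_step_cpu(w, grad, m, v, t)        # CPU optimizer step
```

The key point of the split is that the kernel is a pure function of its inputs (weights included), so all mutable optimizer state lives on the CPU side.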
Source: Hacker News (https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-c8b)

Summary

Researchers have demonstrated full transformer model training—including forward pass, backward pass, and gradient computation—directly on Apple's M4 Neural Engine, hardware originally designed for inference only. The breakthrough involved training a 109M-parameter model from scratch and scaling up to Qwen3-0.6B (596M parameters) with grouped-query attention. The work required three iterative pipeline refinements to overcome fundamental constraints, culminating in a dynamic weights approach that eliminated costly kernel recompilation overhead and achieved practical training efficiency on edge hardware.

The research reveals the ANE's capabilities and limitations for training workloads. The hardware executes static computation graphs with no branching or runtime indexing, so weight gradient updates and Adam optimizer steps must be offloaded to the CPU. Initial attempts that embedded weights as compile-time constants required recompilation after every weight update, causing memory leaks and process crashes after roughly 119 compiles. The final pipeline instead passes weights as input data through IOSurface spatial dimensions, enabling a single compilation with dynamic weight updates and eliminating the recompilation bottleneck.
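The recompilation problem described above can be illustrated with a toy sketch. Everything here is hypothetical (the real compilation targets the ANE, not Python closures), but it shows why baking weights in as constants forces one compile per training step, while streaming them as input data needs only one compile total:

```python
# Hypothetical sketch contrasting the two pipelines described above.
compile_count = 0

def compile_kernel(const_weights=None):
    """Stand-in for graph compilation. When weights are baked in as
    constants, every new weight value forces a fresh compile."""
    global compile_count
    compile_count += 1
    if const_weights is not None:
        return lambda x: x * const_weights   # weights frozen into the graph
    return lambda x, w: x * w                # weights arrive as input data

# Pipeline A: weights as compile-time constants -> recompile every step.
w = 1.0
for step in range(5):
    kernel = compile_kernel(const_weights=w)  # recompilation on each update
    _ = kernel(2.0)
    w -= 0.1
constant_pipeline_compiles = compile_count    # 5 compiles for 5 steps

# Pipeline B: weights streamed as input data -> compile once.
compile_count = 0
kernel = compile_kernel()                     # single compilation
w = 1.0
for step in range(5):
    _ = kernel(2.0, w)                        # new weights are just data
    w -= 0.1
dynamic_pipeline_compiles = compile_count     # 1 compile for 5 steps
```

In the article's setting, pipeline A is what leaked memory and crashed after ~119 compiles; pipeline B is the dynamic-weights approach that made training practical.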

  • The work also demonstrates the potential of edge AI, and of reverse-engineering Apple's proprietary neural hardware through low-level API access and iterative optimization

Editorial Opinion

This technical achievement represents a significant validation of edge-based AI training potential, though it also highlights the fundamental tension between inference-optimized hardware and the flexibility required for training workloads. While the M4 ANE's 19 TFLOPS FP16 performance and 6.6 TFLOPS/W efficiency are impressive, the researchers' need for three major architectural revisions to achieve training demonstrates that inference hardware requires substantial creative engineering to support learning tasks. The research opens intriguing possibilities for on-device training and fine-tuning, though practical applications will likely remain limited to smaller models and lighter training scenarios compared to server-grade hardware.

Machine Learning · Deep Learning · AI Hardware

