BotBeat

Anthropic · RESEARCH · 2026-03-19

Researcher Successfully Runs 397B-Parameter Qwen Model on MacBook Pro Using Apple's 'LLM in a Flash' Technique

Key Takeaways

  • A 397B-parameter LLM can run on a MacBook Pro with just 48GB of RAM using flash-memory-based inference optimization, generating 5.5+ tokens/second
  • Apple's 2023 "LLM in a Flash" paper provides the theoretical foundation: model weights are streamed from SSD to DRAM on demand rather than held in memory
  • MoE architectures are particularly amenable to this approach, since only a subset of expert weights is active per token; combined with 2-bit quantization, this yields dramatic memory savings
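The takeaways above can be sanity-checked with simple arithmetic. A quick sketch, using only the parameter counts implied by the model name (Qwen3.5-397B-A17B: 397B total, roughly 17B active per token) and the 2-bit figure from the article; the note about higher-precision tensors is an inference, not a claim from the source:

```python
# Back-of-the-envelope check on the article's numbers.
TOTAL_PARAMS = 397e9         # total parameters (from the model name)
ACTIVE_PARAMS = 17e9         # ~17B active per token (the "A17B" suffix)
BITS_PER_WEIGHT = 2          # 2-bit quantized experts

total_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9    # bits -> bytes -> GB
active_gb = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# ~99 GB at a flat 2 bits/weight; the article's ~120 GB on-disk figure
# suggests some tensors (e.g. embeddings) are kept at higher precision.
print(f"full model at 2-bit:  ~{total_gb:.0f} GB on SSD")
print(f"active weights/token: ~{active_gb:.2f} GB touched per token")
```

The gap between the ~4 GB touched per token and the ~5.5 GB kept resident is consistent with the article's description of pinning embeddings and routing matrices in RAM while everything else streams from flash.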
Source: Hacker News — https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

Summary

A researcher has demonstrated the viability of running Alibaba's massive Qwen3.5-397B-A17B model locally on a 48GB MacBook Pro M3 Max at 5.5+ tokens/second by implementing Apple's "LLM in a Flash" inference optimization technique. The breakthrough leverages Qwen's Mixture-of-Experts (MoE) architecture, which allows expert weights to be streamed from SSD into RAM on-demand rather than requiring all 120GB of quantized model weights to fit in memory simultaneously. The researcher used Claude to conduct 90 automated experiments to optimize MLX Objective-C and Metal code for maximum efficiency, ultimately keeping only 5.5GB of critical parameters (embeddings and routing matrices) in resident memory while streaming 2-bit quantized experts from flash storage.
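The core pattern described above is keeping a small resident set (embeddings, routing matrices) in RAM while fetching expert weights from flash only when the router selects them. A minimal sketch of that caching pattern, assuming a hypothetical one-file-per-expert layout and a plain LRU eviction policy; the actual implementation works in MLX, Objective-C, and Metal, not Python:

```python
import os
from collections import OrderedDict

class ExpertCache:
    """Toy sketch of on-demand expert streaming: hot expert weight blobs
    stay in an LRU cache bounded by a RAM budget; misses are read from
    flash. File layout and names are illustrative, not from the paper."""

    def __init__(self, weights_dir, budget_bytes):
        self.weights_dir = weights_dir
        self.budget = budget_bytes
        self.cache = OrderedDict()   # expert_id -> raw weight bytes
        self.used = 0

    def get(self, expert_id):
        if expert_id in self.cache:              # hit: mark most-recently-used
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        path = os.path.join(self.weights_dir, f"expert_{expert_id}.bin")
        with open(path, "rb") as f:              # miss: one SSD read
            blob = f.read()
        while self.used + len(blob) > self.budget and self.cache:
            _, evicted = self.cache.popitem(last=False)  # evict LRU entry
            self.used -= len(evicted)
        self.cache[expert_id] = blob
        self.used += len(blob)
        return blob
```

Per token, the router's top-k expert IDs would each be looked up through `get()`: hits are free, misses cost an SSD read. That is why cutting the active experts per token (from 10 to 4, per the article) translates so directly into throughput.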

The implementation reduces the number of active experts per token from the typical 10 to just 4, with the researcher noting that output quality at 2-bit quantization proved "indistinguishable from 4-bit" according to Claude's evaluations. The complete code and a paper documenting the experiment have been published open-source, making this optimization technique accessible to others seeking to run large language models on resource-constrained devices. This work demonstrates the practical feasibility of sophisticated inference optimization strategies that challenge the assumption that massive models require proportionally massive hardware resources.
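The 2-bit storage that makes 397B weights fit on an SSD can be illustrated with a toy uniform quantizer that packs four codes per byte. This is a sketch of the general idea only; the article does not specify which quantization scheme the experiment actually used:

```python
def quantize_2bit(values, lo, hi):
    """Map floats in [lo, hi] to 2-bit codes {0..3} and pack four codes
    per byte. Uniform quantization is an illustrative choice."""
    scale = (hi - lo) / 3                       # 4 levels -> 3 steps
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)                # 4 weights per byte
        packed.append(byte)
    return bytes(packed), scale

def dequantize_2bit(packed, n, lo, scale):
    """Recover n approximate values from the packed 2-bit codes."""
    out = []
    for b in packed:
        for j in range(4):
            out.append(lo + ((b >> (2 * j)) & 0b11) * scale)
    return out[:n]
```

At 2 bits a weight can only take four values, which is why the quality claim in the article ("indistinguishable from 4-bit") is surprising and, as the editorial below notes, deserves benchmarking beyond model-based evaluation.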

Editorial Opinion

This demonstration is an impressive engineering achievement that shows how principled optimization of memory access patterns can overcome fundamental hardware constraints. However, the evaluation methodology described appears cursory—claims about quality equivalence between 2-bit and 4-bit quantization would benefit from more rigorous benchmarking against standard datasets. If these results hold up under scrutiny, they suggest a path toward making cutting-edge large language models accessible on consumer hardware, which could democratize AI capabilities but would also require careful consideration of deployment security and responsible use.

Large Language Models (LLMs) · Generative AI · Machine Learning · MLOps & Infrastructure · AI Hardware

More from Anthropic

Anthropic · RESEARCH

Inside Claude Code's Dynamic System Prompt Architecture: Anthropic's Complex Context Engineering Revealed

2026-04-05
Anthropic · POLICY & REGULATION

Anthropic Explores AI's Role in Autonomous Weapons Policy with Pentagon Discussion

2026-04-05
Anthropic · POLICY & REGULATION

Security Researcher Exposes Critical Infrastructure After Following Claude's Configuration Advice Without Authentication

2026-04-05

Suggested

Google / Alphabet · RESEARCH

Deep Dive: Optimizing Sharded Matrix Multiplication on TPU with Pallas

2026-04-05
GitHub · PRODUCT LAUNCH

GitHub Launches Squad: Open Source Multi-Agent AI Framework to Simplify Complex Workflows

2026-04-05
© 2026 BotBeat