BotBeat

Anthropic · RESEARCH · 2026-03-19

Researcher Successfully Runs 397B-Parameter Qwen Model on MacBook Pro Using Apple's 'LLM in a Flash' Technique

Key Takeaways

  • A 397B-parameter LLM can run on a MacBook Pro with just 48GB of RAM using flash-memory-based inference optimization, generating 5.5+ tokens/second
  • Apple's 2023 "LLM in a Flash" paper provides the theoretical foundation: model weights are streamed from SSD to DRAM on demand rather than held in memory
  • MoE architectures are particularly amenable to this approach, since only a subset of expert weights is active per token; combined with 2-bit quantization, this yields dramatic memory savings
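The takeaways above can be sanity-checked with simple arithmetic. A quick sketch, using only the parameter counts implied by the model name (Qwen3.5-397B-A17B: 397B total, roughly 17B active per token) and the 2-bit figure from the article; the note about higher-precision tensors is an inference, not a claim from the source:

```python
# Back-of-the-envelope check on the article's numbers.
TOTAL_PARAMS = 397e9         # total parameters (from the model name)
ACTIVE_PARAMS = 17e9         # ~17B active per token (the "A17B" suffix)
BITS_PER_WEIGHT = 2          # 2-bit quantized experts

total_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9    # bits -> bytes -> GB
active_gb = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# ~99 GB at a flat 2 bits/weight; the article's ~120 GB on-disk figure
# suggests some tensors (e.g. embeddings) are kept at higher precision.
print(f"full model at 2-bit:  ~{total_gb:.0f} GB on SSD")
print(f"active weights/token: ~{active_gb:.2f} GB touched per token")
```

The gap between the ~4 GB touched per token and the ~5.5 GB kept resident is consistent with the article's description of pinning embeddings and routing matrices in RAM while everything else streams from flash.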
Source: Hacker News — https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

Summary

A researcher has demonstrated the viability of running Alibaba's massive Qwen3.5-397B-A17B model locally on a 48GB MacBook Pro M3 Max at 5.5+ tokens/second by implementing Apple's "LLM in a Flash" inference optimization technique. The breakthrough leverages Qwen's Mixture-of-Experts (MoE) architecture, which allows expert weights to be streamed from SSD into RAM on-demand rather than requiring all 120GB of quantized model weights to fit in memory simultaneously. The researcher used Claude to conduct 90 automated experiments to optimize MLX Objective-C and Metal code for maximum efficiency, ultimately keeping only 5.5GB of critical parameters (embeddings and routing matrices) in resident memory while streaming 2-bit quantized experts from flash storage.
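The core pattern described above is keeping a small resident set (embeddings, routing matrices) in RAM while fetching expert weights from flash only when the router selects them. A minimal sketch of that caching pattern, assuming a hypothetical one-file-per-expert layout and a plain LRU eviction policy; the actual implementation works in MLX, Objective-C, and Metal, not Python:

```python
import os
from collections import OrderedDict

class ExpertCache:
    """Toy sketch of on-demand expert streaming: hot expert weight blobs
    stay in an LRU cache bounded by a RAM budget; misses are read from
    flash. File layout and names are illustrative, not from the paper."""

    def __init__(self, weights_dir, budget_bytes):
        self.weights_dir = weights_dir
        self.budget = budget_bytes
        self.cache = OrderedDict()   # expert_id -> raw weight bytes
        self.used = 0

    def get(self, expert_id):
        if expert_id in self.cache:              # hit: mark most-recently-used
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        path = os.path.join(self.weights_dir, f"expert_{expert_id}.bin")
        with open(path, "rb") as f:              # miss: one SSD read
            blob = f.read()
        while self.used + len(blob) > self.budget and self.cache:
            _, evicted = self.cache.popitem(last=False)  # evict LRU entry
            self.used -= len(evicted)
        self.cache[expert_id] = blob
        self.used += len(blob)
        return blob
```

Per token, the router's top-k expert IDs would each be looked up through `get()`: hits are free, misses cost an SSD read. That is why cutting the active experts per token (from 10 to 4, per the article) translates so directly into throughput.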

The implementation reduces the number of active experts per token from the typical 10 to just 4, with the researcher noting that output quality at 2-bit quantization proved "indistinguishable from 4-bit" according to Claude's evaluations. The complete code and a paper documenting the experiment have been published open-source, making this optimization technique accessible to others seeking to run large language models on resource-constrained devices. This work demonstrates the practical feasibility of sophisticated inference optimization strategies that challenge the assumption that massive models require proportionally massive hardware resources.
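The 2-bit storage that makes 397B weights fit on an SSD can be illustrated with a toy uniform quantizer that packs four codes per byte. This is a sketch of the general idea only; the article does not specify which quantization scheme the experiment actually used:

```python
def quantize_2bit(values, lo, hi):
    """Map floats in [lo, hi] to 2-bit codes {0..3} and pack four codes
    per byte. Uniform quantization is an illustrative choice."""
    scale = (hi - lo) / 3                       # 4 levels -> 3 steps
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)                # 4 weights per byte
        packed.append(byte)
    return bytes(packed), scale

def dequantize_2bit(packed, n, lo, scale):
    """Recover n approximate values from the packed 2-bit codes."""
    out = []
    for b in packed:
        for j in range(4):
            out.append(lo + ((b >> (2 * j)) & 0b11) * scale)
    return out[:n]
```

At 2 bits a weight can only take four values, which is why the quality claim in the article ("indistinguishable from 4-bit") is surprising and, as the editorial below notes, deserves benchmarking beyond model-based evaluation.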

Editorial Opinion

This demonstration is an impressive engineering achievement that shows how principled optimization of memory access patterns can overcome fundamental hardware constraints. However, the evaluation methodology described appears cursory—claims about quality equivalence between 2-bit and 4-bit quantization would benefit from more rigorous benchmarking against standard datasets. If these results hold up under scrutiny, they suggest a path toward making cutting-edge large language models accessible on consumer hardware, which could democratize AI capabilities but would also require careful consideration of deployment security and responsible use.

Large Language Models (LLMs) · Generative AI · Machine Learning · MLOps & Infrastructure · AI Hardware

More from Anthropic

Anthropic · RESEARCH

Inside Claude Code's Dynamic System Prompt Architecture: Anthropic's Complex Context Engineering Revealed

2026-04-05
Anthropic · POLICY & REGULATION

Anthropic Explores AI's Role in Autonomous Weapons Policy with Pentagon Discussion

2026-04-05
Anthropic · POLICY & REGULATION

Security Researcher Exposes Critical Infrastructure After Following Claude's Configuration Advice Without Authentication

2026-04-05

Suggested

Google / Alphabet · RESEARCH

Deep Dive: Optimizing Sharded Matrix Multiplication on TPU with Pallas

2026-04-05
GitHub · PRODUCT LAUNCH

GitHub Launches Squad: Open Source Multi-Agent AI Framework to Simplify Complex Workflows

2026-04-05
© 2026 BotBeat