llama.cpp Achieves 1.5-2.4× Speedups with Multi-Token Prediction on Consumer Hardware
Key Takeaways
- ▸llama.cpp now supports Multi-Token Prediction (MTP) speculative decoding, enabled via --spec-type draft-mtp --spec-draft-n-max N, with benchmarked speedups of 1.54× to 2.44× on consumer hardware
- ▸The implementation adds little VRAM overhead (a fraction of a gigabyte) because the draft head shares the main model's embeddings, KV cache, and tokenizer, a significant advantage over traditional speculative decoding with a separate draft model
- ▸Gains vary by hardware: the memory-constrained Strix Halo saw the largest relative improvements (1.81× to 2.44×), while an RTX 3090 at its full 450W power budget saw a smaller 1.54× gain
Summary
llama.cpp merged PR #22673 on May 16, introducing first-class Multi-Token Prediction (MTP) speculative decoding. Models with an MTP head can now draft several tokens and have the main model verify them in a single forward pass, instead of generating one token per pass. Independent benchmarking shows substantial speedups on consumer hardware: on Strix Halo, Qwen3.6 27B ran 1.81× faster with Q4_K_M quantization (11.7→21.2 tok/s) and 2.44× faster with Q8_0 (7.4→18.1 tok/s). An RTX 3090 at its full 450W power budget showed a smaller but still significant 1.54× gain (38.7→59.5 tok/s). In these results the size of the gain tracks how memory-constrained the system is rather than how much power it has available.
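The speedup factors follow directly from the quoted throughput pairs. The short Python sketch below recomputes them; the dictionary labels are just illustrative names, and because the published tok/s values are already rounded, the recomputed Q8_0 ratio comes out nearer 2.45 than the reported 2.44.

```python
# Throughput figures (tokens/second) quoted in the summary: (baseline, with MTP).
benchmarks = {
    "Strix Halo, Q4_K_M": (11.7, 21.2),
    "Strix Halo, Q8_0": (7.4, 18.1),
    "RTX 3090, 450W": (38.7, 59.5),
}


def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP throughput to baseline throughput."""
    return mtp_tps / baseline_tps


for name, (base, mtp) in benchmarks.items():
    print(f"{name}: {speedup(base, mtp):.2f}x")
```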
The implementation needs only a fraction of a gigabyte of additional VRAM. Instead of loading a separate draft model, the main model uses a small draft head that shares its embeddings, KV cache, and tokenizer, which removes the traditional memory cost of speculative decoding. Output quality remains identical to baseline: the verification step only accepts tokens the main model would have generated anyway, so output is bit-identical at temperature 0 and statistically equivalent at higher temperatures. The feature is enabled via --spec-type draft-mtp --spec-draft-n-max N, where N sets the maximum draft length; at N=3, approximately 75% of drafted tokens are accepted on Qwen3.6 27B.
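To get a feel for what a 75% acceptance rate at N=3 could mean, here is a rough back-of-the-envelope model in Python. It assumes, purely for illustration, that each drafted token is accepted independently with the same probability and that drafting stops at the first rejection; real acceptance is position- and content-dependent, and the estimate ignores the cost of the draft head and the wider verification batch, so it is an upper bound on throughput gain rather than a prediction.

```python
def expected_tokens_per_pass(acceptance: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Illustrative assumption: every drafted token is accepted independently
    with probability `acceptance`, and the first rejection ends the draft.
    The main model always contributes one token itself (the correction at
    the first mismatch, or a bonus token if every draft is accepted), which
    is the k = 0 term of the sum.
    """
    return sum(acceptance ** k for k in range(n_draft + 1))


# Figures from the article: roughly 75% acceptance at N=3.
print(expected_tokens_per_pass(0.75, 3))
```

Under these assumptions the model emits somewhat under three tokens per main-model pass, which is in the same range as the measured 1.8× to 2.4× gains once per-pass overhead is taken into account.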
- ▸Output quality is preserved: speculative decoding only accepts tokens the main model would have generated itself, so accuracy is unchanged while wall-clock time drops
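The reason output is unchanged at temperature 0 can be shown with a toy greedy verifier. This Python sketch is not llama.cpp's implementation: `verify_greedy`, `target_next`, and `toy_target` are hypothetical names, and the main model is replaced by a trivial function. A real implementation scores all draft positions in one batched forward pass, whereas the loop below queries the stand-in model once per position for clarity.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: List[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted for one draft under greedy verification.

    `target_next` is a hypothetical stand-in for the main model's argmax
    next-token choice given the tokens so far.
    """
    emitted: List[int] = []
    for token in draft:
        expected = target_next(context + emitted)
        if token != expected:
            # First mismatch: keep the main model's token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(token)
    # Every drafted token matched: the main model adds one bonus token.
    emitted.append(target_next(context + emitted))
    return emitted


def toy_target(tokens: Sequence[int]) -> int:
    """Toy main model: the next token is always the last token plus one."""
    return tokens[-1] + 1


# Draft [3, 4, 9]: the first two tokens match, the third is corrected to 5.
print(verify_greedy([1, 2], [3, 4, 9], toy_target))
```

Every emitted token is either a draft token that equals the main model's own choice or the main model's choice itself, so the sequence matches what plain greedy decoding would produce; the draft only changes how many tokens come out per pass.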
Editorial Opinion
Multi-Token Prediction makes speculative decoding genuinely accessible by removing the VRAM tax that has historically kept this kind of inference acceleration out of reach on consumer GPUs. Conventional speculative decoding requires a separate draft model, whose weights and KV cache take up memory that consumer hardware can rarely spare. llama.cpp's integrated approach delivers measured speedups of roughly 1.5× to 2.4× with negligible overhead, and the largest gains land on the systems that need them most. The hardware-dependent results suggest MTP will become standard practice for memory-bottlenecked inference pipelines.



