Llama.cpp b9180 Adds MTP Support with Advanced Speculative Decoding Optimization
Key Takeaways
- MTP support enables efficient partial rollback for speculative decoding, eliminating wasteful full-context restarts
- Multi-backend optimization across Metal, Vulkan, CUDA, ROCm, OpenVINO, and SYCL ensures broad hardware support
- Intermediate state storage for Gated Delta Networks (GDN) enables selective token rollback up to a configurable limit
Summary
The open-source llama.cpp project has merged Multi-Token Prediction (MTP) support in release b9180, adding speculative decoding improvements for model inference. The update includes partial rollback for Gated Delta Networks (GDN): intermediate states are checkpointed while draft tokens are processed, so that when draft tokens are rejected the state can be rolled back selectively, up to draft_max tokens, instead of restarting from the full context. This reduces wasted computation in speculative decoding workflows. The implementation spans multiple backends, including Metal (macOS on Apple Silicon, with KleidiAI optimization), Vulkan, CUDA (versions 12 and 13), ROCm, OpenVINO, and SYCL. Compatibility has also been verified with n-gram and other speculative decoding methods, which keeps deployment flexible.
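The checkpoint-and-rollback idea can be illustrated with a small sketch. It is not llama.cpp's implementation or API. It assumes that a GDN layer carries a running state that cannot simply be truncated per token, so a snapshot is stored after each draft token. The names ToyRecurrentState, DraftCheckpointer, count_accepted, and replay are hypothetical; only draft_max comes from the summary above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRecurrentState:
    """Toy stand-in for a recurrent layer state (hypothetical, not llama.cpp's layout)."""
    value: int = 0

    def step(self, token: int) -> None:
        # Any order-dependent update works for illustration.
        self.value = (self.value * 31 + token) % 1_000_003


@dataclass
class DraftCheckpointer:
    """Keeps one snapshot per draft token so rejected tokens can be undone."""
    draft_max: int
    state: ToyRecurrentState = field(default_factory=ToyRecurrentState)
    snapshots: List[int] = field(default_factory=list)

    def begin_draft(self) -> None:
        # snapshots[0] is the state before any draft token was applied.
        self.snapshots = [self.state.value]

    def push_draft_token(self, token: int) -> None:
        if not self.snapshots:
            raise RuntimeError("call begin_draft() first")
        if len(self.snapshots) > self.draft_max:
            raise ValueError("draft longer than draft_max")
        self.state.step(token)
        self.snapshots.append(self.state.value)

    def rollback(self, n_accepted: int) -> None:
        # Restore the state as it was after the last accepted draft token.
        self.state.value = self.snapshots[n_accepted]
        self.snapshots = []


def count_accepted(draft: List[int], verified: List[int]) -> int:
    """Length of the longest prefix on which draft and verifier agree."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


def replay(tokens: List[int]) -> int:
    """Full-context restart: recompute the state from scratch."""
    s = ToyRecurrentState()
    for t in tokens:
        s.step(t)
    return s.value


# Demo: prompt [1, 2, 3], four draft tokens, verifier rejects the third.
cp = DraftCheckpointer(draft_max=4)
for t in [1, 2, 3]:
    cp.state.step(t)
cp.begin_draft()
draft = [4, 5, 6, 7]
for t in draft:
    cp.push_draft_token(t)
n_ok = count_accepted(draft, [4, 5, 9, 7])
cp.rollback(n_ok)
```

After the rollback, the state matches what replay([1, 2, 3, 4, 5]) would compute, but without reprocessing the prompt. The trade-off in this sketch is memory: one stored snapshot per draft token, which is why a bound such as draft_max matters.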



