OMLX v0.3.9 Stable Release Introduces Native Multi-Token Prediction and Major Stability Improvements
Key Takeaways
- Native Multi-Token Prediction (MTP) lets supported Qwen, Gemma, and DeepSeek models predict several tokens at once, speeding up inference
- Full DeepSeek V4 Pro/Flash support with integrated tool calling, enabling integration with Claude Code and other AI agents
- Major stability improvements for low-memory systems through a memory enforcer, cache optimization, and jetsam prevention
Summary
OMLX v0.3.9 has reached stable release, introducing native Multi-Token Prediction (MTP) support for popular open-source language models including Qwen 3.5/3.6, Gemma 4, and DeepSeek V4. MTP is enabled per model in the admin settings and lets supported models predict multiple tokens simultaneously, which speeds up decoding. For Gemma 4, MTP support extends to the vision path, so requests that combine images and text are also processed noticeably faster.
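
To make the idea of multi-token prediction concrete, here is a minimal sketch in Python. It is not OMLX code, and the source does not describe how OMLX implements MTP; the sketch assumes a generic draft-and-verify scheme in which extra prediction heads guess several tokens and the main model confirms them. The functions `target_next`, `draft_next_k`, `mtp_step`, and `generate` are hypothetical names over a toy integer vocabulary.

```python
from typing import List, Tuple


def target_next(tokens: List[int]) -> int:
    """Toy stand-in for the main model: the next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


def draft_next_k(tokens: List[int], k: int) -> List[int]:
    """Toy stand-in for MTP heads: guess k tokens ahead.

    The guess is deliberately wrong whenever it would be 5, so that the
    verification step below has something to reject.
    """
    guesses = []
    last = tokens[-1]
    for _ in range(k):
        guess = (last + 1) % 10
        if guess == 5:
            guess = 0  # injected mistake (assumption, for illustration only)
        guesses.append(guess)
        last = guess
    return guesses


def mtp_step(tokens: List[int], k: int) -> List[int]:
    """Accept the longest draft prefix the main model agrees with, plus one more token."""
    accepted: List[int] = []
    context = list(tokens)
    # A real engine would typically verify all drafts in one batched forward
    # pass; this loop checks them one by one for clarity.
    for tok in draft_next_k(tokens, k):
        if target_next(context) != tok:
            break
        accepted.append(tok)
        context.append(tok)
    # The main model always contributes one token, so every step makes progress.
    accepted.append(target_next(context))
    return accepted


def generate(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Generate n_new tokens and report how many decoding steps were needed."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(mtp_step(tokens, k))
        steps += 1
    return tokens[: len(prompt) + n_new], steps
```

Under these assumptions, the output is identical to what one-token-at-a-time decoding would give, but fewer steps are needed whenever the drafted tokens are accepted. How much is gained in practice depends on how often the drafts match the main model.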
The release delivers comprehensive support for DeepSeek V4 Pro and Flash models, with a full model implementation, PoolingCache support, and SSD cache integration that prevents silent corruption across prefix-cache hits. Notably, V4 tool calling now works end-to-end with DSML-format parsing on the OpenAI and Anthropic endpoints, enabling integration with Claude Code and other AI coding agents. The release also brings Gemma 4 support to the DFlash inference engine, along with a new chunked prefill mechanism that keeps long-context prompts from blocking concurrent inference requests.
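
The chunked prefill idea can be illustrated with a small scheduling sketch in Python. It is a toy model, not OMLX's scheduler: the `Request` class, the `schedule` function, and the round-robin policy are all assumptions made for illustration. It shows only the general principle that splitting a long prompt into chunks lets other requests take turns in between.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    name: str
    prompt_len: int   # prompt tokens still waiting to be prefilled
    decode_left: int  # output tokens still to be generated


def schedule(requests: List[Request], chunk_size: int) -> List[str]:
    """Round-robin over requests and return a log of the work done each turn.

    On its turn a request either prefills at most chunk_size prompt tokens
    or decodes a single token. Note that the Request objects are mutated.
    """
    queue = deque(requests)
    log: List[str] = []
    while queue:
        req = queue.popleft()
        if req.prompt_len > 0:
            n = min(chunk_size, req.prompt_len)
            req.prompt_len -= n
            log.append(f"{req.name}:prefill{n}")
        elif req.decode_left > 0:
            req.decode_left -= 1
            log.append(f"{req.name}:decode")
        # Re-queue the request only if it still has work left.
        if req.prompt_len > 0 or req.decode_left > 0:
            queue.append(req)
    return log
```

For example, with a 10-token prompt, a chunk size of 4, and a second request that is already decoding, the short request gets a decode turn after each prefill chunk. Without chunking, it would have to wait for the whole prompt to finish.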
v0.3.9 prioritizes stability and resource efficiency, particularly for low-memory systems. Key improvements include a memory enforcer that heads off out-of-memory crashes, fixes for hot-cache eviction races, parallelized SSD-to-cache preloading, and per-model cache hit-rate visibility. Additional features include ParoQuant quantization support with pluggable custom loaders, one-command coding agent launchers, and concurrent admin chat support.
- Chunked prefill mechanism allows long-context processing without blocking concurrent inference requests



