BotBeat
...
← Back

> ▌

OMLXOMLX
UPDATEOMLX2026-05-22

OMLX v0.3.9 Stable Release Introduces Native Multi-Token Prediction and Major Stability Improvements

Key Takeaways

  • ▸Native Multi-Token Prediction enables simultaneous multi-token generation, significantly boosting inference speed for Qwen, Gemma, and DeepSeek models
  • ▸Full DeepSeek V4 Pro/Flash support with integrated tool calling, enabling seamless integration with Claude Code and other AI agents
  • ▸Major stability improvements for low-memory systems through memory enforcer, cache optimization, and jetsam prevention
Source:
Hacker Newshttps://github.com/jundot/omlx/releases/tag/v0.3.9↗

Summary

OMLX v0.3.9 has reached stable release, introducing native Multi-Token Prediction (MTP) support for popular open-source language models including Qwen 3.5/3.6, Gemma 4, and DeepSeek-V4. When enabled per-model in admin settings, MTP allows supported models to predict multiple tokens simultaneously for faster inference decoding. For Gemma 4, MTP support extends to the vision path, delivering noticeably faster image and text request processing.

The release delivers comprehensive support for DeepSeek V4 Pro and Flash models with full model implementation, PoolingCache support, and integrated SSD caching to prevent silent corruption across prefix-cache hits. Notably, V4 tool calling now works end-to-end with DSML-format parsing on OpenAI and Anthropic endpoints, enabling seamless integration with Claude Code and other AI coding agents. The release also brings Gemma 4 support to the DFlash inference engine alongside a new chunked prefill mechanism that prevents long-context prompts from blocking concurrent inference requests.

V0.3.9 prioritizes stability and resource efficiency, particularly for low-memory systems. Key improvements include a memory enforcer that prevents out-of-memory crashes before they occur, hot-cache eviction race fixes, parallelized SSD-to-cache preloading, and per-model cache hit-rate visibility. Additional features include ParoQuant quantization support with pluggable custom loaders, one-command coding agent launchers, and concurrent admin chat support.

  • Chunked prefill mechanism allows long-context processing without blocking concurrent inference requests
Large Language Models (LLMs)Generative AIMLOps & InfrastructureOpen Source

More from OMLX

OMLXOMLX
PRODUCT LAUNCH

OMLX Brings Fast Local LLM Inference to Mac with Innovative SSD-Based KV Caching

2026-03-31

Comments

Suggested

MetaMeta
RESEARCH

Researchers Expose Critical Blind Spot in AI Safety Systems: Domain-Camouflaged Attacks Defeat Leading Injection Detectors

2026-05-22
SteelSpineSteelSpine
PRODUCT LAUNCH

SteelSpine Launches Cryptographically Verified Agent Debugging Platform

2026-05-22
OpenAIOpenAI
INDUSTRY REPORT

Frontier labs don't use most AI compute (yet)

2026-05-22
← Back to news
© 2026 BotBeat
AboutPrivacy PolicyTerms of ServiceContact Us