NVIDIA Releases Nemotron 3 Super: Open-Source 120B Hybrid Model with 2.2x Faster Inference
Key Takeaways
- Nemotron 3 Super achieves 2.2x to 7.5x higher inference throughput than competing open-source models while supporting a 1M-token context length
- New architectural features, including LatentMoE for accuracy and MTP (multi-token prediction) layers for native speculative decoding, improve both accuracy and efficiency
- The complete open-source release includes multiple model checkpoints, training datasets, and supporting artifacts, enabling community adoption and fine-tuning
Summary
NVIDIA announced the release of Nemotron 3 Super, a hybrid Mamba-Transformer Mixture-of-Experts model with 120B total parameters, of which 12B are active per token. The hybrid design combines Mamba state-space layers with attention layers for improved efficiency and performance. The model introduces LatentMoE for enhanced accuracy and MTP (multi-token prediction) layers for native speculative decoding, and it is pretrained in NVFP4, NVIDIA's 4-bit floating-point format. Nemotron 3 Super achieves up to 2.2x higher inference throughput than GPT-OSS-120B and up to 7.5x higher throughput than Qwen3.5-122B on a long-generation workload (8k input / 64k output tokens), while maintaining comparable or better accuracy across diverse benchmarks. The company is releasing the complete model stack: pre-trained, post-trained, and quantized checkpoints in multiple formats (NVFP4, FP8, BF16), along with the training datasets and a technical report. The release also includes specialized pretraining and post-training datasets targeting code, logic, and agentic capabilities, as well as a generative reward model (GenRM) for RLHF.
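For readers unfamiliar with speculative decoding, the toy sketch below illustrates the general idea with greedy decoding: a cheap drafter proposes several tokens, and the main model checks them in one verification pass, keeping the longest agreeing prefix. It is a generic illustration, not NVIDIA's implementation. The names `speculative_decode`, `target_model`, and `draft_model` are hypothetical, the "models" are simple arithmetic functions, and a separate draft function stands in for the MTP layers, which in the real model would produce the draft tokens from the main model itself.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding.

    Returns the full token sequence and the number of verification passes.
    The output always matches plain greedy decoding with `target`.
    """
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    passes = 0
    while len(out) < limit:
        # 1. The drafter proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system does
        #    this in one batched forward pass; here it is a plain loop.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            if len(out) + len(accepted) < limit:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, passes


def target_model(ctx: List[int]) -> int:
    # Hypothetical stand-in for the large model.
    return (ctx[-1] * 3 + 1) % 17


def draft_model(ctx: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except after multiples of 4.
    guess = target_model(ctx)
    return guess if ctx[-1] % 4 else (guess + 1) % 17


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out
```

The throughput benefit comes from the verification count: whenever the drafter is often right, the number of target passes falls below the number of generated tokens, while the output stays identical to ordinary greedy decoding.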
- NVFP4 quantization and MoE architecture reduce computational requirements for deployment while maintaining model quality
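A back-of-the-envelope sketch of why the checkpoint format and the MoE design matter for deployment: it estimates weight storage for 120B parameters in each released format and the share of parameters active per token. The parameter counts come from the announcement. The NVFP4 bits-per-weight figure is an assumption (4-bit values plus one 8-bit scale per 16-value block), and the estimate ignores the KV cache, activations, and any layers a real checkpoint keeps in higher precision. The names `BITS_PER_WEIGHT` and `weight_gb` are illustrative.

```python
TOTAL_PARAMS = 120e9   # total parameters, from the announcement
ACTIVE_PARAMS = 12e9   # parameters active per token, from the announcement

# Approximate storage cost per weight, in bits.
# NVFP4 entry is an assumption: 4-bit values + one 8-bit scale per 16 values.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,
}


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def footprint_by_format(num_params: float = TOTAL_PARAMS) -> dict:
    """Estimated weight storage for each checkpoint format."""
    return {fmt: weight_gb(num_params, bits) for fmt, bits in BITS_PER_WEIGHT.items()}


def active_fraction() -> float:
    """Share of parameters that take part in computing one token."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    for fmt, gb in footprint_by_format().items():
        print(f"{fmt:>6}: ~{gb:.1f} GB of weights")
    print(f"active per token: {active_fraction():.0%} of parameters")
```

Under these assumptions the weights shrink from roughly 240 GB in BF16 to under 70 GB in NVFP4, and only about a tenth of the parameters are used for any single token, which is the sense in which quantization and MoE routing reduce deployment cost.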
Editorial Opinion
Nemotron 3 Super represents a significant step forward in making large-scale language models more practical for real-world deployment. By combining architectural innovations (LatentMoE, MTP layers, hybrid Mamba-Transformer) with aggressive quantization and an open-source release, NVIDIA is directly addressing the deployment bottleneck that has limited the practical adoption of capable models in the 120B-parameter class. The performance gains, particularly the 7.5x throughput advantage over Qwen3.5-122B, could be transformative for latency-sensitive, real-time applications, while the open-source release signals NVIDIA's confidence in both the model quality and its hardware advantage in running these workloads efficiently.



