BotBeat

NVIDIA | PRODUCT LAUNCH | 2026-06-01

NVIDIA Releases Nemotron 3 Super: Open-Source 120B Hybrid Model with 2.2x Faster Inference

Key Takeaways

  • Nemotron 3 Super achieves up to 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B, while supporting a 1M-token context length
  • LatentMoE (for accuracy) and MTP layers (for native speculative decoding) improve both performance and efficiency
  • The complete open-source release includes multiple model checkpoints, training datasets, and supporting artifacts, enabling community adoption and fine-tuning
Source: Hacker News, https://research.nvidia.com/labs/nemotron/Nemotron-3-Super/

Summary

NVIDIA announced the release of Nemotron 3 Super, a Mixture-of-Experts hybrid Mamba-Transformer model with 120B total parameters, of which 12B are active per token. The hybrid design combines state-space (Mamba) layers with attention layers for improved efficiency and performance. The model introduces LatentMoE for enhanced accuracy and MTP layers for native speculative decoding, and it is pretrained in NVFP4, a 4-bit floating-point format optimized for NVIDIA hardware. Nemotron 3 Super achieves up to 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B on a long-generation workload (8k input / 64k output tokens), while maintaining comparable or superior accuracy across diverse benchmarks.

The company is releasing the complete model stack: pre-trained, post-trained, and quantized checkpoints in multiple formats (NVFP4, FP8, BF16), along with the training datasets and a technical report. The release also includes specialized pretraining and post-training datasets targeting code, logic, and agentic capabilities, as well as a GenRM model for RLHF fine-tuning.
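To make "speculative decoding" concrete, here is a minimal generic sketch of greedy draft-and-verify decoding. It is not NVIDIA's MTP implementation: the `target` and `draft` callables, the toy models, and the parameter `k` are all hypothetical stand-ins, and the sketch omits the bonus token that production systems usually take when every drafted token is accepted. The general idea is that a cheap drafter proposes several tokens and the full model verifies them, so the output matches plain greedy decoding while needing fewer verification rounds.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify decoding.

    Returns (generated tokens, number of verification rounds).
    """
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    rounds = 0
    while len(out) < limit:
        rounds += 1
        # The drafter proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # The target checks each proposed position. A real system would score
        # all positions in one batched forward pass; here it is a plain loop.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the remaining drafted tokens
    return out[len(prompt):], rounds


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, rounds = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=8)
```

In this toy run the drafter is wrong once, so the eight tokens are produced in three verification rounds instead of eight sequential target steps.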

  • NVFP4 quantization and MoE architecture reduce computational requirements for deployment while maintaining model quality
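A rough back-of-the-envelope calculation illustrates the point above. It uses only the parameter counts and checkpoint formats named in the article, and it assumes nominal bit widths of 16, 8, and 4 bits per weight. Those widths ignore scaling-factor overhead, any layers kept in higher precision, activations, and the KV cache, so the figures are indicative rather than actual checkpoint sizes.

```python
TOTAL_PARAMS = 120e9   # total parameters, from the article
ACTIVE_PARAMS = 12e9   # parameters active per token, from the article

# Nominal bits per weight (assumption: no scale-factor or mixed-precision overhead).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


footprints = {fmt: weight_gb(TOTAL_PARAMS, bits) for fmt, bits in BITS_PER_WEIGHT.items()}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for fmt, gb in footprints.items():
    print(f"{fmt}: ~{gb:.0f} GB of weights")
print(f"Active per token: {active_fraction:.0%} of parameters")
```

Under these assumptions, moving from BF16 to NVFP4 cuts weight storage by about 4x, while the MoE routing means only about a tenth of the parameters take part in each token's computation.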

Editorial Opinion

Nemotron 3 Super represents a significant step toward making large-scale language models practical for real-world deployment. By combining architectural innovations (LatentMoE, MTP layers, hybrid Mamba-Transformer) with aggressive quantization and an open-source release, NVIDIA is directly addressing the deployment bottleneck that has limited adoption of capable 120B+ parameter models. The throughput gains, particularly the 7.5x advantage over Qwen3.5-122B, could be transformative for high-volume and real-time serving, while the open-source release signals NVIDIA's confidence in both the model's quality and its hardware advantage in running these workloads efficiently.

Large Language Models (LLMs), Generative AI, Machine Learning, MLOps & Infrastructure, Open Source
