NVIDIA Releases Nemotron 3 Super: Open-Source 120B Hybrid Model with 2.2x Faster Inference

Key Takeaways

▸ Nemotron 3 Super achieves 2.2x-7.5x higher inference throughput than competing open-source models while supporting a 1M-token context length
▸ Novel technical innovations, including LatentMoE for accuracy and MTP layers for native speculative decoding, improve both performance and efficiency
▸ Complete open-source release includes multiple model checkpoints, training datasets, and supporting artifacts, enabling community adoption and fine-tuning

Source:

Hacker News: https://research.nvidia.com/labs/nemotron/Nemotron-3-Super/

Summary

NVIDIA announced the release of Nemotron 3 Super, a hybrid Mamba-Transformer Mixture-of-Experts (MoE) model with 120B total parameters, of which 12B are active. The hybrid design combines state-space (Mamba) layers with attention layers for improved efficiency and performance. The model introduces LatentMoE for enhanced accuracy and MTP layers for native speculative decoding, and it is pretrained in NVFP4, NVIDIA's 4-bit floating-point format. On a long-generation workload (8k input / 64k output tokens), Nemotron 3 Super achieves up to 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B, while maintaining comparable or superior accuracy across diverse benchmarks. NVIDIA is releasing the complete model stack: pre-trained, post-trained, and quantized checkpoints in multiple formats (NVFP4, FP8, BF16), along with the training datasets and a technical report. The datasets include specialized pretraining and post-training data targeting code, logic, and agentic capabilities, and the release also includes a GenRM model for RLHF fine-tuning.
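To make "speculative decoding" concrete, here is a minimal toy sketch of the general draft-and-verify idea in its greedy form. It is not NVIDIA's MTP implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and a cheap draft predictor, and the draft length `k` is an arbitrary choice. A real system would check all drafted tokens in one batched forward pass; the loop below checks them one by one for readability.

```python
from typing import List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def target_next(tokens: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic next-token rule."""
    return (sum(tokens) * 31 + len(tokens)) % VOCAB


def draft_next(tokens: List[int]) -> int:
    """Toy stand-in for a cheap draft predictor.

    It agrees with the target except when the prefix length is a multiple of 4.
    """
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding; returns (tokens, number of verify steps)."""
    tokens = list(prompt)
    final_len = len(prompt) + n_new
    verify_steps = 0
    while len(tokens) < final_len:
        # 1. Draft k tokens cheaply, each conditioned on the previous drafts.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. Verify: keep drafted tokens while they match the target's own choice.
        verify_steps += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the verify step yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:final_len], verify_steps


if __name__ == "__main__":
    out, steps = speculative_decode([1, 2, 3], n_new=20)
    print(out == greedy_decode([1, 2, 3], n_new=20), steps)
```

Because every kept token is exactly what the target would have chosen, the output matches plain greedy decoding; the saving comes from needing fewer verify steps than generated tokens whenever the draft is usually right.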

NVFP4 quantization and the MoE architecture reduce the compute and memory required for deployment while maintaining model quality.
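As a rough back-of-the-envelope illustration of why the checkpoint format matters, the sketch below estimates weight storage for 120B parameters in each released format. It makes simplifying assumptions that the source does not state: every parameter is stored in the named format, NVFP4 is counted as exactly 4 bits per weight (block scale factors ignored), and KV cache, recurrent state, and activations are excluded. The function and constant names are hypothetical.

```python
TOTAL_PARAMS = 120e9   # total parameters, from the announcement
ACTIVE_PARAMS = 12e9   # active parameters, from the announcement

# Nominal bits per stored weight (assumption: NVFP4 scale-factor overhead ignored).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gigabytes(n_params: float, bits: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def footprint_table() -> dict:
    """Weight storage per format if all parameters used that format."""
    return {fmt: weight_gigabytes(TOTAL_PARAMS, bits)
            for fmt, bits in BITS_PER_WEIGHT.items()}


def active_fraction() -> float:
    """Share of parameters that are active, per the 12B/120B figures."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    for fmt, gb in footprint_table().items():
        print(f"{fmt}: ~{gb:.0f} GB of weights")
    print(f"active fraction: {active_fraction():.0%}")
```

Under these assumptions the weights take roughly 240 GB in BF16, 120 GB in FP8, and 60 GB in NVFP4, and about 10% of the parameters are active. Real checkpoints will differ, since some layers are typically kept in higher precision.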

Editorial Opinion

Nemotron 3 Super represents a significant step forward in making large-scale language models more practical for real-world deployment. By combining architectural innovations (LatentMoE, MTP layers, hybrid Mamba-Transformer) with aggressive quantization and an open-source release, NVIDIA is directly addressing the deployment bottleneck that has limited the practical adoption of capable 120B+ parameter models. The reported throughput gains, particularly the 7.5x figure over Qwen3.5-122B, could be transformative for latency-sensitive, real-time applications, while the open-source release signals NVIDIA's confidence in both the model's quality and its hardware advantage in running these workloads efficiently.

PRODUCT LAUNCH | NVIDIA | 2026-06-01