VulkanForge: First Vulkan LLM Engine to Support Native FP8 Models on AMD RDNA 4
Key Takeaways
- VulkanForge is the first Vulkan LLM engine to achieve end-to-end FP8 execution, including a native FP8 KV cache (via VK_EXT_shader_float8), reducing VRAM usage by up to 28.2% without quality loss
- Meta-Llama-3.1-8B achieves 68.5 tok/s decode and 695 tok/s prefill on AMD RDNA 4, outperforming llama.cpp Vulkan on decode benchmarks and achieving competitive prefill performance (0.89–0.95×)
- Native FP8 support spans compressed-tensors SafeTensors loading, FP8 GEMV decode kernels, and three FP8 GEMM prefill kernels (naive, aligned, multi-WG), with multi-submit prefill pacing
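To make the FP8 terminology above concrete, here is a minimal CPU-side sketch of the OCP FP8 E4M3 format with per-tensor scaling, the storage scheme compressed-tensors-style checkpoints use. The function names and the brute-force encoder are illustrative, not VulkanForge's actual API or kernels:

```rust
/// Decode one OCP FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7),
/// 3 mantissa bits. Max normal value is 448; S.1111.111 encodes NaN
/// (E4M3 has no infinities).
fn fp8_e4m3_to_f32(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if exp == 0x0F && man == 7.0 {
        f32::NAN
    } else if exp == 0 {
        sign * (man / 8.0) * 2f32.powi(-6) // subnormal
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// Encode by brute-force nearest-code search: clear and obviously correct,
/// but far slower than a real bit-manipulation implementation.
fn f32_to_fp8_e4m3(x: f32) -> u8 {
    let mut best = 0u8;
    let mut best_err = f32::INFINITY;
    for code in 0u8..=255 {
        let v = fp8_e4m3_to_f32(code);
        if v.is_nan() {
            continue;
        }
        let err = (x - v).abs();
        if err < best_err {
            best_err = err;
            best = code;
        }
    }
    best
}

/// Per-tensor quantization: one f32 scale for the whole tensor, chosen so
/// the largest magnitude maps to E4M3's max normal (448.0).
fn quantize_per_tensor(xs: &[f32]) -> (Vec<u8>, f32) {
    let amax = xs.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if amax > 0.0 { amax / 448.0 } else { 1.0 };
    let codes = xs.iter().map(|&x| f32_to_fp8_e4m3(x / scale)).collect();
    (codes, scale)
}

/// Dequantize: decode each byte and multiply by the stored per-tensor scale.
fn dequantize_per_tensor(codes: &[u8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&b| fp8_e4m3_to_f32(b) * scale).collect()
}
```

Because normals carry only 3 mantissa bits, round-tripping a value introduces at most about 6.25% relative error, which is why the quality-loss claim rests on weights and KV entries tolerating coarse mantissas.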
Summary
VulkanForge, a lightweight 14 MB LLM inference engine written in Rust, has released v0.3.4 with native end-to-end FP8 model support for AMD RDNA 4 GPUs. Built directly on Vulkan 1.3 (via ash 0.38) as a compute-only engine without graphics overhead, VulkanForge demonstrates that specialized hardware optimization can deliver significant efficiency gains in LLM inference. The latest release introduces native FP8 KV cache support via VK_EXT_shader_float8, optimized FP8 GEMV/GEMM kernels, and multi-model support with an interactive chat CLI.
Benchmark results show competitive performance: Meta-Llama-3.1-8B runs at 68.5 tok/s decode and 695 tok/s prefill (at pp=512) with a GPU footprint of just 7.48 GiB, a 28.2% reduction in VRAM compared to the FP16 variant. VulkanForge achieves 1.04–1.06× better decode performance than llama.cpp Vulkan on several quantized models (Q4_K_M and Q3_K_M variants), beating llama.cpp in 4–5 configuration comparisons. The engine includes native FP8 SafeTensors loading with per-tensor compression support, allowing models like neuralmagic's Meta-Llama-3.1-8B-Instruct-FP8 to run end-to-end in FP8 without unpacking to a higher precision.
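The per-tensor layout also shapes the decode path: because one scale covers an entire weight matrix, an FP8 GEMV can decode bytes to floats in registers and fold the scale in once per output row, never materializing an FP16 copy of the weights. A minimal CPU-side sketch of that fusion, under the same E4M3 assumptions (names are illustrative, not VulkanForge's kernels):

```rust
/// Decode one OCP FP8 E4M3 byte (1 sign bit, 4 exponent bits with bias 7,
/// 3 mantissa bits); S.1111.111 is NaN, max normal is 448.
fn fp8_e4m3_to_f32(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if exp == 0x0F && man == 7.0 {
        f32::NAN
    } else if exp == 0 {
        sign * (man / 8.0) * 2f32.powi(-6) // subnormal
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// y = (scale * W) x for a row-major FP8 weight matrix: each byte is
/// decoded in-register and the per-tensor scale is applied once per row,
/// so no higher-precision copy of W is ever written to memory.
fn fp8_gemv(w: &[u8], scale: f32, x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            let dot: f32 = row
                .iter()
                .zip(x)
                .map(|(&q, &xi)| fp8_e4m3_to_f32(q) * xi)
                .sum();
            scale * dot
        })
        .collect()
}
```

On the GPU the same idea runs one workgroup per row (or per row block), which is presumably why decode-bound GEMV and prefill GEMM get separate kernels: GEMV is bandwidth-limited and benefits most from halving the bytes read per weight.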
The project builds on community foundations, particularly acknowledging the ROCmForge work that provided the model loader, GGUF parser, and initial architecture. This demonstrates how open-source, hardware-targeted optimization can unlock practical efficiency improvements for practitioners using quantized models on AMD's latest GPU architecture.
Editorial Opinion
VulkanForge exemplifies community-driven engineering excellence in AI infrastructure. While major AI labs optimize for broad platform support, this project shows how focused work on specialized hardware, AMD RDNA 4 in this case, can unlock significant real-world gains: 28% VRAM savings and measurably higher decode throughput. The native FP8 implementation, which never unpacks weights to a higher precision, reflects the kind of systems-level thinking that directly benefits researchers and practitioners working with quantized models.



