VulkanForge: First Vulkan LLM Engine to Support Native FP8 Models on AMD RDNA 4
Key Takeaways
- VulkanForge is the first Vulkan LLM engine to achieve end-to-end FP8 execution, including a native FP8 KV cache (via VK_EXT_shader_float8), reducing VRAM usage by up to 28.2% without quality loss
- Meta-Llama-3.1-8B achieves 68.5 tok/s decode and 695 tok/s prefill on AMD RDNA 4, outperforming llama.cpp Vulkan on decode benchmarks and achieving competitive prefill performance (0.89–0.95×)
- Native FP8 support spans compressed-tensors SafeTensors loading, FP8 GEMV decode kernels, and three FP8 GEMM prefill kernels (naive, aligned, multi-WG), with multi-submit prefill pacing
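To make the FP8 terminology above concrete, here is a minimal CPU-side sketch of the OCP FP8 E4M3 format with per-tensor scaling, the storage scheme compressed-tensors-style checkpoints use. The function names and the brute-force encoder are illustrative, not VulkanForge's actual API or kernels:

```rust
/// Decode one OCP FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7),
/// 3 mantissa bits. Max normal value is 448; S.1111.111 encodes NaN
/// (E4M3 has no infinities).
fn fp8_e4m3_to_f32(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if exp == 0x0F && man == 7.0 {
        f32::NAN
    } else if exp == 0 {
        sign * (man / 8.0) * 2f32.powi(-6) // subnormal
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// Encode by brute-force nearest-code search: clear and obviously correct,
/// but far slower than a real bit-manipulation implementation.
fn f32_to_fp8_e4m3(x: f32) -> u8 {
    let mut best = 0u8;
    let mut best_err = f32::INFINITY;
    for code in 0u8..=255 {
        let v = fp8_e4m3_to_f32(code);
        if v.is_nan() {
            continue;
        }
        let err = (x - v).abs();
        if err < best_err {
            best_err = err;
            best = code;
        }
    }
    best
}

/// Per-tensor quantization: one f32 scale for the whole tensor, chosen so
/// the largest magnitude maps to E4M3's max normal (448.0).
fn quantize_per_tensor(xs: &[f32]) -> (Vec<u8>, f32) {
    let amax = xs.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if amax > 0.0 { amax / 448.0 } else { 1.0 };
    let codes = xs.iter().map(|&x| f32_to_fp8_e4m3(x / scale)).collect();
    (codes, scale)
}

/// Dequantize: decode each byte and multiply by the stored per-tensor scale.
fn dequantize_per_tensor(codes: &[u8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&b| fp8_e4m3_to_f32(b) * scale).collect()
}
```

Because normals carry only 3 mantissa bits, round-tripping a value introduces at most about 6.25% relative error, which is why the quality-loss claim rests on weights and KV entries tolerating coarse mantissas.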
Summary
VulkanForge, a lightweight 14 MB LLM inference engine written in Rust, has released v0.3.4 with native end-to-end FP8 model support for AMD RDNA 4 GPUs. Built directly on Vulkan 1.3 (via ash 0.38) as a compute-only engine without graphics overhead, VulkanForge demonstrates that specialized hardware optimization can deliver significant efficiency gains in LLM inference. The latest release introduces native FP8 KV cache support via VK_EXT_shader_float8, optimized FP8 GEMV/GEMM kernels, and multi-model support with an interactive chat CLI.
Benchmark results show competitive performance: Meta-Llama-3.1-8B runs at 68.5 tok/s decode and 695 tok/s prefill (at pp=512) with a GPU footprint of just 7.48 GiB, a 28.2% reduction in VRAM compared to the FP16 variant. VulkanForge achieves 1.04–1.06× better decode performance than llama.cpp Vulkan on several quantized models (Q4_K_M and Q3_K_M variants), beating llama.cpp in 4–5 configuration comparisons. The engine includes native FP8 SafeTensors loading with per-tensor compression support, allowing models like neuralmagic's Meta-Llama-3.1-8B-Instruct-FP8 to run end-to-end in FP8 without unpacking to a higher precision.
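The per-tensor layout also shapes the decode path: because one scale covers an entire weight matrix, an FP8 GEMV can decode bytes to floats in registers and fold the scale in once per output row, never materializing an FP16 copy of the weights. A minimal CPU-side sketch of that fusion, under the same E4M3 assumptions (names are illustrative, not VulkanForge's kernels):

```rust
/// Decode one OCP FP8 E4M3 byte (1 sign bit, 4 exponent bits with bias 7,
/// 3 mantissa bits); S.1111.111 is NaN, max normal is 448.
fn fp8_e4m3_to_f32(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if exp == 0x0F && man == 7.0 {
        f32::NAN
    } else if exp == 0 {
        sign * (man / 8.0) * 2f32.powi(-6) // subnormal
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// y = (scale * W) x for a row-major FP8 weight matrix: each byte is
/// decoded in-register and the per-tensor scale is applied once per row,
/// so no higher-precision copy of W is ever written to memory.
fn fp8_gemv(w: &[u8], scale: f32, x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            let dot: f32 = row
                .iter()
                .zip(x)
                .map(|(&q, &xi)| fp8_e4m3_to_f32(q) * xi)
                .sum();
            scale * dot
        })
        .collect()
}
```

On the GPU the same idea runs one workgroup per row (or per row block), which is presumably why decode-bound GEMV and prefill GEMM get separate kernels: GEMV is bandwidth-limited and benefits most from halving the bytes read per weight.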
The project builds on community foundations, particularly acknowledging the ROCmForge work that provided the model loader, GGUF parser, and initial architecture. This demonstrates how open-source, hardware-targeted optimization can unlock practical efficiency improvements for practitioners using quantized models on AMD's latest GPU architecture.
Editorial Opinion
VulkanForge exemplifies community-driven engineering excellence in AI infrastructure. While major AI labs optimize for broad platform support, this project shows how focused work on specialized hardware, AMD RDNA 4 in this case, can unlock significant real-world gains: 28% VRAM savings and measurably higher decode throughput. The native FP8 implementation, which never unpacks weights to a higher precision, reflects the kind of systems-level thinking that directly benefits researchers and practitioners working with quantized models.



