VulkanForge – 14 MB Vulkan LLM engine that runs native FP8 models on AMD (Rust)
VulkanForge is a Vulkan-based LLM inference engine written in Rust, targeting AMD RDNA 4 (gfx1201) GPUs and supporting native FP8 model execution. Its efficient VRAM usage leaves headroom for 14B-class models on 16 GiB GPUs. The engine runs FP8 inference end to end, including an FP8 KV cache, and outperforms competing Vulkan implementations on both decode and prefill.
- VulkanForge is the first Vulkan engine to run full FP8 chat models, such as Meta-Llama-3.1-8B-Instruct-FP8, with a 7.48 GiB GPU footprint.
- It introduces a native FP8 KV cache via VK_EXT_shader_float8, cutting KV-cache VRAM by 50% and increasing decode speed by up to 1.4% (see the sizing sketch after this list).
- The engine supports SafeTensors FP8 loading, multiple FP8 GEMM kernels, and multi-submit prefill, reaching 68.5 tok/s decode and 695 tok/s prefill on Llama-3.1-8B.
- VulkanForge is built directly on ash 0.38 (Vulkan 1.3) with no higher-level wrapper; it is compute-only, with no swapchain or graphics queues (a minimal setup sketch follows this list).
- It builds on oldnordic's ROCmForge, inheriting core components such as the model loader, the CPU inference path, and the GGUF parser.
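The 50% KV-cache figure above is simply the element-size ratio: FP8 stores one byte per element where FP16 stores two. Below is a hypothetical back-of-envelope check, not code from the repo; the config values assume a Llama-3.1-8B-style model (32 layers, 8 KV heads under GQA, head dim 128).

```rust
// Hypothetical sizing check (not from the repo): bytes of KV cache per token.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    // K and V each hold kv_heads * head_dim elements per layer, hence the 2x.
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

fn main() {
    let fp16 = kv_bytes_per_token(32, 8, 128, 2); // FP16: 2 bytes per element
    let fp8 = kv_bytes_per_token(32, 8, 128, 1);  // FP8:  1 byte per element
    println!("FP16: {} KiB/token", fp16 / 1024);  // 128 KiB/token
    println!("FP8:  {} KiB/token", fp8 / 1024);   // 64 KiB/token
    // Halving the element size halves KV VRAM, which is where the ~50% comes from.
}
```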
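For a sense of what the compute-only ash setup means in practice, here is a minimal sketch, assuming the standard ash 0.38 API rather than VulkanForge's actual code: create a Vulkan 1.3 instance, pick a queue family with the COMPUTE bit (preferring one without GRAPHICS), and create a device with no swapchain or graphics state.

```rust
use ash::{vk, Entry};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the Vulkan loader and create a Vulkan 1.3 instance.
    let entry = unsafe { Entry::load()? };
    let app_info = vk::ApplicationInfo::default().api_version(vk::API_VERSION_1_3);
    let instance_info = vk::InstanceCreateInfo::default().application_info(&app_info);
    let instance = unsafe { entry.create_instance(&instance_info, None)? };

    let pdev = unsafe { instance.enumerate_physical_devices()? }
        .into_iter()
        .next()
        .ok_or("no Vulkan device")?;

    // Pick a queue family with COMPUTE, preferring a dedicated one
    // (no GRAPHICS bit); compute-only work never needs a swapchain.
    let families = unsafe { instance.get_physical_device_queue_family_properties(pdev) };
    let compute_family = families
        .iter()
        .enumerate()
        .filter(|(_, f)| f.queue_flags.contains(vk::QueueFlags::COMPUTE))
        .min_by_key(|(_, f)| f.queue_flags.contains(vk::QueueFlags::GRAPHICS))
        .map(|(i, _)| i as u32)
        .ok_or("no compute queue family")?;

    let priorities = [1.0f32];
    let queue_infos = [vk::DeviceQueueCreateInfo::default()
        .queue_family_index(compute_family)
        .queue_priorities(&priorities)];
    let device_info = vk::DeviceCreateInfo::default().queue_create_infos(&queue_infos);
    let device = unsafe { instance.create_device(pdev, &device_info, None)? };
    let _compute_queue = unsafe { device.get_device_queue(compute_family, 0) };

    unsafe {
        device.destroy_device(None);
        instance.destroy_instance(None);
    }
    Ok(())
}
```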
Opening excerpt (first ~120 words)
VulkanForge

A Vulkan-based LLM inference engine in Rust, targeting AMD RDNA 4 (gfx1201). Compute-only — no swapchain, no graphics queues — built directly on ash 0.38 (Vulkan 1.3) rather than a higher-level wrapper.

This project builds on the foundational work of oldnordic. Without his original ROCmForge implementation — the model loader, the CPU inference path, the GGUF parser, and the overall architecture — none of the WMMA matrix-core optimisations, the multi-model support, or the interactive chat CLI would have been possible. Thank you for making this project a reality.

Status: v0.3.4 — native FP8 LLM end-to-end, multi-submit prefill, Q3_K / Q5_K coopmat, 14B-class headroom on 16 GiB.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.