VulkanForge – 14 MB Vulkan LLM engine that runs native FP8 models on AMD (Rust)
VulkanForge is a Vulkan-based LLM inference engine written in Rust, targeting AMD RDNA 4 (gfx1201) GPUs and supporting native FP8 model execution. Its efficient VRAM usage leaves headroom for 14B-class models on 16 GiB GPUs. The engine runs FP8 inference end to end, including an FP8 KV cache, and outperforms competing Vulkan implementations on both decode and prefill.
- VulkanForge is the first Vulkan engine to run full FP8 chat models, such as Meta-Llama-3.1-8B-Instruct-FP8, with a 7.48 GiB GPU footprint.
- It introduces a native FP8 KV cache via VK_EXT_shader_float8, cutting KV-cache VRAM by 50% and increasing decode speed by up to 1.4% (see the sizing sketch after this list).
- The engine supports SafeTensors FP8 loading, multiple FP8 GEMM kernels, and multi-submit prefill, reaching 68.5 tok/s decode and 695 tok/s prefill on Llama-3.1-8B.
- VulkanForge is built directly on ash 0.38 (Vulkan 1.3) with no higher-level wrapper; it is compute-only, with no swapchain or graphics queues (a minimal setup sketch follows this list).
- It builds on oldnordic's ROCmForge, inheriting core components such as the model loader, the CPU inference path, and the GGUF parser.
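The 50% KV-cache figure above is simply the element-size ratio: FP8 stores one byte per element where FP16 stores two. Below is a hypothetical back-of-envelope check, not code from the repo; the config values assume a Llama-3.1-8B-style model (32 layers, 8 KV heads under GQA, head dim 128).

```rust
// Hypothetical sizing check (not from the repo): bytes of KV cache per token.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    // K and V each hold kv_heads * head_dim elements per layer, hence the 2x.
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

fn main() {
    let fp16 = kv_bytes_per_token(32, 8, 128, 2); // FP16: 2 bytes per element
    let fp8 = kv_bytes_per_token(32, 8, 128, 1);  // FP8:  1 byte per element
    println!("FP16: {} KiB/token", fp16 / 1024);  // 128 KiB/token
    println!("FP8:  {} KiB/token", fp8 / 1024);   // 64 KiB/token
    // Halving the element size halves KV VRAM, which is where the ~50% comes from.
}
```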
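For a sense of what the compute-only ash setup means in practice, here is a minimal sketch, assuming the standard ash 0.38 API rather than VulkanForge's actual code: create a Vulkan 1.3 instance, pick a queue family with the COMPUTE bit (preferring one without GRAPHICS), and create a device with no swapchain or graphics state.

```rust
use ash::{vk, Entry};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the Vulkan loader and create a Vulkan 1.3 instance.
    let entry = unsafe { Entry::load()? };
    let app_info = vk::ApplicationInfo::default().api_version(vk::API_VERSION_1_3);
    let instance_info = vk::InstanceCreateInfo::default().application_info(&app_info);
    let instance = unsafe { entry.create_instance(&instance_info, None)? };

    let pdev = unsafe { instance.enumerate_physical_devices()? }
        .into_iter()
        .next()
        .ok_or("no Vulkan device")?;

    // Pick a queue family with COMPUTE, preferring a dedicated one
    // (no GRAPHICS bit); compute-only work never needs a swapchain.
    let families = unsafe { instance.get_physical_device_queue_family_properties(pdev) };
    let compute_family = families
        .iter()
        .enumerate()
        .filter(|(_, f)| f.queue_flags.contains(vk::QueueFlags::COMPUTE))
        .min_by_key(|(_, f)| f.queue_flags.contains(vk::QueueFlags::GRAPHICS))
        .map(|(i, _)| i as u32)
        .ok_or("no compute queue family")?;

    let priorities = [1.0f32];
    let queue_infos = [vk::DeviceQueueCreateInfo::default()
        .queue_family_index(compute_family)
        .queue_priorities(&priorities)];
    let device_info = vk::DeviceCreateInfo::default().queue_create_infos(&queue_infos);
    let device = unsafe { instance.create_device(pdev, &device_info, None)? };
    let _compute_queue = unsafe { device.get_device_queue(compute_family, 0) };

    unsafe {
        device.destroy_device(None);
        instance.destroy_instance(None);
    }
    Ok(())
}
```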
Opening excerpt (first ~120 words)
VulkanForge

A Vulkan-based LLM inference engine in Rust, targeting AMD RDNA 4 (gfx1201). Compute-only — no swapchain, no graphics queues — built directly on ash 0.38 (Vulkan 1.3) rather than a higher-level wrapper.

This project builds on the foundational work of oldnordic. Without his original ROCmForge implementation — the model loader, the CPU inference path, the GGUF parser, and the overall architecture — none of the WMMA matrix-core optimisations, the multi-model support, or the interactive chat CLI would have been possible. Thank you for making this project a reality.

Status: v0.3.4 — native FP8 LLM end-to-end, multi-submit prefill, Q3_K / Q5_K coopmat, 14B-class headroom on 16 GiB.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.