
Built a local LLM inference engine on CachyOS — runs faster than llama.cpp on my 9070 XT


Hey folks, we've been hacking on a Vulkan-based LLM engine for the last few weeks, and figured I'd share since I'm running it exclusively on CachyOS with Mesa RADV. It's called VulkanForge: a single 14 MB Rust binary, no Python, no ROCm, just pure Vulkan compute shaders. It runs GGUF models (Q4_K_M etc.) and also native FP8 SafeTensors, which llama.cpp can't even load.

Some numbers on my RX 9070 XT (RADV, Mesa 26.0.6):

- Qwen3-8B Q4_K_M: 134 tok/s decode (llama.cpp does ~129)
- Mistral-7B: 132 tok/s (llama.cpp ~1
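For anyone wondering what "native FP8" decoding involves, here's a minimal sketch of converting the common OCP E4M3 variant (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits) to f32 in Rust. To be clear, this is just the standard E4M3 bit layout for illustration, not VulkanForge's actual decode path, and the function name is made up:

```rust
/// Decode one OCP FP8 E4M3 value to f32.
/// Layout: 1 sign bit | 4 exponent bits (bias 7) | 3 mantissa bits.
/// E4M3 has no infinities; the only NaN encodings are 0x7F and 0xFF.
fn fp8_e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0_f32 } else { 1.0_f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man = (byte & 0x07) as i32;
    if exp == 0x0F && man == 0x07 {
        return f32::NAN; // S.1111.111 is the sole NaN pattern
    }
    let frac = man as f32 / 8.0;
    if exp == 0 {
        // Subnormal: sign * (man / 8) * 2^(1 - bias)
        sign * frac * 2.0_f32.powi(1 - 7)
    } else {
        // Normal: sign * (1 + man / 8) * 2^(exp - bias)
        sign * (1.0 + frac) * 2.0_f32.powi(exp - 7)
    }
}

fn main() {
    assert_eq!(fp8_e4m3_to_f32(0x00), 0.0);   // +0
    assert_eq!(fp8_e4m3_to_f32(0x38), 1.0);   // exp=7, man=0
    assert_eq!(fp8_e4m3_to_f32(0x7E), 448.0); // largest finite E4M3 value
    assert!(fp8_e4m3_to_f32(0x7F).is_nan());
}
```

Since a SafeTensors F8_E4M3 tensor is just a flat byte buffer, a loader can precompute a 256-entry f32 lookup table from a function like this instead of branching per element.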
