
Built a local LLM inference engine on CachyOS — runs faster than llama.cpp on my 9070 XT


Hey folks, we've been hacking on a Vulkan-based LLM engine for the last few weeks, and figured I'd share since I'm running it exclusively on CachyOS with Mesa RADV. It's called VulkanForge: a single 14 MB Rust binary, no Python, no ROCm, just pure Vulkan compute shaders. It runs GGUF models (Q4_K_M etc.) and also native FP8 SafeTensors, which llama.cpp can't even load.

Some numbers on my RX 9070 XT (RADV, Mesa 26.0.6):

- Qwen3-8B Q4_K_M: 134 tok/s decode (llama.cpp does ~129)
- Mistral-7B: 132 tok/s (llama.cpp ~1
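For anyone wondering what "native FP8" decoding involves, here's a minimal sketch of converting the common OCP E4M3 variant (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits) to f32 in Rust. To be clear, this is just the standard E4M3 bit layout for illustration, not VulkanForge's actual decode path, and the function name is made up:

```rust
/// Decode one OCP FP8 E4M3 value to f32.
/// Layout: 1 sign bit | 4 exponent bits (bias 7) | 3 mantissa bits.
/// E4M3 has no infinities; the only NaN encodings are 0x7F and 0xFF.
fn fp8_e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0_f32 } else { 1.0_f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man = (byte & 0x07) as i32;
    if exp == 0x0F && man == 0x07 {
        return f32::NAN; // S.1111.111 is the sole NaN pattern
    }
    let frac = man as f32 / 8.0;
    if exp == 0 {
        // Subnormal: sign * (man / 8) * 2^(1 - bias)
        sign * frac * 2.0_f32.powi(1 - 7)
    } else {
        // Normal: sign * (1 + man / 8) * 2^(exp - bias)
        sign * (1.0 + frac) * 2.0_f32.powi(exp - 7)
    }
}

fn main() {
    assert_eq!(fp8_e4m3_to_f32(0x00), 0.0);   // +0
    assert_eq!(fp8_e4m3_to_f32(0x38), 1.0);   // exp=7, man=0
    assert_eq!(fp8_e4m3_to_f32(0x7E), 448.0); // largest finite E4M3 value
    assert!(fp8_e4m3_to_f32(0x7F).is_nan());
}
```

Since a SafeTensors F8_E4M3 tensor is just a flat byte buffer, a loader can precompute a 256-entry f32 lookup table from a function like this instead of branching per element.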
