## Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

I managed to get **DFlash speculative decoding** working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR.

Build tested:

```text
67cb0d507 (8942)
```

Setup:

```text
GPU: RTX 2080 SUPER 8GB
Model: Qwen3.5-35B-A3B Q5_K_M
Draft model: Qwen3.5-35B-A3B-DFlash Q4_K_M
Backend: CUDA
```

The main model is a 35B MoE GGUF of around 24.44 GiB, so obviously it does not fit in 8GB VRAM. The trick was combining…
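The usual way to run a ~24 GiB MoE on an 8 GB card in llama.cpp is to keep the MoE expert tensors in system RAM while the dense/attention weights and the small draft model stay on the GPU. For reference, a sketch of what such a launch can look like; the file names and all numeric values are illustrative assumptions (not the exact command from this run), and it assumes the DFlash draft loads through the standard `-md`/`--model-draft` path:

```bash
# Illustrative llama.cpp speculative-decoding launch (values are assumptions):
#   -ngl 99          offload all layers to the GPU...
#   --n-cpu-moe 40   ...then keep the MoE expert weights of the first 40
#                    layers in system RAM so the remainder fits in 8GB VRAM
#   -ngld 99         keep the small DFlash draft model fully on the GPU
#   --draft-max/-min bounds on how many tokens the draft proposes per step
./llama-server \
  -m Qwen3.5-35B-A3B-Q5_K_M.gguf \
  -md Qwen3.5-35B-A3B-DFlash-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 40 \
  -ngld 99 \
  --draft-max 16 \
  --draft-min 1 \
  -c 8192
```

With an A3B-style MoE only about 3B parameters are active per token, so routing the experts through system RAM costs far less than full CPU inference, and accepted draft tokens let the big model verify several tokens in one batched pass instead of one pass per token.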