## Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

I managed to get **DFlash speculative decoding** working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR.

Build tested:

```text
67cb0d507 (8942)
```

Setup:

```text
GPU: RTX 2080 SUPER 8GB
Model: Qwen3.5-35B-A3B Q5_K_M
Draft model: Qwen3.5-35B-A3B-DFlash Q4_K_M
Backend: CUDA
```

The main model is a 35B MoE GGUF of around 24.44 GiB, so obviously it does not fit in 8GB VRAM. The trick was combining…
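The usual way to run a ~24 GiB MoE on an 8 GB card in llama.cpp is to keep the MoE expert tensors in system RAM while the dense/attention weights and the small draft model stay on the GPU. For reference, a sketch of what such a launch can look like; the file names and all numeric values are illustrative assumptions (not the exact command from this run), and it assumes the DFlash draft loads through the standard `-md`/`--model-draft` path:

```bash
# Illustrative llama.cpp speculative-decoding launch (values are assumptions):
#   -ngl 99          offload all layers to the GPU...
#   --n-cpu-moe 40   ...then keep the MoE expert weights of the first 40
#                    layers in system RAM so the remainder fits in 8GB VRAM
#   -ngld 99         keep the small DFlash draft model fully on the GPU
#   --draft-max/-min bounds on how many tokens the draft proposes per step
./llama-server \
  -m Qwen3.5-35B-A3B-Q5_K_M.gguf \
  -md Qwen3.5-35B-A3B-DFlash-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 40 \
  -ngld 99 \
  --draft-max 16 \
  --draft-min 1 \
  -c 8192
```

With an A3B-style MoE only about 3B parameters are active per token, so routing the experts through system RAM costs far less than full CPU inference, and accepted draft tokens let the big model verify several tokens in one batched pass instead of one pass per token.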