KV cache quantization: ignorance, or malice?
I run Qwen-3.6 27B FP8 on vLLM for long-horizon agentic coding harness workloads with a large context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between speed and reliability. I want to raise a particular point of contention in this optimization process. I have an extensive software engineering background but am relatively new to this, so feel free to correct me if I’m not on the right track. It seems like co
LocalLlama