GLM 5.1 Locally: 40tps, 2000+ pp/s
After some SGLang patching and countless experiments, I managed to get the REAP-ed NVFP4 version running stable and FAST on 4 x RTX 6000 Pros (limited to 350W). Very happy with the performance and quality. Inference software is still under-optimized for these cards. I think we will see their true potential unfold this year or early next year.

Throughput by Context Depth

| Prefilled | PP@4096 | TG@512 |
|-----------|---------|--------|
| 0         | 2229.0  | 42.03  |
| 4K        | 1943.6  | 41.41  |
| 16K       | 1558.9  | 39.72  |
| 32K       | 1234.2  | 38.19  |
| 64K       | 863.5   | 35.87  |

TG Peak (burst throughput) 43.00 42.
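To put the throughput figures in perspective, here is a rough sketch that expresses each depth as a percentage of the empty-context numbers and estimates the wall time for one request. It assumes PP@4096 means prefill speed in tokens/s over a 4096-token prompt and TG@512 means generation speed in tokens/s over 512 new tokens, which is my reading of the column names and not something the post spells out. The function names are made up for illustration. By this arithmetic, generation at 64K keeps roughly 85% of its zero-depth speed while prefill drops to roughly 39%.

```python
# Benchmark rows copied from the post: (prefilled depth label, PP@4096, TG@512).
ROWS = [
    ("0", 2229.0, 42.03),
    ("4K", 1943.6, 41.41),
    ("16K", 1558.9, 39.72),
    ("32K", 1234.2, 38.19),
    ("64K", 863.5, 35.87),
]


def relative_to_empty(rows):
    """Return (depth, PP %, TG %) relative to the first row (empty context)."""
    base_pp, base_tg = rows[0][1], rows[0][2]
    return [(d, 100 * pp / base_pp, 100 * tg / base_tg) for d, pp, tg in rows]


def request_seconds(pp, tg, prompt_tokens=4096, gen_tokens=512):
    """Estimate seconds for one request: prefill time plus generation time.

    Assumption: pp and tg are tokens per second. The default token counts
    mirror the PP@4096 / TG@512 labels.
    """
    return prompt_tokens / pp + gen_tokens / tg


if __name__ == "__main__":
    for (depth, pp, tg), (_, pp_pct, tg_pct) in zip(ROWS, relative_to_empty(ROWS)):
        secs = request_seconds(pp, tg)
        print(f"{depth:>4}  PP {pp_pct:5.1f}%  TG {tg_pct:5.1f}%  ~{secs:.1f}s per request")
```

This ignores batching, scheduling overhead, and the time needed to fill the prefilled context in the first place, so treat the per-request seconds as a lower bound.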
Original article
LocalLlama