GLM 5.1 Locally: 40tps, 2000+ pp/s

Apr 25, 2026 · 4:31 PM UTC

After some sglang patching and countless experiments, I managed to get the REAP-ed NVFP4 version running stably and fast on 4 x RTX 6000 Pros (power-limited to 350W). Very happy with the performance and quality. Inference software is still under-optimized for those cards. I think we will see their true potential unfold later this year or early next year.

Throughput by Context Depth

Prefilled   PP@4096   TG@512
0           2229.0    42.03
4K          1943.6    41.41
16K         1558.9    39.72
32K         1234.2    38.19
64K          863.5    35.87

TG Peak (burst throughput) 43.00 42.
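To make the throughput figures above easier to read, here is a rough Python sketch that turns them into "fraction of zero-context speed retained" and an estimated wall-clock time for one request. It rests on assumptions the post does not state: both columns are in tokens per second, 4K/16K/32K/64K mean 4096/16384/32768/65536 prefilled tokens, and a request consists of 4096 prompt tokens plus 512 generated tokens (matching the PP@4096 and TG@512 labels). The names `RESULTS` and `summarize` are made up for illustration.

```python
# Figures copied from the post: prefilled depth -> (PP@4096, TG@512).
# Assumption: both rates are in tokens/s, and "4K" means 4096 tokens, etc.
RESULTS = {
    0: (2229.0, 42.03),
    4096: (1943.6, 41.41),
    16384: (1558.9, 39.72),
    32768: (1234.2, 38.19),
    65536: (863.5, 35.87),
}

PROMPT_TOKENS = 4096  # assumed from the "PP@4096" label
GEN_TOKENS = 512      # assumed from the "TG@512" label


def summarize(results):
    """Return one dict per depth with retained speed and estimated request time."""
    base_pp, base_tg = results[0]
    rows = []
    for depth, (pp, tg) in sorted(results.items()):
        rows.append({
            "depth": depth,
            # Speed relative to the zero-context measurement.
            "pp_retained": pp / base_pp,
            "tg_retained": tg / base_tg,
            # Naive estimate: prompt processing time plus generation time.
            "seconds": PROMPT_TOKENS / pp + GEN_TOKENS / tg,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(
            f"{row['depth']:>6} prefilled: "
            f"PP {row['pp_retained']:.0%}, TG {row['tg_retained']:.0%}, "
            f"~{row['seconds']:.1f}s per request"
        )
```

Under those assumptions, prompt processing at 64K runs at roughly 39% of its zero-context speed, while token generation keeps about 85%. In other words, long context mostly costs prefill speed on this setup, not generation speed.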

Original article

LocalLlama
