#vllm — Tagged Stories | WeSearch Press

Every story in the WeSearch catalog tagged with #vllm, listed newest first by publish time, with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

8 stories are currently tagged with #vllm. Tag pages update as new stories are ingested.


RELATED TAGS

#ai-inference (1) · #code-llms (1) · #performance-benchmark (1) · #text-generation-inference (1)

DEV.TO (TOP)

llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8

Test post…

19 views · Wed, 03 Jun 2026 06:11:56 GMT

#ai #llm #opensource

GITHUB

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…

17 views · Fri, 29 May 2026 19:45:02 GMT

#technology #programming #machine learning

DEV.TO (TOP)

Prefix caching in vLLM under multi-tenant agent traffic

TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop…

20 views · Tue, 26 May 2026 07:07:46 GMT

#mlops #infrastructure #pytorch

DEV.TO (TOP)

End-to-End Observability for vLLM and TGI: from DCGM to Tokens

Running large language model inference servers in production exposes gaps that neither stock…

11 views · Thu, 21 May 2026 11:51:11 GMT

#observability #machine learning #infrastructure

PHORONIX

Intel llm-scaler-vllm PV 1.4 Released With Updated Components, Arc Pro B70 Support

Intel software engineers today rolled out the llm-scaler-vllm PV v1.4 as the Docker build of their latest software stack for those wishing to run vLLM in a pre-configured, performa…

22 views · Wed, 20 May 2026 10:25:02 GMT

#intel #software #graphics

DEV.TO (TOP)

Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?

Ollama vs llama.cpp vs vLLM compared — ease of use, speed, GPU needs. Which inference engine is right for your workflow?…

15 views · Wed, 20 May 2026 01:34:58 GMT

#technology #ai #software

DEV.TO (TOP)

Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs

Serving code LLMs at production scale is 3.2x more expensive than general-purpose LLMs when using…

15 views · Wed, 29 Apr 2026 05:01:00 GMT

#ai inference #code llms #performance benchmark

VERCEL

Disaggregated Serving for Hybrid SSM Models in vLLM

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way…

13 views · Tue, 28 Apr 2026 20:44:39 GMT

#machine learning #model serving #state-space models