WeSearch
TAG · #VLLM

vLLM coverage.

Every story in the WeSearch catalog tagged #vllm, in chronological order, with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

8 stories are tagged #vllm, listed in publish-time order. Tag pages update as new stories are ingested.


RELATED TAGS
#ai-inference (1) · #code-llms (1) · #performance-benchmark (1) · #text-generation-inference (1)
DEV.TO (TOP)

llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8

Test post…

19 views · #ai #llm #opensource
GITHUB

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…

17 views · #technology #programming #machine learning
DEV.TO (TOP)

Prefix caching in vLLM under multi-tenant agent traffic

TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop…

20 views · #mlops #infrastructure #pytorch
DEV.TO (TOP)

End-to-End Observability for vLLM and TGI: from DCGM to Tokens

Running large language model inference servers in production exposes gaps that neither stock…

11 views · #observability #machine learning #infrastructure
PHORONIX

Intel llm-scaler-vllm PV 1.4 Released With Updated Components, Arc Pro B70 Support

Intel software engineers today rolled out the llm-scaler-vllm PV v1.4 as the Docker build of their latest software stack for those wishing to run vLLM in a pre-configured, performa…

22 views · #intel #software #graphics
DEV.TO (TOP)

Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?

Ollama vs llama.cpp vs vLLM compared — ease of use, speed, GPU needs. Which inference engine is right for your workflow?…

15 views · #technology #ai #software
DEV.TO (TOP)

Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs

Serving code LLMs at production scale is 3.2x more expensive than general-purpose LLMs when using…

15 views · #ai inference #code llms #performance benchmark
VERCEL

Disaggregated Serving for Hybrid SSM Models in vLLM

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way…

13 views · #machine learning #model serving #state-space models