WeSearch
TAG · #LLM-INFERENCE

LLM Inference coverage.

Every story in the WeSearch catalog tagged with #llm-inference, listed in publish-time order with view counts: 20 stories at present. Tag pages update as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

RELATED TAGS
#gpu-optimization (2), #asynchronous-processing (1), #cuda (1), #performance-optimization (1), #ml (1), #compiler-design (1), #high-performance-computing (1), #ada-mk (1), #wenxin-dong (1), #mingqing-hu (1), #guanghui-yu (1), #qiang-fu (1)
GITHUB

TensorSharp: Open-Source Local LLM Inference Engine

A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama…

29 views · #technology, #software, #open-source
GITHUB

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…

17 views · #technology, #programming, #machine learning
HACKER NEWS (AI / LLM)

How Many GPUs? A simple LLM inference sizing calculator

16 views
KOG LABS

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative d…

18 views · #ai, #technology, #gpu
ARXIV CS.AI

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…

26 views · #artificial intelligence, #machine learning, #performance evaluation
ANTIREZ

Distributing LLM Inference in DwarfStar

17 views
GITHUB

Show HN: YieldOS-Lite – A simulator for LLM inference control-plane governance

Contribute to nikitph/yieldos development by creating an account on GitHub.…

17 views · #technology, #research, #simulation
R/HOMELAB

RTX 6000 Ada vs RTX PRO Blackwell for local LLM inference?

20 views
ARXIV.ORG

SSV: Sparse Speculative Verification for Efficient LLM Inference

Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across m…

20 views · #computer science, #machine learning, #operating systems
SPRINGER

Characterization of machine learning compilers for LLM inference on NVIDIA GPUs

AI inference is conflicted between Performance, developer Productivity, and device Portability–the P3 problem. Machine learning compilers (MLCs) aim to address this, but their ecos…

20 views · #machine learning, #nvidia, #artificial intelligence
BONZAI

Show HN: BonzAI – self-sovereign, local LLM inference in the browser

Generate unlimited AI content offline. Train custom models and earn crypto by serving them on our decentralized P2P network powered by Chainlink.…

15 views
ARXIV CS.AI

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory avai…

21 views · #artificial intelligence, #machine learning, #distributed computing
GITHUB

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips.…

19 views · #technology, #artificial intelligence, #machine learning
CO

AI Foundry – Flat-Fee Unlimited LLM Inference on Blackwell GPUs in NZ

NZ's Sovereign AI Inference Platform…

14 views
DEV.TO (TOP)

Agentic LLM Inference Parameters Reference for Qwen and Gemma

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k,...…

21 views · #llm, #tuning, #qwen
GOOGLE

ClickBook – Offline Android eReader with local LLM inference via llama.rn

Tap any word to instantly understand it. Offline AI ereader for EPUB and PDFs.…

15 views
ARXIV.ORG

Ada-MK: Adaptive MegaKernel Optimization via DAG-Based Search for LLM Inference

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet ever…

21 views · #machine learning, #gpu optimization
R/LOCALLLAMA

I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash

15 views
GITHUB

Show HN: AI/ML benchmark for local LLM inference and XGBoost training on GPU/CPU

A suite to benchmark CPU/GPU Python performance in training ML models and running local LLMs - albedan/ai-ml-gpu-bench…

16 views · #ai, #machine learning, #benchmarking
HUGGINGFACE

Asynchronicity in Continuous Batching

19 views · #gpu optimization, #asynchronous processing