60 stories tagged with #inference, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
Still: Amortized KV Cache Compaction in a Single Forward Pass
The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive…
TensorSharp: Open-Source Local LLM Inference Engine
A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama…
Lean Inference: Lean Manufacturing Principles Applied to AI
Making inference scale in a cost-effective way…
Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive
Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…
FingerMotion shares rise on entry into edge AI inference computing market
Building a High-Performance Real-Time Data Pipeline with Edge Inference and Observability
With Nvidia Groq 3, the Era of AI Inference Is (Probably) Here (⌛ March 2026)
What makes Nvidia's new Groq 3 LPU chip a must-watch in the AI world?…
Computer Use Agents Go Local: A Deep Technical Dive into On-Device GUI Automation, Quantized Inference & Holo3.1
Learn how to build production-grade local computer use agents using Holo3.1's...…
Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection
In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic…
Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs
The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This pr…
Inference + Agentic AI race (groq LPU vs SambaNova RDU) vs alternatives for Decode
Megaport secures 4 AI deals, to raise $594 million to build inference cloud
Everyone here self-hosts inference. Almost nobody self-hosts the tooling around it. That feels backwards to me.
Prediction: This Artificial Intelligence (AI) Inference Specialist Is Going to Soar After June 3
Inference Theft Is the New AI App Security Bug: How to Protect Your LLM Endpoints
A practical checklist for protecting public AI endpoints from model abuse, runaway agent loops, and surprise inference bills.…
Silicon Motion new SM2524XT PCIe 5 controller achieves 14GB/s read and 12GB/s write speeds with up to 2.5 million IOPS and up to 25% higher performance-per-watt, designed for AI inference
Enterprise AI Governance Starts With Identity, Not Inference
The mistake most teams make with AI governance is starting in the wrong place. They start with model...…
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM - jmaczan/tiny-vllm…
DinoV3 Embedding inference and visualization with Rust, ort and egui!
How Many GPUs? A simple LLM inference sizing calculator
The Apple Neural Engine Inference Book
Sources: ByteDance has partnered with chipmaker InnoStar to develop an AI inference chip modeled after Groq's LPUs, which are built to run AI models at low cost (The Information)
KV-Pool: 4.5x Agent Inference Throughput with Persistent KV Cache
Why Agent Workloads Are Expensive: LLM inference costs always scale with context length. In...…
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative d…
Show HN: Static-allocation MLP inference in ANSI C using a 2-slot ring buffer
Static-allocation MLP inference in ANSI C using a 2-slot circular buffer with fixed-stride indexing. An easy-to-use, minimal MLP alternative to GiorgosXou/NeuralNetworks enhanced wit…
Argonne flexes spare supercompute to build private AI inference service
Think ChatDoE…
90% cheaper repo inference with GPT-5.4 nano
For bounded orchestration decisions, the right model is often the smallest one that can pass a focused validation loop.…
Stress disrupts hippocampal integration of overlapping events, memory inference
Tensormesh, whose inference platform uses KV caching to reduce costs, raised a $20M seed extension, bringing its total funding to $24.5M (Chris Metinko/Axios)
Tensormesh Raises $20M from Investors Including AMD Ventures, CoreWeave, NVentures, Launches Tensormesh Inference to Fix AI’s Most Expensive Problem - Morningstar
Imece – Distributed AI inference using volunteer GPUs and FLOP token
A decentralized AI compute cooperative where contributors earn inference credits by donating idle GPU/CPU time — measured in FLOPs, not crypto. - aslankose/imece…
I Squared Capital buys $225M data center portfolio from Cogent Fiber to build AI inference platform
I Squared Capital acquires 10 data center facilities from Cogent Fiber for $225M, committing up to $1B to build a US platform focused on AI inference workloads.…
MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing t…
AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two indepe…
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining…
I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required
Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of...…
Boom Times for Inference Providers? - The Information
Source: AI inference provider Baseten is in talks to raise $1B at a post-money valuation of $11B, up from $5B after its $300M Series E announced in January (The Information)
Show HN: MurrDB: A RocksDB-based NVMe/S3 cache for AI inference workloads
Verbosity is not faithfulness: an architectural argument that reasoning models cannot perform faithful inference [D]
Researchers develop Bayesian inference for hidden dependence structures in multi-group high-dimensional data
I Squared bets on AI inference with $225 million data center buy from Cogent
BODHI: Precise OS Kernel Specification Inference
The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demand…
Inference Time Context Sparsity: Illusion or Opportunity?
Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interacti…
EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages
Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale. The Electronic Patient-Provider Communic…
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…
Hypothesis Generation and Inductive Inference in Children and Language Models
Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which comp…
Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction
Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corr…
Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models
Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories …
New local model reaching near frontier on PII removal at 9 ms CPU inference
Building Conifer, an open-source local inference runtime (free + open source)
Planning a dual 3090 inference server -- sanity check before I buy
Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM?
DCGAN inference on a microcontroller: 12.6M parameters, 512KB SRAM, 26-second generation, pure C [P]
Is AI inference platform really that saturated now? [D]
Distributing LLM Inference in DwarfStar
Model Routing Cost Checklist: Hosted APIs, Open Models, Or Self-Hosted Inference?
Originally published on TechSaaS Cloud. Model...…
Show HN: YieldOS-Lite – A simulator for LLM inference control-plane governance
Contribute to nikitph/yieldos development by creating an account on GitHub.…
Components Check Before Order - Inference/Games
XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms
AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance…