15 stories tagged with #vlm, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
📄Paper: RORA-VLM: Robust Retrieval Augmentation for Vision Language Models
Published at International Conference on Learning Representations (ICLR) 2025 💡 Why I read...…
LoongForge-A high-performance training framework for LLM, VLM, DIT, VLA models
A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models. - baidu-baige/LoongForge…
Capping VLM spend per CV researcher: hierarchical budgets in practice
TL;DR: Our 11-person CV team at Prophesee was burning through €3-4k weeks of VLM spend on dataset...…
Show HN: Cursed Browser – a VLM reads the HTML and hallucinates the page
True AI-Native Browser — a VLM reads the HTML and hallucinates the page. - scosman/cursed_browser…
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Altho…
Autonomous Frontier-Based Exploration with VLM Guidance
Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Langua…
CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs
Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or…
Real-time video classification with PaliGemma: architecture patterns for low-latency VLM inference
In a previous article, we benchmarked three open-source Vision-Language Models on zero-shot object...…
Stop retraining YOLO: a developer’s guide to zero-shot object detection with generative VLMs
If you have ever maintained a computer vision pipeline in a factory, warehouse, or construction site,...…
GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception …
Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks de…
VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, w…
ChatGPT/Gemini can now draw on your screen to help you navigate complex software
SketchVLM: Vision-language models can annotate images to explain thoughts and guide users.…