60 stories tagged with #computer-vision, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
NVIDIA Research Unlocks Advanced Grasping, Smarter Autonomous Driving and Agent Training at Scale
New NVIDIA Research breakthroughs show how training at scale — across gripper types, driving scenarios and virtual worlds — creates AI that generalizes to diverse applications.…
Effect of Demographic Bias on Skin Lesion Classification
In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, parti…
Apple's AI research will be in a computer vision conference before WWDC
Apple will present 14 AI research papers at the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition in Denver next week, spanning image generation, spatial understa…
Apple to showcase computer vision studies at annual conference in June
Apple has shared details of its participation in this year’s IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).…
FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning
Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-t…
AssetGen: Deployable 3D Asset Generation at Interactive Speed
While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We presen…
VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchma…
In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models
We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Hist…
Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology
We develop a rigorous algebraic framework for deep convolutional architectures, CNNs, ResNets, and encoder-decoder networks such as UNet, grounded in lattice theory and mathematic…
Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration
The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to…
Computer Vision Engineer, Looking for advice
Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly re…
Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations
Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations ca…
The TIME Machine: On The Power of Motion for Efficient Perception
Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models t…
Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering
Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floy…
CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection
Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing method…
Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking
Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve…
Lipschitz Optimization for Formal Verification of Homographies
The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles,…
SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion
Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hin…
Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution
Generative priors in Image Super-Resolution (SR) often compromise faithful restoration; we attribute this limitation to a fundamental spectral misalignment between isotropic object…
ChainFlow-VLA: Causal Flow Planning with Vision-Language Models
Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) model…
CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs
Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or…
Online Hand Gesture Recognition Using 3D Convolutional Neural Networks
In human-computer interaction, real-time detection and classification of dynamic hand gestures is challenging as: 1) the system must run in a real-time video stream and there is no…
Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provi…
AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education
Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language …
Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model
Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based s…
You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection
Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across al…
Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis
Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrain…
ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual cl…
Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning
Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge thes…
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA
Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually …
FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction
Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulat…
Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection
In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiot…
SDM: A Powerful Tool for Evaluating Model Robustness
Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant bre…
Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision
Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether a…
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation
Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision-language models derived …
Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities
Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications…
SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework
Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-spe…
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains…
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous …
Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures
In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevert…
EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis
Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limit…
Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection
We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate cor…
ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society
Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban …
NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets…
Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning
Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transforme…
Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts
Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually en…
Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models
Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holi…
Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a…
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics
Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one …
Rethinking Cross-Layer Information Routing in Diffusion Transformers
Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, obj…
SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this settin…
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designer…
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene und…
USV: Towards Understanding the User-generated Short-form Videos
Several large-scale video datasets have been published in recent years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos ha…
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iterat…
Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation
We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scen…
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment a…
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieve…
TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such re…