WeSearch

ChatGPT/Gemini can now draw on your screen to help you navigate complex software


SketchVLM: Vision-language models can annotate images to explain their reasoning and guide users.

Original article · GitHub
Opening excerpt (first ~120 words):

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision–language models (VLMs) such as Gemini-3-Pro and GPT-5 typically respond with only text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across six benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 points and sketch quality by up to +48.3% over image-editing and fine-tuned…
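To make the "non-destructive, editable SVG overlay" idea concrete, here is a minimal sketch of how such an overlay could be composed: the original raster image is embedded unmodified, and each model-drawn annotation is a separate SVG element layered on top. This is an illustrative example only; the function name, file paths, and annotation strings are hypothetical and not taken from the SketchVLM code.

```python
# Hypothetical sketch of a non-destructive SVG overlay: the original
# image pixels are never edited, and every annotation is its own
# editable SVG node that can be moved or deleted independently.
import base64
from pathlib import Path

def overlay_svg(image_path: str, width: int, height: int,
                annotations: list[str]) -> str:
    """Wrap the unmodified image in an SVG and layer annotations on top."""
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    body = "\n  ".join(annotations)
    return f"""<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">
  <image href="data:image/png;base64,{img_b64}" width="{width}" height="{height}"/>
  {body}
</svg>"""

# Example: a line and a label, e.g. marking a step of a maze path.
annotations = [
    '<line x1="20" y1="30" x2="120" y2="30" stroke="red" stroke-width="3"/>',
    '<text x="125" y="35" fill="red" font-size="16">turn right here</text>',
]
svg = overlay_svg("maze.png", 256, 256, annotations)
Path("maze_annotated.svg").write_text(svg)
```

Because each mark lives in its own SVG node rather than in edited pixels, a user (or the model, in a later turn) can adjust or remove any single annotation, which is presumably what makes the overlays "editable" in the paper's sense.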


