📄Paper: RORA-VLM: Robust Retrieval Augmentation for Vision Language Models
The paper 'RORA-VLM: Robust Retrieval Augmentation for Vision Language Models' was submitted to ICLR 2025 but was rejected. It proposes a framework that augments Vision Language Models (VLMs) with external knowledge retrieval to improve question answering. The approach combines a two-stage retrieval process with noise-resilient training so that reasoning stays stable when the retrieved information is inaccurate or irrelevant.
- The paper introduces a robust retrieval framework for Vision Language Models.
- It employs a two-stage retrieval process to enhance the model's ability to answer questions using external knowledge.
- The training method intentionally introduces noise to help the model learn to ignore irrelevant information.
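The noise-injection idea in the last bullet can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `build_noisy_context` and `format_prompt`, the prompt layout, and the choice to sample distractors uniformly at random are all assumptions made for illustration. The sketch only shows the general pattern of mixing irrelevant passages into the retrieved context during training, so that a model trained on such examples has to learn which passages to ignore.

```python
import random
from typing import List, Sequence


def build_noisy_context(
    relevant: Sequence[str],
    distractor_pool: Sequence[str],
    num_distractors: int,
    rng: random.Random,
) -> List[str]:
    """Mix relevant passages with randomly sampled distractors, then shuffle.

    Hypothetical helper: the sampling strategy is an assumption, not the
    procedure described in the paper.
    """
    # Never ask for more distractors than the pool contains.
    k = min(num_distractors, len(distractor_pool))
    distractors = rng.sample(list(distractor_pool), k)
    context = list(relevant) + distractors
    # Shuffle so position gives no hint about which passages are relevant.
    rng.shuffle(context)
    return context


def format_prompt(question: str, context: Sequence[str]) -> str:
    """Render a training prompt from the question and the (noisy) context.

    The prompt layout is illustrative only.
    """
    lines = [f"[{i + 1}] {passage}" for i, passage in enumerate(context)]
    return (
        "Retrieved knowledge:\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    rng = random.Random(0)
    context = build_noisy_context(
        relevant=["Passage describing the habitat of the bird in the image."],
        distractor_pool=[
            "Passage about an unrelated bird species.",
            "Passage about a city's public transport.",
            "Passage about a historical event.",
        ],
        num_distractors=2,
        rng=rng,
    )
    print(format_prompt("Where is this bird mainly found?", context))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which makes it easier to compare training runs with different noise levels.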
Opening excerpt (first ~120 words)
Mercy, posted on May 29
#ai #vlm #rag #paper
Submitted to the International Conference on Learning Representations (ICLR) 2025

💡 Why I read this
I was recently looking for paper ideas and came across this one. It was submitted to ICLR 2025 but was rejected, which is a bit of a pity.
The paper mainly applies RAG to VLMs so that the model can draw on external knowledge when answering questions.
In many VQA tasks, the answer is not actually in the image; it requires additional background knowledge.
For example, an image shows a bird and the question is: "Where is this bird mainly found?"
The image only tells you what the bird looks like; information such as its habitat has to be looked up.
The main problem the paper addresses is: "When the retrieved knowledge is noisy, how can a VLM still reason reliably?"

🧠 Core idea
The authors propose a robust…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.