WeSearch

Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings

Partha Sarkar · 14 min read
Tags: proxy-pointer RAG · multimodal RAG · document structure · large language models · retrieval-augmented generation
⚡ TL;DR · AI summary

The article introduces Proxy-Pointer RAG, an open-source multimodal retrieval-augmented generation (RAG) pipeline that enables large language models to return images and other media in responses without relying on multimodal embeddings. By structuring documents as hierarchical trees of semantic blocks instead of text chunks, the system preserves context and ensures images are accurately grounded in relevant sections. The approach improves reliability in multimodal responses while maintaining scalability and reducing cost through a text-only processing pipeline.
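The core idea in the summary above can be sketched in a few lines: documents become a hierarchy of semantic blocks, each block carries plain-string pointers to its images, and retrieval runs over text only, so the image pointer simply travels with whichever block is retrieved. This is an illustrative sketch, not the article's implementation; every name here (`Block`, `retrieve`, the toy keyword scoring standing in for text embeddings, the file paths) is an assumption of this example.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """One semantic block in the document tree (section, subsection, ...).
    Images are attached as string 'proxy pointers', never embedded."""
    heading: str
    text: str = ""
    image_refs: list = field(default_factory=list)  # e.g. file paths or asset IDs
    children: list = field(default_factory=list)

def flatten(block, path=()):
    """Yield (section_path, block) pairs so every block keeps its hierarchical context."""
    path = path + (block.heading,)
    yield path, block
    for child in block.children:
        yield from flatten(child, path)

def retrieve(root, query):
    """Toy text-only retrieval: score blocks by keyword overlap with the query.
    A real pipeline would score with text embeddings; the point is that no
    multimodal embedding model is involved at any step."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set((b.heading + " " + b.text).lower().split())), path, b)
        for path, b in flatten(root)
    ]
    _, path, best = max(scored, key=lambda t: t[0])
    # The answer carries the block's text plus its image pointers, so the
    # client can render figures grounded in exactly the retrieved section.
    return {"section": " > ".join(path), "text": best.text, "images": best.image_refs}

# Hypothetical two-section manual used only to exercise the sketch.
doc = Block("Pump Manual", children=[
    Block("Installation", "Mount the pump on a level surface.",
          image_refs=["fig/mounting.png"]),
    Block("Maintenance", "Replace the seal kit every 500 hours.",
          image_refs=["fig/seal_kit.png"]),
])

result = retrieve(doc, "how do I replace the seal kit")
print(result["section"])  # → Pump Manual > Maintenance
print(result["images"])   # → ['fig/seal_kit.png']
```

Because the pointer is resolved only at render time, the retrieval index stays cheap and text-only, which is where the scalability and cost claims in the summary come from.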

Original article: Partha Sarkar, Towards Data Science.
Opening excerpt (first ~120 words)

Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings — "Structure is all you need." Partha Sarkar, Apr 30, 2026, 15 min read. Cover image generated using Gemini.

It is said that a picture is worth a thousand words. Yet very few enterprise chatbots can reliably return images grounded in their source documents. Why is that? Although returning images would be a significant enhancement over the text-only user experience, it is difficult to do reliably and consistently. There is, however, no dearth of use cases where this would be invaluable.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Towards Data Science.

