RAG - Sparse Embedding
Sparse embeddings represent a text chunk as a vector over a vocabulary dictionary, marking which vocabulary tokens the chunk contains. They are used mainly for direct text matching and keyword-based retrieval, where exact keyword matches matter more than semantic understanding. Modern systems often combine sparse and dense embeddings to improve retrieval performance.
- Basic sparse embeddings assign 1 to each vocabulary token that appears in the chunk and 0 to each one that does not.
- The main drawback of this basic representation is that it ignores how often a word occurs in a document.
- BM25 is a ranking algorithm that improves on TF-IDF by saturating term frequency and normalizing for document length when scoring documents against a query.
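To make the BM25 point concrete, here is a minimal sketch of the standard Okapi BM25 scoring formula in plain Python. It is not taken from the article: the function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # k1 makes repeated terms saturate; b penalizes longer-than-average documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample corpus, for illustration only.
documents = [
    "sparse embeddings match exact keywords",
    "dense embeddings capture semantic meaning",
    "keywords keywords keywords and more keywords in a long noisy document",
]
scores = bm25_scores("exact keywords", documents)
```

With this sample corpus, the first document should rank highest because it contains both query terms. The third document repeats "keywords" four times, but saturation and length normalization keep that repetition from dominating the score.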
Ramya Perumal. Posted on May 27. #ai #beginners #rag
Sparse means thinly spread, scattered, or not dense. In sparse embeddings, chunks are converted into tokens, and each token is represented based on whether it exists in the vocabulary dictionary. If a token is present in the vocabulary, it is assigned 1; otherwise, it is assigned 0.
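The 1/0 assignment can be sketched in a few lines of Python. This is an illustrative reading rather than the author's code: it assumes one vector position per vocabulary token and a simple whitespace tokenizer, and the helper names `build_vocabulary` and `binary_embedding` are hypothetical.

```python
def build_vocabulary(chunks):
    """Map each distinct lowercase token to a fixed vector position."""
    tokens = sorted({tok for chunk in chunks for tok in chunk.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def binary_embedding(text, vocabulary):
    """Return a 0/1 vector: 1 where a vocabulary token occurs in the text."""
    present = set(text.lower().split())
    # Dicts preserve insertion order, so positions follow the sorted vocabulary.
    return [1 if tok in present else 0 for tok in vocabulary]


# Hypothetical sample chunks, for illustration only.
chunks = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(chunks)
# Vocabulary order: cat, dog, mat, on, sat, the
vectors = [binary_embedding(chunk, vocabulary) for chunk in chunks]
# vectors[0] is [1, 0, 0, 0, 1, 1]
# vectors[1] is [0, 1, 1, 1, 1, 1]
```

Note that "the" occurs twice in the second chunk but still maps to a single 1, so the vector records presence only and loses word frequency.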
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.