Advanced RAG Techniques
A guide on different techniques to improve the
performance of your Retrieval-Augmented
Generation applications.
Retrieval-augmented generation (RAG) provides large language models (LLMs) with information from an external knowledge source to help reduce hallucinations and increase the factual accuracy of the generated responses.

A naive RAG pipeline consists of four components: an embedding model, a vector database, a prompt template, and a generative LLM. At inference time, it embeds the user query to retrieve relevant document chunks of information from the vector database, which it stuffs into the LLM's prompt to generate an answer.

While this naive approach is straightforward, it has many limitations and can often lead to low-quality responses.

This ebook discusses various advanced techniques you can apply to improve the performance of your RAG system. These techniques can be applied at various stages in the RAG pipeline, as shown below:

[Figure: The RAG pipeline (Documents → Chunks → Indexing → Query → Embedding Model → Retrieval → Vector Database → Context → Prompt Template → LLM → Response), annotated with where each group of techniques applies:
Indexing Optimization Techniques: Data Pre-processing, Chunking Strategies
Pre-retrieval Optimization Techniques: Query Transformation, Query Decomposition, Query Routing
Retrieval Optimization Strategies: Metadata Filtering, Excluding Vector Search Outliers, Hybrid Search, Embedding Model Fine-tuning
Post-retrieval Optimization Techniques: Re-ranking, Context Post-processing, Prompt Engineering, LLM Fine-tuning]
Indexing Optimization Techniques

Index optimization techniques enhance retrieval accuracy by structuring external data in more organized, searchable ways. These techniques can be applied to both the data pre-processing and chunking stages in the RAG pipeline, ensuring that relevant information is effectively retrieved.

Data Pre-Processing

Data pre-processing is fundamental to the success of any RAG system, as the quality of your processed data directly impacts the overall performance. By thoughtfully transforming raw data into a structured format suitable for LLMs, you can significantly enhance your system's effectiveness before considering more complex optimizations.

While there are several common pre-processing techniques available, the optimal approach and sequence should be tailored to your specific use case and requirements.

The process usually begins with data acquisition and integration, where diverse document types from multiple sources are collected and consolidated into a 'knowledge base'.

[Figure: Raw data from multiple sources (Source 1, Source 2, Source 3) is consolidated into a single knowledge base.]
Data Extraction and Data Parsing
Data extraction and parsing convert the raw data into a form that can be accurately processed for downstream tasks. For text-based formats like Markdown, Word documents, and plain text, extraction techniques focus on preserving structure while capturing relevant content.
Scanned documents, images, and PDFs containing image-based text/tables require OCR
(Optical Character Recognition) technology to convert into an ‘LLM-ready’ format. However,
recent advancements in multimodal retrieval models, such as ColPali and ColQwen, have
revolutionized this process. These models can directly embed images of documents, potentially
making traditional OCR obsolete.
Web content often involves HTML parsing, utilizing DOM traversal to extract structured data,
while spreadsheets demand specialized parsing to handle cell relationships. Metadata
extraction is also crucial across file types, pulling key details like author, timestamps, and other
document properties (see Metadata Filtering).
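As an illustrative sketch (not from the original text), here is what HTML parsing with DOM traversal plus basic metadata extraction might look like using BeautifulSoup; the tag selection and metadata fields are assumptions for the example:

```python
# Hypothetical sketch of HTML parsing and metadata extraction (assumes beautifulsoup4 is installed).
from bs4 import BeautifulSoup
from datetime import datetime, timezone

def parse_html(html: str, source_url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Drop boilerplate elements before extracting text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = "\n".join(
        el.get_text(" ", strip=True) for el in soup.find_all(["h1", "h2", "h3", "p", "li"])
    )
    return {
        "text": text,
        "metadata": {  # stored alongside the content (see Metadata Filtering)
            "title": soup.title.get_text(strip=True) if soup.title else None,
            "source": source_url,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }
```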
Data Cleaning

Data cleaning and noise reduction involves removing irrelevant information (such as headers, footers, or boilerplate text), correcting inconsistencies, and handling missing values while maintaining the extracted data's structural integrity.

Data Transformation

This involves converting all extracted and processed content into a standardized schema, regardless of the original file type. It's at this stage that document partitioning (not to be confused with chunking) occurs, separating document content into logical units or elements (e.g., paragraphs, sections, tables).

Chunking Strategies

Chunking divides large documents into smaller, semantically meaningful segments. This process optimizes retrieval by balancing context preservation with manageable chunk sizes. Various common techniques exist for effective chunking in RAG, some of which are discussed below:
Fixed-size chunking is a simple technique that splits text into
chunks of a predetermined size, regardless of content structure.
While it's cost-effective, it lacks contextual awareness. This can be
improved by using overlapping chunks, allowing adjacent chunks to
share some content.
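As a minimal sketch (not from the original text), fixed-size chunking with overlap takes only a few lines; the chunk size and overlap values below are arbitrary examples:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks; adjacent chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("your document text ... " * 100)
```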
Recursive chunking offers more flexibility by initially splitting text
using a primary separator (like paragraphs) and then applying
secondary separators (like sentences) if chunks are still too large.
This technique respects the document's structure and adapts well
to various use cases.
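If you use a framework, recursive chunking is usually available off the shelf. The sketch below assumes LangChain's text splitter; the separator order and sizes are illustrative choices, not prescribed values:

```python
# Assumes the langchain-text-splitters package; sizes and separators are illustrative.
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("report.txt").read()  # hypothetical input document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                        # target chunk size in characters
    chunk_overlap=50,                      # characters shared between adjacent chunks
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs first, then lines, sentences, words
)
chunks = splitter.split_text(document_text)
```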
Document-based chunking creates chunks based on the natural
divisions within a document, such as headings or sections. It's
particularly effective for structured data like HTML, Markdown, or
code files but less useful when the data lacks clear structural
elements.
Semantic chunking divides text into meaningful units, which are then vectorized. These units are then combined into chunks based on the cosine distance between their embeddings, with a new chunk formed whenever a significant context shift is detected. This method balances semantic coherence with chunk size.
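A minimal sketch of the idea, assuming a sentence-transformers embedding model (the model name and the 0.3 distance threshold are illustrative assumptions):

```python
# Hypothetical semantic chunking sketch using sentence-transformers.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_chunks(sentences: list[str], max_distance: float = 0.3) -> list[str]:
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between adjacent sentence embeddings (vectors are normalized).
        distance = 1 - float(np.dot(embeddings[i - 1], embeddings[i]))
        if distance > max_distance:  # significant context shift -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```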
LLM-based chunking is an advanced technique that uses an LLM
to generate chunks by processing text and creating semantically
isolated sentences or propositions. While highly accurate, it's also
the most computationally demanding approach.
Each of the discussed techniques has its strengths, and the choice depends on the RAG system's
specific requirements and the nature of the documents being processed. New approaches continue to emerge, such as late chunking, which processes text through long-context embedding models before splitting it into chunks to better preserve document-wide context.
Pre-retrieval Optimization Techniques

Pre-retrieval optimization techniques focus on the user query itself, rewriting, expanding, decomposing, or routing it to specialized pipelines so that the search query sent to the retriever accurately reflects the user's intent.

Query Transformation
Using the user query directly as the search query for retrieval can lead to poor search results.
That’s why turning the raw user query into an optimized search query is essential. Query
transformation refines and expands unclear, complex, or ambiguous user queries to improve the
quality of search results.
Query Rewriting involves reformulating the original user query to make it more suitable for retrieval. This is particularly useful in scenarios where user queries are not optimally phrased or are worded differently from the content they target. This can be achieved by using an LLM to rephrase the original user query or by employing specialized smaller language models trained specifically for this task.
This approach is known as 'Rewrite-Retrieve-Read', as opposed to the traditional 'Retrieve-then-Read' paradigm.
[Figure: Query rewriting. The raw query "Can you tell me which movies were popular last summer? I'm trying to find a blockbuster film." is passed to a query re-writer (LLM), which produces the rewritten query "What were the top-grossing movies released last summer?" before it is sent to the retriever to fetch documents.]
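A minimal sketch of the rewrite step in a Rewrite-Retrieve-Read pipeline, assuming the OpenAI Python client; the model name and prompt wording are illustrative assumptions:

```python
# Hypothetical LLM-based query rewriting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(raw_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a concise, "
                                          "keyword-rich search query. Return only the query."},
            {"role": "user", "content": raw_query},
        ],
    )
    return response.choices[0].message.content.strip()

search_query = rewrite_query("Can you tell me which movies were popular last summer?")
```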
Query Expansion focuses on broadening the original query to capture more relevant
information. This involves using an LLM to generate multiple similar queries based on the user's
initial input. These expanded queries are then used in the retrieval process, increasing both the
number and relevance of retrieved documents.
Note: Due to the increased quantity of retrieved documents, a reranking step is often necessary
to prioritize the most relevant results (see Re-ranking).
[Figure: Query expansion. The raw query "What are the benefits of meditation?" is expanded by a query re-writer (LLM) into multiple queries, such as "How does meditation reduce stress and anxiety?", "Can meditation improve focus and concentration?", "What are the long-term mental health benefits of meditation?", and "How does meditation affect sleep quality?", all of which are passed to the retriever.]
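A sketch of expansion followed by multi-query retrieval; the `search` callable stands in for your retriever and is a placeholder, as is the model name:

```python
# Hypothetical query expansion: generate similar queries, retrieve for each, merge and de-duplicate.
from openai import OpenAI

client = OpenAI()

def expand_query(raw_query: str, n: int = 4) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content":
                   f"Generate {n} alternative search queries for: '{raw_query}'. One per line."}],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def retrieve_with_expansion(raw_query: str, search) -> list[str]:
    seen, merged = set(), []
    for query in [raw_query, *expand_query(raw_query)]:
        for doc in search(query):   # `search` is your retriever (placeholder)
            if doc not in seen:     # de-duplicate documents returned by multiple queries
                seen.add(doc)
                merged.append(doc)
    return merged                   # typically followed by re-ranking (see Re-ranking)
```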
Query Decomposition

Query decomposition is a technique that breaks down complex queries into simpler sub-queries. This is useful for answering multifaceted questions requiring diverse information sources, leading to more precise and relevant search results.

The process typically involves two main stages: decomposing the original query into smaller, focused sub-queries using an LLM and then processing these sub-queries to retrieve relevant information.

For example, the complex query "Why am I always so tired even though I eat healthy? Should I be doing something different with my diet or maybe try some diet trends?" can be decomposed into the following three simpler sub-queries:

What are the common dietary factors that can cause fatigue?
What are some popular diet trends and their effects on energy levels?
How can I determine if my diet is balanced and supports my energy needs?

Each sub-query targets a specific aspect, enabling the retriever to find relevant documents or chunks. Sub-queries can also be processed in parallel to improve efficiency. Additional techniques like keyword extraction and metadata filter extraction can help identify both key search terms and structured filtering criteria, enabling more precise searches. After retrieval, the system aggregates and synthesizes results from all sub-queries to generate a comprehensive answer to the original complex query.

Query Routing

Query routing is a technique that directs queries to specific pipelines based on their content and intent, enabling a RAG system to handle diverse scenarios effectively. It works by analyzing each query and choosing the best retrieval method or processing pipeline to provide an accurate response. This often requires implementing multi-index strategies, where different types of information are organized into separate, specialized indexes.

The process can include agentic elements, where AI agents decide how to handle each query. These agents evaluate factors such as query complexity and domain to determine the optimal approach. For example, fact-based questions may be routed to one pipeline, while those requiring summarization or interpretation are sent to another.

Agentic RAG functions like a network of specialized agents, each with different expertise. It can choose from various data stores, retrieval strategies (keyword-based, semantic, or hybrid), query transformations (for poorly structured queries), and specialized tools or APIs, such as text-to-SQL converters or even web search capabilities.

[Figure: A single-agent RAG system (router). A retrieval agent receives the query and selects among available tools: vector search engine A over Collection A, vector search engine B over Collection B, a calculator, and web search, before passing the retrieved context to the LLM to generate the response.]
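A minimal sketch of an LLM-based router; the pipeline names, handlers, and model are illustrative placeholders, not a prescribed design:

```python
# Hypothetical query router: classify the query's intent, then dispatch to a matching pipeline.
from openai import OpenAI

client = OpenAI()

PIPELINES = {
    "factual":   lambda q: f"run vector search over the knowledge base for: {q}",
    "summarize": lambda q: f"retrieve long-form documents and summarize for: {q}",
    "web":       lambda q: f"fall back to web search for: {q}",
}

def route(query: str) -> str:
    label = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content":
                   "Classify this query as exactly one of: factual, summarize, web.\n" + query}],
    ).choices[0].message.content.strip().lower()
    handler = PIPELINES.get(label, PIPELINES["factual"])  # default route if the label is unexpected
    return handler(query)
```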
Retrieval Optimization Strategies

Retrieval optimization strategies aim to improve retrieval results by directly manipulating the way in which external data is retrieved in relation to the user query. This can involve refining the search query, such as using metadata to filter candidates or excluding outliers, or even fine-tuning an embedding model on external data to improve the quality of the underlying embeddings themselves.

Metadata Filtering
Metadata is the additional information attached to each document or chunk in a vector
database, providing valuable context to enhance retrieval. This supplementary data can include
timestamps, categories, author info, source references, languages, file types, etc.
When retrieving content from a vector database, metadata helps refine results by filtering out
irrelevant objects, even when they are semantically similar to the query. This narrows the search
scope and improves the relevance of the retrieved information.
Another benefit of using metadata is time-awareness. By incorporating timestamps as
metadata, the system can prioritize recent information, ensuring the retrieved knowledge
remains current and relevant. This is particularly useful in domains where information freshness
is critical.
To get the most out of metadata filtering, it's important to plan carefully and choose metadata
that improves search without adding unnecessary complexity.
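As a sketch of what metadata filtering can look like in practice with the Weaviate Python client (v4); the collection name, property names, and filter values are illustrative assumptions:

```python
# Hypothetical metadata-filtered vector search with the Weaviate v4 Python client.
import weaviate
from weaviate.classes.query import Filter
from datetime import datetime, timezone

client = weaviate.connect_to_local()
articles = client.collections.get("Article")  # illustrative collection name

results = articles.query.near_text(
    query="advanced RAG techniques",
    limit=5,
    # Only consider chunks from the documentation category published after 2023.
    filters=(
        Filter.by_property("category").equal("documentation")
        & Filter.by_property("published_at").greater_than(datetime(2023, 1, 1, tzinfo=timezone.utc))
    ),
)
for obj in results.objects:
    print(obj.properties["title"])

client.close()
```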
Excluding Vector Search Outliers
The most straightforward approach to defining the number of returned results is explicitly
setting a value for the top k (top_k) results. If you set top_k to 5, you'll get the five closest
vectors, regardless of their relevance. While easy to implement, this can include poor matches
just because they made the cutoff.
Here are two techniques to manage the number of search results implicitly that can help with
excluding outliers:
Distance thresholding adds a quality check by
setting a maximum allowed distance between
vectors. Any result with a distance score above this
threshold gets filtered out, even if it would have
made the top_k cutoff. This helps remove the obvious
bad matches but requires careful threshold
adjustment.
Autocut is more dynamic - it looks at how the result
distances are clustered. Instead of using fixed limits,
it groups results based on their relative distances
from your query vector. When there's a big jump in
distance scores between groups, Autocut can cut off
the results at that jump. This catches outliers that
might slip through top_k or basic distance thresholds.
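In Weaviate, for example, both approaches map to query parameters. This sketch assumes the v4 Python client and an existing collection; the threshold and autocut values are illustrative:

```python
# Hypothetical sketch: limiting vector search results by distance threshold and by autocut.
import weaviate

client = weaviate.connect_to_local()
articles = client.collections.get("Article")  # illustrative collection name

# Distance thresholding: drop anything farther than 0.25 from the query vector.
thresholded = articles.query.near_text(
    query="formula errors after update", distance=0.25, limit=10
)

# Autocut: cut the result list at the first large jump in distance scores.
autocut = articles.query.near_text(query="formula errors after update", auto_limit=1)

client.close()
```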
Hybrid Search

[Figure: Hybrid search. The query is run through both vector search and keyword search; each produces its own ranking of contexts (A, B, C), and a fusion algorithm merges the two ranked lists into the final result set.]
Hybrid search combines the strengths of vector-based semantic search with traditional
keyword-based methods. This technique aims to improve the relevance and accuracy of
retrieved information in RAG systems.
The key to hybrid search lies in the 'alpha' (α) parameter, which controls the balance between
semantic and keyword-based search methods:
α = 1: Pure semantic search
α = 0: Pure keyword-based search
0 < α < 1: Weighted combination of both methods
This approach is particularly beneficial when you need both contextual understanding and exact
keyword matching.
Consider a technical support knowledge base for a software company. A user might submit a
query like "Excel formula not calculating correctly after update". In this scenario, semantic
search helps understand the context of the problem, potentially retrieving articles about formula
errors, calculation issues, or software update impacts. Meanwhile, keyword search ensures that
documents containing specific terms like "Excel" and "formula" are not overlooked.
Therefore, while implementing hybrid search, it’s crucial to adjust the alpha parameter based on
your specific use case to optimize the performance.
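In Weaviate, for example, hybrid search is a single query call where alpha sets the weighting. This sketch assumes the v4 Python client and an illustrative collection:

```python
# Hypothetical hybrid search combining keyword (BM25) and vector search, weighted by alpha.
import weaviate

client = weaviate.connect_to_local()
kb = client.collections.get("SupportArticle")  # illustrative collection name

results = kb.query.hybrid(
    query="Excel formula not calculating correctly after update",
    alpha=0.5,   # 0 = pure keyword search, 1 = pure semantic search, 0.5 = equal weighting
    limit=5,
)
for obj in results.objects:
    print(obj.properties["title"])

client.close()
```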
Embedding Model Fine-Tuning
Off-the-shelf embedding models are usually trained on large general datasets to embed a wide
range of data inputs. However, embedding models can fail to capture the context and nuances
of smaller, domain-specific datasets.
Fine-tuning embedding models on custom datasets can significantly improve the quality of
embeddings, subsequently improving performance on downstream tasks like RAG. Fine-tuning
improves embeddings to better capture the dataset's meaning and context, leading to more
accurate and relevant retrievals in RAG applications.
To fine-tune an existing embedding model you first need to select a base model that you would
like to improve. Next, you begin the fine-tuning process by providing the model with your
domain-specific data. During this process, the loss function adjusts the model’s embeddings so
that semantically similar items are placed closer together in the embedding space. To evaluate a
fine-tuned embedding model, you can use a validation set of curated query-answer pairs to
assess the quality of retrieval in your RAG pipeline. Now, the model is ready to generate more
accurate and representative embeddings for your specific dataset.
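A condensed sketch of this loop with sentence-transformers, using its classic model.fit API (the base model, training pairs, and hyperparameters are illustrative; newer library versions also offer a Trainer-based API):

```python
# Hypothetical embedding model fine-tuning sketch with sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

base = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative base model

# Domain-specific (query, relevant passage) pairs; replace with your own dataset.
train_examples = [
    InputExample(texts=["What is the statute of limitations for fraud?",
                        "The limitation period for fraud claims is ..."]),
    InputExample(texts=["Symptoms of iron deficiency",
                        "Common symptoms of iron deficiency include fatigue ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# Pulls related pairs closer together; other in-batch passages act as negatives.
train_loss = losses.MultipleNegativesRankingLoss(base)

base.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
base.save("my-domain-embedding-model")
```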
The more niche your dataset is, the more it can benefit from embedding model fine-tuning.
Datasets with specialized vocabularies, like medical or legal datasets, are ideal for embedding
model fine-tuning, which helps extend out-of-domain vocabularies and enhance the accuracy
and relevance of information retrieval and generation in RAG pipelines.
Post-Retrieval Optimization Techniques

Post-retrieval optimization techniques aim to enhance the quality of generated responses, meaning that their work begins after the retrieval process has been completed. This diverse group of techniques includes using models to re-rank retrieved results, enhancing or compressing the retrieved context, prompt engineering, and fine-tuning the generative LLM on external data.

Re-Ranking
One proven method to improve the performance of your information retrieval system is to
leverage a retrieve-and-rerank pipeline. A retrieve-and-rerank pipeline combines the speed of
vector search with the contextual richness of a re-ranking model.
In vector search, the query and documents are processed separately. First, the documents are
pre-indexed. Then, at query time, the query is processed, and the documents closest in vector
space are retrieved. While vector search is a fast method to retrieve candidates, it can miss
contextual nuances.
This is where re-ranking models come into play. Because re-ranking models process the query
and the documents together at query time, they can capture more contextual nuances.
However, they are usually complex and resource-intensive and thus not suitable for first-stage
retrieval like vector search.
By combining vector search with re-ranking models, you can quickly cast a wide net of potential
candidates and then re-order them to improve the quality of relevant context in your prompt.
Note that when using a re-ranking model, you should over-retrieve chunks to filter out less
relevant ones later.
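A minimal sketch of the second stage using a cross-encoder re-ranker; the model name is an illustrative choice, and `retrieved_chunks` stands in for your over-retrieved candidates:

```python
# Hypothetical re-ranking sketch: score (query, chunk) pairs jointly with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

query = "Excel formula not calculating correctly after update"
retrieved_chunks = ["...", "...", "..."]  # e.g. top 25 chunks over-retrieved by vector search

scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)]
top_chunks = reranked[:5]  # keep only the best few for the prompt
```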
[Figure: Retrieve-and-rerank. The embedded query retrieves an initial set of context from the vector database; a reranker model then re-orders the retrieved context, and the re-ranked context is inserted into the prompt template for the LLM to generate the response.]
Context Post-Processing
After retrieval, it can be beneficial to post-process the retrieved context before generation. For example, if the retrieved context would benefit from additional information, you can enhance it with metadata. On the other hand, if it contains redundant data, you can compress it.
Context Enhancement with Metadata
One post-processing technique is to use metadata to enhance the retrieved context with
additional information to improve generation accuracy. While you can simply add additional
information from the metadata, such as timestamps, document names, etc., you can also apply
more creative techniques.
Context enhancement is particularly useful when data needs to be pre-processed into smaller chunk sizes to achieve better retrieval precision, but those small chunks don't contain enough contextual information to generate high-quality responses. In this case, you can apply a technique called
“Sentence window retrieval”. This technique chunks the initial document into smaller pieces
(usually single sentences) but stores a larger context window in its metadata. At retrieval time,
the smaller chunks help improve retrieval precision. After retrieval, the retrieved smaller chunks
are replaced with the larger context window to improve generation quality.
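A schematic sketch of the idea in plain Python, with no specific framework; the window size is an illustrative choice:

```python
# Hypothetical sentence-window retrieval: index single sentences, store a larger window as metadata.
def build_sentence_windows(sentences: list[str], window: int = 2) -> list[dict]:
    records = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - window): i + window + 1]  # n sentences before and after
        records.append({"text": sentence, "metadata": {"window": " ".join(context)}})
    return records  # embed `text`, store `metadata` alongside it in the vector database

def expand_retrieved(retrieved: list[dict]) -> list[str]:
    # After retrieval, swap each small chunk for its larger window to give the LLM more context.
    return [record["metadata"]["window"] for record in retrieved]
```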
Context Compression

RAG systems rely on diverse knowledge sources to retrieve relevant information. However, this
often results in the retrieval of irrelevant or redundant data, which can lead to suboptimal
responses and costly LLM calls (more tokens).
Context compression effectively addresses this challenge by extracting only the most
meaningful information from the retrieved data. This process begins with a base retriever that
retrieves documents/chunks related to the query. These documents/chunks are then passed
through a document compressor that shortens them and eliminates irrelevant content, ensuring
that valuable data is not lost in a sea of extraneous information.
Contextual compression reduces data volume, lowering retrieval and operational costs. Current
research focuses on two main approaches: embedding-based and lexical-based compression,
both of which aim to retain essential information while easing computational demands on RAG
systems.
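A minimal sketch of an LLM-based compressor; the prompt wording and model name are illustrative assumptions:

```python
# Hypothetical context compression: keep only the sentences relevant to the query.
from openai import OpenAI

client = OpenAI()

def compress_context(query: str, retrieved_chunks: list[str]) -> str:
    prompt = (
        "From the context below, copy only the sentences needed to answer the question. "
        "Do not paraphrase; omit everything irrelevant.\n\n"
        f"Question: {query}\n\nContext:\n" + "\n\n".join(retrieved_chunks)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # compressed context for the final prompt
```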
[Figure: Sentence window retrieval. Single sentences are embedded and retrieved; each retrieved sentence is then replaced by a larger window (n sentences before and after it) stored in its metadata before being passed to the LLM.]

[Figure: Context compression. The retrieved context is passed through a compressor LLM, and the compressed context is inserted into the prompt template before the final LLM call.]
Prompt Engineering
The generated outputs of LLMs are greatly influenced by the quality, tone, length, and structure of their
corresponding prompts. Prompt engineering is the practice of optimizing LLM prompts to improve the quality and
accuracy of generated output. Often one of the lowest-hanging fruits when it comes to techniques for improving RAG
systems, prompt engineering does not require making changes to the underlying LLM itself. This makes it an efficient
and accessible way to enhance performance without complex modifications.
There are several different prompting techniques that are especially useful in improving RAG pipelines.
Chain of Thought (CoT) prompting involves asking
the model to “think step-by-step” and break down
complex reasoning tasks into a series of
intermediate steps. This can be especially useful
when retrieved documents contain conflicting or
dense information that requires careful analysis.
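For instance, a RAG prompt template with a chain-of-thought instruction might look like the sketch below; the wording is an illustrative choice, not a prescribed template:

```python
# Hypothetical RAG prompt template with a chain-of-thought instruction.
COT_RAG_PROMPT = """You are a helpful assistant. Use ONLY the context below to answer.

Context:
{context}

Question: {question}

Think step-by-step: first list the relevant facts from the context, note any conflicts
between sources, then give your final answer on a new line starting with "Answer:"."""

prompt = COT_RAG_PROMPT.format(context="<retrieved chunks>", question="<user question>")
```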
Tree of Thoughts (ToT) prompting builds on CoT by
instructing the model to evaluate its responses at each step
in the problem-solving process or even generate several
different solutions to a problem and choose the best result.
This is useful in RAG when there are many potential pieces of
evidence, and the model needs to weigh different possible
answers based on multiple retrieved documents.
ReAct (Reasoning and Acting) prompting combines CoT with
agents, creating a system in which the model can generate
thoughts and delegate actions to agents that interact with
external data sources in an iterative process. ReAct can improve
RAG pipelines by enabling LLMs to dynamically interact with
retrieved documents, updating reasoning and actions based on
external knowledge to provide more accurate and contextually
relevant responses.
LLM Fine-Tuning

Pre-trained LLMs are trained on large, diverse datasets to acquire a sense of general knowledge, including language and grammar patterns, extensive vocabularies, and the ability to perform general tasks. When it comes to RAG, using pre-trained LLMs can sometimes result in generated output that is too generic, factually incorrect, or fails to directly address the retrieved context.

Fine-tuning a pre-trained model involves training it further on a specific dataset or task to adapt the model's general knowledge to the nuances of that particular domain, improving its performance in that area. Using a fine-tuned model in RAG pipelines can help improve the quality of generated responses, especially when the topic at hand is highly specialized.

High-quality domain-specific data is crucial for fine-tuning LLMs. Labeled datasets, like positive and negative customer reviews, can help fine-tuned models better perform downstream tasks like text classification or sentiment analysis. Unlabeled datasets, on the other hand, like the latest articles published on PubMed, can help fine-tuned models gain more domain-specific knowledge and expand their vocabularies.

During the fine-tuning process, the model weights of the pre-trained LLM (also referred to as a base model) are iteratively updated through a process called backpropagation to learn from the domain-specific dataset. The result is a fine-tuned LLM that better captures the nuances and requirements of the new data, such as specific terminology, style, or tone.

[Figure: LLM fine-tuning in the RAG pipeline. A pre-trained LLM is fine-tuned on a domain-specific dataset, and the resulting fine-tuned LLM replaces the generic generative model at the end of the pipeline.]

Summary

RAG enhances generative models by enabling them to reference external data, improving response accuracy and relevance while mitigating hallucinations and information gaps. Naive RAG retrieves documents based on query similarity and directly feeds them into a generative model for response generation. However, more advanced techniques, like the ones detailed in this guide, can significantly improve the quality of RAG pipelines by enhancing the relevance and accuracy of the retrieved information.

This e-book reviewed advanced RAG techniques that can be applied at various stages of the RAG pipeline to improve retrieval quality and the accuracy of generated responses.

Indexing optimization techniques, like data pre-processing and chunking, focus on formatting external data to improve its efficiency and searchability.

Pre-retrieval techniques aim to optimize the user query itself by rewriting, reformatting, or routing queries to specialized pipelines.

Retrieval optimization strategies often focus on refining search results during the retrieval phase.

Post-retrieval optimization strategies aim to improve the accuracy of generated results through a variety of techniques, including re-ranking retrieved results, enhancing or compressing the (retrieved) context, and manipulating the prompt or the generative model (LLM).

We recommend implementing a validation pipeline to identify which parts of your RAG system need optimization and to assess the effectiveness of advanced techniques. Evaluating your RAG pipeline enables continuous monitoring and refinement, ensuring that optimizations positively impact retrieval quality and model performance.

Ready to supercharge your RAG applications?
Start building today with a 14-day free trial of Weaviate Cloud (WCD).
Try Now | Contact Us