Introduction
Following the same philosophy, I aim to first learn about the RAG system by constructing a generic version and then applying optimization techniques. This approach will help me understand the pain points, which will in turn guide our brainstorming toward viable applications. As someone who is not very tech-savvy, my primary goal is to grasp the mental model underlying the RAG system.
Imagine you are Harry Potter. You want to find everything about the fantastic beast, the Niffler, to prepare for your exam. You're led to the library and wish to retrieve all the relevant pieces of information from the books (for example, that the beast likes shiny objects, is British, and is chubby and cute) and visualize the Niffler. Then, all you need is RAG :)
RAG stands for “Retrieval-Augmented Generation.” This approach can be particularly powerful for tasks like question answering, where it’s important to provide responses that are not only fluent and coherent (thanks to the generative model) but also deeply grounded in specific information retrieved from the text corpus. The idea behind RAG is to combine the strengths of two types of models: a retrieval model and a generative model.
Retrieval Model - Finding the Puzzle Pieces: In our analogy, the retrieval model is like having a magic bookshelf. Whenever you need a specific puzzle piece, you describe what you’re looking for, and the bookshelf presents you with a selection of puzzle pieces that closely match your description. These pieces are like the relevant pieces of information or text snippets retrieved from a vast database. The bookshelf doesn’t give you the exact position where each piece goes, but it ensures the pieces are relevant to the part of the puzzle you’re working on.
Generative Model - Putting the Puzzle Together: Once you have the right pieces (information), the generative model is like your skill in putting these pieces together in a way that makes sense. It looks at the pieces you’ve gathered, considers how they fit with what you’ve already assembled (the context or the prompt you’re responding to), and then skillfully places them to form a coherent picture (or, in the case of RAG, a coherent piece of text).

Visualization of RAG System
Here's an animated overview of a RAG system. More theoretical and technical details can be found in the original paper (Lewis et al., 2020).
Now, we have an overview of the RAG system, which is a good starting point. Next, we’ll delve into technical details and discuss how to enhance and improve your RAG system.
Implementation of RAG system
Intuitively, a well-performing RAG system should, first of all, be able to retrieve relevant and coherent information; secondly, it should operate efficiently. Keeping these two points in mind, we will discuss the techniques in greater depth.
Step 1: Index Data
Load documents
Chunk documents
- Goal: create document chunks that are concise and meaningful so that it’s effective to retrieve relevant pieces of information. Here are some optimization techniques:
- Chunk size: smaller chunks mean less noise but also less context; larger chunks mean more context but also more noise. Developers need to strike a balance.
- Document hierarchies: provide a file-directory-like structure that organizes the chunks of information and helps you locate the most relevant ones with speed and repeatability. For example, if we want a summary of only the medications newly added to pharmacies in California after 2022, we can leverage contextual information such as the year to locate the relevant chunks.
Vectorize document chunks -> embeddings + document chunk ID
Store embeddings in vector database
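To make Step 1 concrete, here is a minimal indexing sketch in Python. It is only a sketch under stated assumptions: it uses the sentence-transformers package with an arbitrary model choice (all-MiniLM-L6-v2), an illustrative character-based chunker with made-up chunk size and overlap values, and a plain in-memory NumPy array as a stand-in for a real vector database.

```python
# Minimal indexing sketch (Step 1): load -> chunk -> embed -> store.
# Assumptions: sentence-transformers is installed; the model name, chunk size,
# overlap, and the dict-based "vector database" are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks (balancing noise vs. context)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Load documents (plain strings here; real loaders would parse PDFs, HTML, etc.)
documents = [
    "Nifflers are attracted to anything shiny and hoard it in their pouches...",
    "Care of Magical Creatures exam notes...",
]

# Chunk documents and keep a document chunk ID for each chunk
chunks, chunk_ids = [], []
for doc_id, doc in enumerate(documents):
    for i, chunk in enumerate(chunk_document(doc)):
        chunks.append(chunk)
        chunk_ids.append(f"doc{doc_id}-chunk{i}")

# Vectorize document chunks -> embeddings + chunk IDs, then store them
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.array(model.encode(chunks, normalize_embeddings=True))
vector_store = {"ids": chunk_ids, "embeddings": embeddings, "texts": chunks}
```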
Step 2: Retrieve relevant information
Vectorize the user query to create an embedding
Retrieve relevant document chunk IDs
Retrieve document chunks from vector database by document chunk IDs
Pass the user query + the relevant document chunks to the LLM to generate the response
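Continuing the indexing sketch above, a retrieval step could look like the following. It reuses the model and vector_store objects from the Step 1 sketch; because the embeddings were normalized, cosine similarity reduces to a dot product, and the top-k default is arbitrary.

```python
# Retrieval sketch (Step 2): embed the query, score every chunk, return top-k.
# Assumes `model` and `vector_store` from the Step 1 sketch are in scope.
import numpy as np

def retrieve(query: str, k: int = 3) -> list[dict]:
    query_emb = model.encode([query], normalize_embeddings=True)[0]  # vectorize user query
    scores = vector_store["embeddings"] @ query_emb                  # cosine similarity (normalized vectors)
    top_idx = np.argsort(scores)[::-1][:k]                           # indices of the top-k chunks
    return [
        {"id": vector_store["ids"][i],
         "text": vector_store["texts"][i],
         "score": float(scores[i])}
        for i in top_idx
    ]

retrieved_chunks = retrieve("What does a Niffler like?")
```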
Step 3: Generate response from LLM
Optimize prompt
Generate response with fine-tuned LLM
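For Step 3, the sketch below assembles a grounded prompt from the user query and the retrieved chunks. The prompt template is just one possible phrasing, and call_llm is a hypothetical placeholder for whichever LLM client (hosted or fine-tuned) you plug in; the point is the structure, not a specific API.

```python
# Generation sketch (Step 3): build a grounded prompt, then call an LLM.
# Assumes `retrieved_chunks` from the Step 2 sketch; `call_llm` is a hypothetical stub.

PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client of choice here.")

prompt = build_prompt("What does a Niffler like?", retrieved_chunks)
# response = call_llm(prompt)
```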
Challenges & Solutions
We have laid out the basic implementation of a RAG system. Here, we discuss the challenges involved in making the system more advanced and their corresponding solutions.
- Bad retrieval
- Low precision: not all retrieved chunks are relevant, which leads to hallucination and "lost in the middle" problems.
- Low recall: not all relevant chunks are retrieved, so the LLM lacks enough context to synthesize an answer.
- Obsolete information: the data is out of date or redundant.
- Bad response
- Hallucination: the LLM makes up an answer that isn't in the context.
- Irrelevance: the LLM generates an answer that doesn't address the question.
- Toxicity/bias: the LLM generates an answer that's harmful or offensive.
- Data: Can we store additional information beyond raw text chunks?
- Embeddings: Can we optimize our embedding representations?
- Retrieval:
- Challenge: Can we do better than top-k embedding lookup, given that multiple, disparate document sections can be relevant to a given question?
- Solution: Allow a varying number of retrieved chunks based on relevance scores (see the sketch after this list). Moreover, we can incorporate subject-matter expert annotations and the logic used for the training set (Snorkel AI - Enterprise LLM Summit, 2024).
- Synthesis: Can we use LLMs for more than generation? For example, multiple, disparate document sections can be relevant to a given question.
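As a sketch of the varying-k idea above: instead of always returning a fixed top-k, keep every chunk whose similarity clears a threshold, up to a cap. It reuses model and vector_store from the indexing sketch, and the threshold and cap values are assumptions to tune on your own data.

```python
# Adaptive retrieval sketch: keep chunks above a relevance threshold instead of a fixed top-k.
# Assumes `model` and `vector_store` from the indexing sketch; threshold/max_k are tunable guesses.
import numpy as np

def retrieve_adaptive(query: str, threshold: float = 0.45, max_k: int = 10) -> list[dict]:
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = vector_store["embeddings"] @ query_emb
    ranked = np.argsort(scores)[::-1]                        # best-first chunk indices
    selected = [i for i in ranked[:max_k] if scores[i] >= threshold]
    return [
        {"id": vector_store["ids"][i],
         "text": vector_store["texts"][i],
         "score": float(scores[i])}
        for i in selected
    ]
```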
Evaluation
- How do we properly evaluate a RAG system?
- Evaluate end-to-end (E2E)
- Evaluate the final generated response given input
- Create dataset
- Input: query
- [Optional] Output: the "ground-truth" answer
- Run through full RAG pipeline
- Collect evaluation metrics
- If no labels: label-free evaluator
- Faithfulness
- Relevancy
- Adheres to guidelines
- Toxicity-free
- If labels: with-label evaluator
- Accuracy
- Evaluate the separate parts (retrieval, synthesis) to diagnose which part needs improvement; a small metrics sketch follows this list.
- Retrieval - retriever evaluator (MRR, precision@k, NDCG)
- Evaluate the quality of retrieved chunks given user query
- Create dataset (human-labeled or synthetic)
- Input: query
- Output: the “ground-truth” documents relevant to the query
- Run retriever over dataset
- Measure ranking metrics
- Success rate / hit-rate
- MRR
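To ground the evaluation ideas above, here is a small sketch that computes hit rate and MRR for the retriever and a simple with-label accuracy check end-to-end. It reuses retrieve and build_prompt from the earlier sketches, and the tiny datasets and their labels are hypothetical; label-free metrics such as faithfulness typically rely on an LLM judge, which is omitted here.

```python
# Evaluation sketch: retriever metrics (hit rate, MRR) and a simple with-label E2E accuracy.
# Assumes `retrieve` and `build_prompt` from the earlier sketches; datasets and labels are hypothetical.

# Retriever evaluation: query -> ground-truth relevant chunk IDs
retrieval_dataset = [
    {"query": "What does a Niffler like?", "relevant_ids": {"doc0-chunk0"}},
]

def hit_rate_and_mrr(dataset, k: int = 5) -> tuple[float, float]:
    hits, reciprocal_ranks = 0, []
    for example in dataset:
        retrieved_ids = [c["id"] for c in retrieve(example["query"], k=k)]
        ranks = [r for r, cid in enumerate(retrieved_ids, start=1) if cid in example["relevant_ids"]]
        hits += 1 if ranks else 0
        reciprocal_ranks.append(1.0 / ranks[0] if ranks else 0.0)
    return hits / len(dataset), sum(reciprocal_ranks) / len(dataset)

# End-to-end (with-label) evaluation: query -> ground-truth answer
e2e_dataset = [
    {"query": "What does a Niffler like?", "answer": "shiny objects"},
]

def e2e_accuracy(dataset) -> float:
    correct = 0
    for example in dataset:
        prompt = build_prompt(example["query"], retrieve(example["query"]))
        response = ""  # placeholder: replace with your LLM call, e.g. call_llm(prompt)
        correct += int(example["answer"].lower() in response.lower())
    return correct / len(dataset)
```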
Discussion
Why are we excited about RAG? By orchestrating a retriever and a generator and grounding the output in factual data, RAG reduces hallucination. For business-specific tasks, running RAG on domain-specific documents generates responses of higher quality.
Reference
Some of the development techniques we’ll cover are sourced from public resources, including
- LlamaIndex's talk on how to build production-ready RAG apps
- LangChain explained in 13 mins
- Medium blog: first intro to complex RAG
- MongoDB Vector Search
Projects
WIP, stay tuned.