Introduction
Following the same philosophy, I aim to first learn about the RAG system by constructing a generic version and then applying optimization techniques. This approach will help me understand the pain points, which will in turn guide our brainstorming toward viable applications. As someone who is not very tech-savvy, my primary goal is to grasp the mental model underlying the RAG system.
Imagine you are Harry Potter. You want to find everything about the fantastic beast, the Niffler, to prepare for your exam. You're led to the library and wish to retrieve all the relevant pieces of information from the books (for example, that the beast likes shiny objects, is British, and is chubby and cute) and visualize the Niffler. Then, all you need is RAG :)
RAG stands for “Retrieval-Augmented Generation.” This approach can be particularly powerful for tasks like question answering, where it’s important to provide responses that are not only fluent and coherent (thanks to the generative model) but also deeply grounded in specific information retrieved from the text corpus. The idea behind RAG is to combine the strengths of two types of models: a retrieval model and a generative model.
Retrieval Model - Finding the Puzzle Pieces: In our analogy, the retrieval model is like having a magic bookshelf. Whenever you need a specific puzzle piece, you describe what you’re looking for, and the bookshelf presents you with a selection of puzzle pieces that closely match your description. These pieces are like the relevant pieces of information or text snippets retrieved from a vast database. The bookshelf doesn’t give you the exact position where each piece goes, but it ensures the pieces are relevant to the part of the puzzle you’re working on.
Generative Model - Putting the Puzzle Together: Once you have the right pieces (information), the generative model is like your skill in putting these pieces together in a way that makes sense. It looks at the pieces you’ve gathered, considers how they fit with what you’ve already assembled (the context or the prompt you’re responding to), and then skillfully places them to form a coherent picture (or, in the case of RAG, a coherent piece of text).

Visualization of RAG System
Here's an animated overview of a RAG system. More theoretical and technical details can be found in the original paper (Lewis et al., 2020).
Now, we have an overview of the RAG system, which is a good starting point. Next, we’ll delve into technical details and discuss how to enhance and improve your RAG system.
Implementation of RAG system
Intuitively, a well-performing RAG system should, first of all, be able to retrieve relevant and coherent information; secondly, it should operate efficiently. Keeping these two points in mind, we will discuss the techniques in greater depth.
Step 1: Index Data
Load documents
Chunk documents
- Goal: create document chunks that are concise and meaningful so that it’s effective to retrieve relevant pieces of information. Here are some optimization techniques:
- Chunk size: smaller chunks mean less noise but also less context; larger chunks mean more context but also more noise. Developers need to strike a balance.
- Document hierarchies: provide a file-directory-like structure that organizes the chunks of information and helps you locate the most relevant ones with speed and repeatability. For example, if we want a summary of only the medications newly added to pharmacies in California after 2022, we can leverage contextual information such as the year to locate the relevant chunks.
Vectorize document chunks -> embeddings + document chunk ID
Store embeddings in vector database
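To make Step 1 concrete, here is a minimal indexing sketch in Python. It is only a sketch under stated assumptions: it uses the sentence-transformers package with an arbitrary model choice (all-MiniLM-L6-v2), an illustrative character-based chunker with made-up chunk size and overlap values, and a plain in-memory NumPy array as a stand-in for a real vector database.

```python
# Minimal indexing sketch (Step 1): load -> chunk -> embed -> store.
# Assumptions: sentence-transformers is installed; the model name, chunk size,
# overlap, and the dict-based "vector database" are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks (balancing noise vs. context)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Load documents (plain strings here; real loaders would parse PDFs, HTML, etc.)
documents = [
    "Nifflers are attracted to anything shiny and hoard it in their pouches...",
    "Care of Magical Creatures exam notes...",
]

# Chunk documents and keep a document chunk ID for each chunk
chunks, chunk_ids = [], []
for doc_id, doc in enumerate(documents):
    for i, chunk in enumerate(chunk_document(doc)):
        chunks.append(chunk)
        chunk_ids.append(f"doc{doc_id}-chunk{i}")

# Vectorize document chunks -> embeddings + chunk IDs, then store them
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.array(model.encode(chunks, normalize_embeddings=True))
vector_store = {"ids": chunk_ids, "embeddings": embeddings, "texts": chunks}
```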
Step 2: Retrieve relevant information
Vectorize the user query to create an embedding
Retrieve relevant document chunk IDs
Retrieve document chunks from vector database by document chunk IDs
Pass the user query + the relevant document chunks to the LLM to generate the response
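Continuing the indexing sketch above, a retrieval step could look like the following. It reuses the model and vector_store objects from the Step 1 sketch; because the embeddings were normalized, cosine similarity reduces to a dot product, and the top-k default is arbitrary.

```python
# Retrieval sketch (Step 2): embed the query, score every chunk, return top-k.
# Assumes `model` and `vector_store` from the Step 1 sketch are in scope.
import numpy as np

def retrieve(query: str, k: int = 3) -> list[dict]:
    query_emb = model.encode([query], normalize_embeddings=True)[0]  # vectorize user query
    scores = vector_store["embeddings"] @ query_emb                  # cosine similarity (normalized vectors)
    top_idx = np.argsort(scores)[::-1][:k]                           # indices of the top-k chunks
    return [
        {"id": vector_store["ids"][i],
         "text": vector_store["texts"][i],
         "score": float(scores[i])}
        for i in top_idx
    ]

retrieved_chunks = retrieve("What does a Niffler like?")
```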
Step 3: Generate response from LLM
Optimize prompt
Generate response with fine-tuned LLM
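For Step 3, the sketch below assembles a grounded prompt from the user query and the retrieved chunks. The prompt template is just one possible phrasing, and call_llm is a hypothetical placeholder for whichever LLM client (hosted or fine-tuned) you plug in; the point is the structure, not a specific API.

```python
# Generation sketch (Step 3): build a grounded prompt, then call an LLM.
# Assumes `retrieved_chunks` from the Step 2 sketch; `call_llm` is a hypothetical stub.

PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client of choice here.")

prompt = build_prompt("What does a Niffler like?", retrieved_chunks)
# response = call_llm(prompt)
```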
Challenges & Solutions
We have laid out the basic implementation of a RAG system. Here, we discuss the challenges involved in making the system more advanced and their corresponding solutions.
- Bad retrieval
- Low precision: not all retrieved chunks are relevant, which leads to hallucination and "lost in the middle" problems.
- Low recall: not all relevant chunks are retrieved, so the LLM lacks enough context to synthesize an answer.
- Obsolete information: the data is out of date or redundant.
- Bad response
- Hallucination: the LLM makes up an answer that isn't in the context.
- Irrelevance: the LLM generates an answer that doesn't address the question.
- Toxicity/bias: the LLM generates an answer that's harmful or offensive.
- Data: Can we store additional information beyond raw text chunks?
- Embeddings: Can we optimize our embedding representations?
- Retrieval:
- Challenge: Can we do better than top-k embedding lookup, given that multiple, disparate document sections can be relevant to a given question?
- Solution: Allow a varying number of retrieved chunks based on relevance scores (see the sketch after this list). Moreover, we can incorporate subject-matter expert annotations and the logic used for the training set (Snorkel AI - Enterprise LLM Summit, 2024).
- Synthesis: Can we use LLMs for more than generation? For example, multiple, disparate document sections can be relevant to a given question.
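As a sketch of the varying-k idea above: instead of always returning a fixed top-k, keep every chunk whose similarity clears a threshold, up to a cap. It reuses model and vector_store from the indexing sketch, and the threshold and cap values are assumptions to tune on your own data.

```python
# Adaptive retrieval sketch: keep chunks above a relevance threshold instead of a fixed top-k.
# Assumes `model` and `vector_store` from the indexing sketch; threshold/max_k are tunable guesses.
import numpy as np

def retrieve_adaptive(query: str, threshold: float = 0.45, max_k: int = 10) -> list[dict]:
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = vector_store["embeddings"] @ query_emb
    ranked = np.argsort(scores)[::-1]                        # best-first chunk indices
    selected = [i for i in ranked[:max_k] if scores[i] >= threshold]
    return [
        {"id": vector_store["ids"][i],
         "text": vector_store["texts"][i],
         "score": float(scores[i])}
        for i in selected
    ]
```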
Evaluation
- How do we properly evaluate a RAG system?
- Evaluate end-to-end (E2E)
- Evaluate the final generated response given input
- Create dataset
- Input: query
- [Optional] Output: the "ground-truth" answer
- Run through full RAG pipeline
- Collect evaluation metrics
- If no labels: label-free evaluator
- Faithfulness
- Relevancy
- Adheres to guidelines
- Toxicity-free
- If labels: with-label evaluator
- Accuracy
- Evaluate the separate parts (retrieval, synthesis) to diagnose which part needs improvement; a small metrics sketch follows this list.
- Retrieval - retriever evaluator (MRR, precision@k, NDCG)
- Evaluate the quality of retrieved chunks given user query
- Create dataset (human-labeled or synthetic)
- Input: query
- Output: the “ground-truth” documents relevant to the query
- Run retriever over dataset
- Measure ranking metrics
- Success rate / hit-rate
- MRR
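To ground the evaluation ideas above, here is a small sketch that computes hit rate and MRR for the retriever and a simple with-label accuracy check end-to-end. It reuses retrieve and build_prompt from the earlier sketches, and the tiny datasets and their labels are hypothetical; label-free metrics such as faithfulness typically rely on an LLM judge, which is omitted here.

```python
# Evaluation sketch: retriever metrics (hit rate, MRR) and a simple with-label E2E accuracy.
# Assumes `retrieve` and `build_prompt` from the earlier sketches; datasets and labels are hypothetical.

# Retriever evaluation: query -> ground-truth relevant chunk IDs
retrieval_dataset = [
    {"query": "What does a Niffler like?", "relevant_ids": {"doc0-chunk0"}},
]

def hit_rate_and_mrr(dataset, k: int = 5) -> tuple[float, float]:
    hits, reciprocal_ranks = 0, []
    for example in dataset:
        retrieved_ids = [c["id"] for c in retrieve(example["query"], k=k)]
        ranks = [r for r, cid in enumerate(retrieved_ids, start=1) if cid in example["relevant_ids"]]
        hits += 1 if ranks else 0
        reciprocal_ranks.append(1.0 / ranks[0] if ranks else 0.0)
    return hits / len(dataset), sum(reciprocal_ranks) / len(dataset)

# End-to-end (with-label) evaluation: query -> ground-truth answer
e2e_dataset = [
    {"query": "What does a Niffler like?", "answer": "shiny objects"},
]

def e2e_accuracy(dataset) -> float:
    correct = 0
    for example in dataset:
        prompt = build_prompt(example["query"], retrieve(example["query"]))
        response = ""  # placeholder: replace with your LLM call, e.g. call_llm(prompt)
        correct += int(example["answer"].lower() in response.lower())
    return correct / len(dataset)
```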
Discussion
Why are we excited about RAG? By orchestrating a retriever and a generator and grounding the output in factual data, RAG reduces hallucination. For business-specific tasks, running RAG on domain-specific documents generates responses of higher quality.
Reference
Some of the development techniques we’ll cover are sourced from public resources, including
- LlamaIndex's talk on how to build production-ready RAG apps
- LangChain explained in 13 mins
- Medium blog: first intro to complex RAG
- MongoDB Vector Search
Projects
WIP, stay tuned.