
Chapter 9.3: Retrieval-Augmented Agents

Standard LLMs rely solely on their internal, parametric knowledge learned during training. This knowledge can become outdated, and it does not include private or real-time information. Retrieval-Augmented Generation (RAG) addresses this by having the model first retrieve relevant information from an external knowledge source and then generate a response conditioned on it. Grounding answers in retrieved text lets an agent draw on current or private data and tends to make its answers more accurate and easier to verify.

The RAG Process: A Mathematical View

The core idea of RAG is to augment the model's prompt with relevant external data. The process involves two main stages: Retrieval and Generation.

  1. Retrieval:

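    Given a query q, we first encode it into a high-dimensional vector v_q = Encoder(q). Each document chunk d is embedded ahead of time, typically with the same encoder, as v_d = Encoder(d) and stored in a vector database (or index). We then use v_q to search that index.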

    Similarity Search: Score(v_q, v_d) = cos(v_q, v_d) = (v_q · v_d) / (||v_q|| ||v_d||)

    We retrieve the top-k documents {d₁, d₂, ..., dₖ} that have the highest cosine similarity score with the query vector.

  2. Generation:

    The retrieved documents are then concatenated with the original query to form an augmented prompt. This prompt is fed to the LLM to generate the final answer a.

    Augmented Prompt = "Context: [d₁, d₂, ..., dₖ] \n\n Query: [q]"
    Answer a = LLM(Augmented Prompt)
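
The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `encode` is a toy bag-of-words stand-in for Encoder(·), the three documents and the helper names (`encode`, `cosine`, `retrieve`, `build_prompt`) are hypothetical, and the final LLM call is left out because it depends on whichever model provider is used.

```python
import math
from collections import Counter


def encode(text, vocab):
    """Toy stand-in for Encoder(.): word counts over a fixed vocabulary.
    A real system would use a learned embedding model instead."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(u, v):
    """Score(v_q, v_d) = (v_q . v_d) / (||v_q|| ||v_d||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # define the score as 0 for a zero vector


def retrieve(query, docs, vocab, k=2):
    """Return the k documents with the highest cosine similarity to the query."""
    v_q = encode(query, vocab)
    scored = [(cosine(v_q, encode(d, vocab)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(query, retrieved):
    """Assemble the augmented prompt in the 'Context ... Query ...' layout."""
    context = "\n".join(retrieved)
    return f"Context: {context}\n\nQuery: {query}"


# Hypothetical mini knowledge base, used only for illustration.
docs = [
    "the eiffel tower is in paris",
    "python is a programming language",
    "paris is the capital of france",
]
vocab = sorted({word for d in docs for word in d.split()})

query = "where is the eiffel tower"
top = retrieve(query, docs, vocab, k=2)
prompt = build_prompt(query, top)
print(prompt)  # this string would then be passed to the LLM
```

With this toy encoder, the query shares the most words with the first document, so that document ranks highest, while the unrelated programming sentence falls outside the top 2. In practice the exhaustive loop in `retrieve` is usually replaced by an approximate nearest-neighbor index, since scoring every chunk does not scale to large collections.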

Visualization: The RAG Pipeline

The D3.js visualization below demonstrates the RAG flow. A user query is encoded and used to find relevant documents in a vector space. These documents are then passed to the LLM to generate an informed answer.