
Chapter 10.2: Chunking & Index Construction

When working with large documents or knowledge bases, one of the most critical decisions is how to break the content into manageable pieces for embedding and retrieval. Chunking is the process of splitting documents into smaller, semantically meaningful segments, while index construction organizes these chunks for efficient retrieval. Chunk quality directly affects a retrieval-augmented generation (RAG) system, because the retrieved chunks become the context the LLM sees when it generates a response.

The Chunking Challenge

Raw documents vary dramatically in length, from short tweets to lengthy research papers. LLMs and embedding models have input token limits, so large documents must be split. However, naive splitting (for example, cutting every fixed number of characters) can break semantic boundaries, producing incomplete or out-of-context chunks that hurt retrieval quality.

  • Too Small: Chunks lack sufficient context and may miss important relationships.
  • Too Large: Chunks contain too much irrelevant information, diluting semantic focus.
  • Poor Boundaries: Splitting mid-sentence or mid-concept leads to incoherent fragments.
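
One way to balance these three failure modes is to pack whole sentences into chunks up to a size budget and repeat a little text across chunk boundaries. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the `max_words` and `overlap_sentences` parameters, and the regex-based sentence splitter are all illustrative choices, and word count stands in for a real tokenizer's token count.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_count(sentences: list[str]) -> int:
    """Total number of whitespace-separated words across the given sentences."""
    return sum(len(s.split()) for s in sentences)


def chunk_sentences(text: str, max_words: int = 50,
                    overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    The last `overlap_sentences` sentences of a finished chunk are carried
    into the next chunk when they fit, so context spans the boundary.
    A single sentence longer than max_words becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and word_count(current) + n > max_words:
            # Close the current chunk at a sentence boundary.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop carried-over sentences that would exceed the budget.
            while current and word_count(current) + n > max_words:
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "A b c. D e f. G h i. J k l."
for chunk in chunk_sentences(text, max_words=6, overlap_sentences=1):
    print(chunk)
```

Because chunks only end at sentence boundaries, no fragment is cut mid-sentence; the size budget guards against oversized chunks, and the overlap gives small chunks some neighboring context. A production system would likely swap in a proper sentence segmenter and count tokens with the embedding model's own tokenizer.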

Chunking Strategies

Mathematical Model of Retrieval

Long-term memory retrieval is often framed as a search problem. When a new query Q arrives, the agent converts it into a vector embedding v_Q. It then searches the memory (a collection of stored text-embedding pairs (text_i, v_i)) to find the most relevant information.

The relevance is typically calculated using a similarity function, such as cosine similarity:

Similarity(v_Q, v_i) = (v_Q ⋅ v_i) / (||v_Q|| * ||v_i||)

The agent retrieves the top-k texts corresponding to the highest similarity scores and includes them in the context provided to the LLM to generate a response.
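
The retrieval step described above can be sketched in a few lines. This is an illustrative example only: the two-dimensional vectors are hand-written stand-ins for real embeddings, the `memory` list and the `top_k` helper are hypothetical names, and a brute-force scan replaces the approximate nearest-neighbor index a vector database would normally use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """(a . b) / (||a|| * ||b||); returns 0.0 for a zero vector (a convention chosen here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          memory: list[tuple[str, list[float]]],
          k: int = 2) -> list[tuple[float, str]]:
    """Score every stored (text_i, v_i) pair against v_Q and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy memory: hand-made vectors, not output from a real embedding model.
memory = [
    ("chunk about overlap", [1.0, 0.0]),
    ("chunk about indexing", [1.0, 1.0]),
    ("unrelated chunk", [0.0, 1.0]),
]

v_q = [1.0, 0.0]
results = top_k(v_q, memory, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With these toy vectors, the first stored vector points in the same direction as the query (similarity 1), the second is at 45 degrees (similarity 1/√2 ≈ 0.707), and the third is orthogonal (similarity 0), so the first two texts are returned for k = 2.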

Visualization: Memory Retrieval Process

The visualization below demonstrates how an agent uses both short-term and long-term memory. The query is first processed within the short-term context. If needed, the agent queries the long-term memory (vector database) to retrieve relevant past information before generating the final answer.