13.4 Context Window Extensions

Language models have a fixed context window: the maximum number of tokens, prompt and generated output combined, that they can attend to when producing a response. This limit is a major bottleneck for tasks that require long-range consistency or an understanding of extensive documents. Models with larger context windows continue to be developed (e.g., up to 1 million tokens), but longer contexts come with increased computational cost and latency.

Therefore, various techniques have been developed to "extend" the effective context window without modifying the model's architecture. These methods focus on intelligently managing the information that is fed into the fixed-size window.

Techniques for Extending Context

1. Sliding Windows

This is a simple but effective technique for processing long sequences. The context window "slides" over the document. To maintain continuity, a portion of the context from the previous window is carried over to the next.

Example: For a 4k (4,096-token) window processing a 10k-token document, the first chunk is tokens 0-4095. The next chunk might be tokens 3072-7167, which carries over the last 1,024 tokens of the previous window as overlap, and a third chunk (tokens 6144-9999) covers the remainder. A summary of the processed chunk can also be prepended to the next one.
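
The chunk boundaries in the example above can be computed with a few lines of code. The following is a minimal sketch, not a reference implementation: the function name sliding_windows and its parameters are illustrative, and it works on token indices only, assuming the document has already been tokenized.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int = 4096, overlap: int = 1024) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) token index ranges that cover n_tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap  # how far the window advances each step
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)  # exclusive end, clipped to the document
        spans.append((start, end - 1))
        if end == n_tokens:
            break  # the last window reached the end of the document
        start += stride
    return spans


print(sliding_windows(10_000))
# [(0, 4095), (3072, 7167), (6144, 9999)]
```

A larger overlap preserves more continuity between neighbouring windows but shrinks the stride, so more windows (and more repeated tokens) are needed to cover the same document.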

2. Recurrence and Summarization

This approach loosely mimics how recurrent neural networks (RNNs) carry a hidden state from one step to the next. After processing a chunk of text, the model generates a summary of that chunk. This summary is then included at the beginning of the context for the next chunk.

Example: Process tokens 0-4095. Generate a summary: "The first chapter introduces the main character, Alex." Then, for the next chunk, the prompt becomes: "[Summary: The first chapter introduces the main character, Alex.] [Content of second chunk...]". The summary acts as state that is carried from one chunk to the next.
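
The loop behind this pattern can be sketched as follows. This is an illustration under stated assumptions: build_prompts and toy_summarize are hypothetical names, and toy_summarize is a stand-in that keeps the first sentence of each chunk, where a real system would call the model to produce the summary.

```python
from typing import Callable, List


def build_prompts(chunks: List[str], summarize: Callable[[str, str], str]) -> List[str]:
    """Build one prompt per chunk, prefixing each with the running summary."""
    summary = ""  # the carried "state"; empty before the first chunk
    prompts = []
    for chunk in chunks:
        prompt = f"[Summary: {summary}] {chunk}" if summary else chunk
        prompts.append(prompt)
        # Update the state from the previous summary and the chunk just processed.
        summary = summarize(summary, chunk)
    return prompts


def toy_summarize(previous: str, chunk: str) -> str:
    """Hypothetical stand-in for an LLM call: append the chunk's first sentence."""
    first_sentence = chunk.split(".")[0].strip() + "."
    return f"{previous} {first_sentence}".strip()


chunks = [
    "Chapter one introduces Alex. He lives in Oslo.",
    "Chapter two follows Alex to Bergen. It rains.",
    "Chapter three ends the trip.",
]
for prompt in build_prompts(chunks, toy_summarize):
    print(prompt)
```

Note that the toy summary grows with every chunk. In practice the summarizer has to compress, so that the running summary stays within a fixed token budget, and that compression is where information can be lost.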

3. External Memory Tools (RAG)

Retrieval-augmented generation (RAG) is a widely used pattern and often the most scalable of the three. Instead of trying to fit all of the source material into the prompt, the material is stored in an external store, typically a vector database. The system then retrieves only the chunks most relevant to the current query.

Process:
  1. A large document is chunked and embedded into a vector store.
  2. When a user asks a question, the question is used to retrieve the most relevant chunks from the store.
  3. These retrieved chunks are then placed into the context window along with the question.

This approach gives the model access to a corpus far larger than its context window. In practice it is limited by the quality of the retrieval step and by how many retrieved chunks fit into the window at once.
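
The three steps of the process can be sketched end to end with the standard library alone. This is a toy illustration under loud assumptions: embed here is a plain word-count vector, standing in for a learned embedding model, and the in-memory list stands in for a vector database. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption: replaces a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(chunks: List[str]) -> List[Tuple[str, Counter]]:
    """Step 1: embed every chunk once and keep it alongside its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store: List[Tuple[str, Counter]], question: str, k: int = 2) -> List[str]:
    """Step 2: rank stored chunks by similarity to the question and keep the top k."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(store: List[Tuple[str, Counter]], question: str, k: int = 2) -> str:
    """Step 3: place the retrieved chunks in the prompt together with the question."""
    context = "\n".join(retrieve(store, question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = build_store([
    "Alex grew up in Oslo.",
    "The ferry to Bergen leaves at noon.",
    "Sliding windows overlap by 1024 tokens.",
])
print(build_prompt(store, "When does the ferry leave?", k=1))
```

Only the k retrieved chunks consume context tokens, regardless of how large the store is. The choice of k and of the chunk size therefore decides how much of the window is spent on retrieved material.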

Trade-offs and Considerations

Technique: Sliding Window
  Pros: Simple to implement. Retains local context well.
  Cons: Can lose long-range dependencies. Overlap adds computational overhead.

Technique: Recurrence/Summarization
  Pros: Good for maintaining a running "state" of the conversation or document.
  Cons: Summarization can lead to information loss. Adds latency due to extra LLM calls.

Technique: External Memory (RAG)
  Pros: Highly scalable to massive documents. Very efficient as only relevant context is processed.
  Cons: Depends heavily on the quality of the retrieval system. More complex to set up.