12.4 Prompt Compression & Context Optimization
Making Prompts Leaner and More Efficient
As LLM applications become more complex, the context window (the token budget shared by your prompt, instructions, examples, and the model's response) becomes a valuable and limited resource. Prompt compression and context optimization are techniques for reducing the number of tokens sent to the model while preserving the information it needs to perform the task.
This is crucial for two main reasons:
- Cost Reduction: Most LLM APIs charge based on the number of input and output tokens. Fewer tokens mean lower costs.
- Performance: Shorter prompts can be processed faster, and they leave more room in the context window for the model to generate a detailed response or handle longer conversations.
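To make the cost argument concrete, here is a minimal sketch of the arithmetic. The function names and the per-million-token prices are hypothetical placeholders, not any provider's actual rates; substitute the figures from your own provider's pricing page.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost of one request, given prices per one million tokens.

    The prices are supplied by the caller; nothing here reflects a real rate card.
    """
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def savings_from_compression(original_input_tokens, compressed_input_tokens,
                             output_tokens, requests,
                             input_price_per_million, output_price_per_million):
    """Money saved over `requests` calls when only the input side shrinks."""
    before = estimate_cost(original_input_tokens, output_tokens,
                           input_price_per_million, output_price_per_million)
    after = estimate_cost(compressed_input_tokens, output_tokens,
                          input_price_per_million, output_price_per_million)
    return (before - after) * requests


if __name__ == "__main__":
    # Hypothetical workload: a 1,200-token prompt compressed to 400 tokens,
    # 300 output tokens per call, 10,000 calls, made-up prices of 2.0 and 8.0.
    saved = savings_from_compression(1200, 400, 300, 10_000, 2.0, 8.0)
    print(f"Estimated savings: {saved:.2f}")
```

Because the output token count is unchanged, the savings scale linearly with the number of input tokens removed and with the number of requests.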
Common Techniques:
- Instruction Pruning: Removing redundant words, examples, or instructions that the model already understands well.
- Summarization: For long conversation histories or large documents, using another LLM call to summarize the text before including it in the main prompt.
- Selective Context: Using a retrieval mechanism (like vector search) to pass only the most relevant past messages or documents instead of the entire conversation history.
- Token-Efficient Formatting: Serializing structured data compactly, for example JSON without indentation or extra whitespace, which reduces the token count without changing the content.
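Two of the techniques above, token-efficient formatting and selective context, can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than a production approach: `rough_token_count` uses a crude "about 4 characters per token" heuristic instead of a real tokenizer, and `select_relevant` ranks messages by simple word overlap as a toy stand-in for vector search. All function names are hypothetical.

```python
import json
import re


def rough_token_count(text: str) -> int:
    """Crude estimate (assumption: ~4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def compact_json(data) -> str:
    """Serialize without the spaces json.dumps inserts after ',' and ':' by default."""
    return json.dumps(data, separators=(",", ":"))


def _words(text: str) -> set:
    """Lowercased word set with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def select_relevant(messages, query, k=2):
    """Keep the k messages that share the most words with the query.

    Word overlap is a toy relevance score; a real system would likely
    use embeddings and a vector index instead.
    """
    query_words = _words(query)
    ranked = sorted(messages,
                    key=lambda m: len(query_words & _words(m)),
                    reverse=True)  # stable sort: ties keep original order
    return ranked[:k]


if __name__ == "__main__":
    record = {"user": "alice", "items": [1, 2, 3], "active": True}
    pretty = json.dumps(record, indent=2)
    compact = compact_json(record)
    print(rough_token_count(pretty), "vs", rough_token_count(compact))

    history = [
        "The user likes hiking in the Alps.",
        "Order 1234 shipped on Monday.",
        "The user asked about the refund policy for order 1234.",
        "The weather was nice last week.",
    ]
    for message in select_relevant(history, "When will order 1234 arrive?"):
        print(message)
```

The compact form carries exactly the same data as the indented form, so the saving comes purely from formatting. For accurate counts, measure with the tokenizer of the model you actually call, since token boundaries differ between models.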
Use the interactive demo below. Click "Compress Prompt" to see how a verbose prompt can be optimized, and observe the token savings in the chart.