12.4 Prompt Compression & Context Optimization

Making Prompts Leaner and More Efficient

As LLM applications grow more complex, the context window (the token budget shared by your prompt, instructions, examples, and the model's response) becomes a scarce resource. Prompt compression and context optimization are techniques for reducing the number of tokens sent to the model while preserving the information it needs to perform the task.

This is crucial for two main reasons:

Cost Reduction: Most LLM APIs charge based on the number of input and output tokens. Fewer tokens mean lower costs.
Performance: Shorter prompts can be processed faster, and they leave more room in the context window for the model to generate a detailed response or handle longer conversations.
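
To make the cost argument concrete, here is a minimal sketch that compares a verbose prompt with a compressed one. It rests on two assumptions that are not part of any real API: `estimate_tokens` uses the rough "about four characters per token" rule of thumb for English text instead of a real tokenizer, and `PRICE_PER_MILLION` is a hypothetical input price. Substitute your provider's tokenizer and pricing for real measurements.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate only: assumes ~4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def input_cost(text: str, price_per_million_tokens: float) -> float:
    """Estimated input cost in the same currency as the price."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000


verbose = (
    "I would like you to please carefully read the following customer review "
    "and then, after you have read it, tell me whether the overall sentiment "
    "is positive, negative, or neutral. Please respond with only one word."
)
compressed = "Classify the review's sentiment: positive, negative, or neutral. One word."

PRICE_PER_MILLION = 3.00  # hypothetical USD price per million input tokens

saved_tokens = estimate_tokens(verbose) - estimate_tokens(compressed)
saved_cost = input_cost(verbose, PRICE_PER_MILLION) - input_cost(compressed, PRICE_PER_MILLION)

if __name__ == "__main__":
    print(f"Estimated tokens saved per request: {saved_tokens}")
    print(f"Estimated cost saved per request: ${saved_cost:.8f}")
```

The saving on a single request is tiny. It becomes significant when the same prompt template is sent thousands or millions of times.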

Common Techniques:

Instruction Pruning: Removing redundant words, examples, or instructions that the model already understands well.
Summarization: For long conversation histories or large documents, using another LLM call to summarize the text before including it in the main prompt.
Selective Context: Instead of passing the entire conversation history, use a retrieval mechanism (like vector search) to find only the most relevant past messages or documents.
Token-Efficient Formatting: Serializing structured data compactly, for example JSON without indentation or padding whitespace, reduces the token count without changing the content.
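
Two of the techniques above can be sketched with the Python standard library alone. `compact_json` drops the whitespace that `json.dumps` adds by default. `select_context` is a deliberately simple stand-in for vector search: it ranks past messages by bag-of-words cosine similarity to the current query, where a production system would more likely use learned embeddings and a vector index. Both function names and the sample conversation are illustrative assumptions.

```python
import json
import math
import re
from collections import Counter


def compact_json(data) -> str:
    """Serialize without the spaces json.dumps inserts after ',' and ':' by default."""
    return json.dumps(data, separators=(",", ":"))


def _bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_context(query: str, messages: list[str], k: int = 2) -> list[str]:
    """Return the k messages most similar to the query, most similar first."""
    q = _bag_of_words(query)
    ranked = sorted(messages, key=lambda m: _cosine(q, _bag_of_words(m)), reverse=True)
    return ranked[:k]


history = [
    "My order number is 48213 and it has not arrived.",
    "By the way, I love the new website design.",
    "Can you recommend a good laptop bag?",
    "The order was supposed to be delivered last Friday.",
]

if __name__ == "__main__":
    record = {"user": "alice", "items": [1, 2, 3], "priority": "high"}
    print(len(json.dumps(record, indent=2)), "chars with indentation")
    print(len(compact_json(record)), "chars compact")
    for message in select_context("Where is my order?", history, k=2):
        print(message)
```

Only the selected messages are placed in the prompt, so the unrelated turns about the website design and the laptop bag never consume tokens.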

