5.3 Tool Error Handling & Retries
Guardrails, tool feedback loops, self-correction prompts
Overview
When AI agents interact with external tools and APIs, failures are inevitable. Network timeouts, API rate limits, malformed requests, and unexpected responses are all common, and each calls for a different response. This section covers retry strategies, self-correction prompts, guardrails, and feedback loops for building resilient tool-calling systems.
Key Error Handling Principles
- Graceful Degradation: When a tool fails, return a partial result or a meaningful error instead of crashing
- Intelligent Retries: Use exponential backoff and circuit breakers
- Self-Correction: Feed error details back to the agent so it can adjust its next call
- User Transparency: Communicate errors clearly to users
Common Tool Error Categories
Network & Connectivity
- Connection timeouts
- DNS resolution failures
- SSL/TLS certificate issues
- Proxy/firewall blocks
API Rate Limiting
- Requests-per-minute limits
- Concurrent connection limits
- Quota exhaustion
- Token bucket depletion
Input Validation
- Malformed JSON/XML
- Missing required parameters
- Invalid data types
- Schema violations
Service Errors
- Internal server errors (5xx)
- Authentication failures (401/403)
- Resource not found (404)
- Service unavailable (503)
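The categories above become useful once code can branch on them. The sketch below shows one possible mapping from exceptions to category names. It is not a standard: the category strings, the `ToolHTTPError` class, and the `status_code` attribute lookup are all assumptions made for illustration.

```python
import json
import socket
import ssl


class ToolHTTPError(Exception):
    """Hypothetical error carrying an HTTP status code from a tool call."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def classify_error(error: Exception) -> str:
    """Map an exception to one of the categories listed above (assumed names)."""
    status = getattr(error, "status_code", None)
    if status is not None:
        if status == 429:
            return "rate_limit"
        if status in (401, 403):
            return "auth"
        if status == 404:
            return "not_found"
        if status in (400, 422):
            return "validation"
        if 500 <= status < 600:
            return "service"
    # Timeouts, DNS failures, TLS problems, and dropped connections.
    network_errors = (TimeoutError, socket.timeout, socket.gaierror,
                      ssl.SSLError, ConnectionError)
    if isinstance(error, network_errors):
        return "network"
    # Malformed JSON, wrong types, and missing fields.
    if isinstance(error, (json.JSONDecodeError, ValueError, TypeError, KeyError)):
        return "validation"
    return "unknown"
```

Returning a plain string keeps the classifier easy to log and to branch on. A real system would likely extend the mapping for the specific client library it uses.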
Retry Strategies & Patterns
1. Exponential Backoff with Jitter
import random
import time
from typing import Any, Callable

def exponential_backoff_retry(
    func: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
) -> Any:
    """Retry func with exponential backoff and optional jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            # Out of retries: re-raise with the original traceback
            if attempt == max_retries:
                raise
            # Calculate delay with exponential backoff, capped at max_delay
            delay = min(base_delay * (exponential_base ** attempt), max_delay)
            # Add jitter (50-100% of the delay) to prevent thundering herd
            if jitter:
                delay *= (0.5 + random.random() * 0.5)
            time.sleep(delay)
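To see what these parameters mean in practice, it helps to compute the delay schedule before jitter is applied. The helper below is a small illustration only: `backoff_schedule` is a hypothetical name, and it reuses the same formula and defaults as the function above.

```python
def backoff_schedule(max_retries=3, base_delay=1.0, max_delay=60.0,
                     exponential_base=2.0):
    """Return the un-jittered delay (in seconds) slept before each retry."""
    return [
        min(base_delay * exponential_base ** attempt, max_delay)
        for attempt in range(max_retries)
    ]


print(backoff_schedule())               # [1.0, 2.0, 4.0]
print(backoff_schedule(max_retries=8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With jitter enabled, each actual sleep falls between 50% and 100% of the listed value, so the default configuration waits at most 7 seconds in total before giving up.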
2. Circuit Breaker Pattern
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func):
        if self.state == "open":
            # After the recovery timeout, let one trial call through
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("Circuit breaker is open")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial call reopens the circuit immediately
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        # Any success closes the circuit and resets the consecutive-failure count
        self.state = "closed"
        self.failure_count = 0
        return result
Self-Correction Prompts
When tools fail, agents can often self-correct by analyzing the error and adjusting their approach:
SELF_CORRECTION_PROMPT = """
You attempted to call a tool but received an error. Here's what happened:
Tool Called: {tool_name}
Parameters: {parameters}
Error: {error_message}
Please analyze this error and either:
1. Correct the parameters and try again
2. Use an alternative tool or approach
3. Explain why the task cannot be completed
Guidelines:
- Check parameter types and formats
- Verify required fields are present
- Consider alternative tools if this one is unavailable
- Be specific about what went wrong and how you're fixing it
Your response:
"""
Contextual Error Recovery
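def handle_tool_error(tool_name, params, error, context):
    """Generate a context-aware error recovery prompt."""
    # classify_error, get_required_fields, get_expected_types, get_format_examples,
    # and suggest_alternative_tools are helpers assumed to be defined elsewhere.
    error_type = classify_error(error)
    if error_type == "validation":
        prompt = f"""
        Parameter validation failed for {tool_name}.
        Parameters sent: {params}
        Error: {error}
        Please check these common issues:
        - Required fields: {get_required_fields(tool_name)}
        - Data types: {get_expected_types(tool_name)}
        - Format examples: {get_format_examples(tool_name)}
        Correct the parameters and try again.
        """
    elif error_type == "rate_limit":
        prompt = f"""
        Rate limit exceeded for {tool_name}.
        Wait before retrying or use an alternative approach.
        Alternatives:
        {suggest_alternative_tools(tool_name, context)}
        """
    else:
        # Fallback so that prompt is always defined for other error types
        prompt = f"""
        The call to {tool_name} failed.
        Error: {error}
        Use an alternative approach or explain why the task cannot be completed.
        """
    return prompt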
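Recovery prompts like the ones above only help inside a loop that sends the error back to the model and retries with its revised parameters. The following is a minimal sketch of such a loop, not a definitive implementation. `run_with_self_correction`, `ask_model`, `get_weather`, and `fake_model` are hypothetical names, and `ask_model` stands in for a real LLM call that returns a corrected parameter dict.

```python
import json
from typing import Any, Callable, Dict

CORRECTION_PROMPT = (
    "Tool Called: {tool_name}\n"
    "Parameters: {parameters}\n"
    "Error: {error_message}\n"
    "Correct the parameters and try again."
)


def run_with_self_correction(
    tool: Callable[..., Any],
    tool_name: str,
    params: Dict[str, Any],
    ask_model: Callable[[str], Dict[str, Any]],
    max_corrections: int = 2,
) -> Any:
    """Call the tool; on failure, ask the model for revised parameters and retry."""
    for attempt in range(max_corrections + 1):
        try:
            return tool(**params)
        except Exception as error:
            # Give up once the correction budget is spent.
            if attempt == max_corrections:
                raise
            prompt = CORRECTION_PROMPT.format(
                tool_name=tool_name,
                parameters=json.dumps(params, default=str),
                error_message=str(error),
            )
            params = ask_model(prompt)


def get_weather(city: str, units: str = "metric") -> str:
    """Toy tool that rejects unknown units."""
    if units not in ("metric", "imperial"):
        raise ValueError(f"units must be 'metric' or 'imperial', got {units!r}")
    return f"{city}: 21 degrees ({units})"


def fake_model(prompt: str) -> Dict[str, Any]:
    """Stand-in for an LLM call: always returns a corrected parameter set."""
    return {"city": "Oslo", "units": "metric"}


result = run_with_self_correction(
    get_weather, "get_weather", {"city": "Oslo", "units": "celsius"}, fake_model
)
print(result)  # Oslo: 21 degrees (metric)
```

Capping the number of corrections matters: without a limit, an agent that keeps producing the same invalid call would loop indefinitely.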
Implementing Guardrails
1. Input Validation Guardrails
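class SecurityError(Exception):
    """Raised when parameters contain sensitive data."""

class RateLimitError(Exception):
    """Raised when a tool's rate limit is exceeded."""

class ContentError(Exception):
    """Raised when tool output fails content filtering."""

class ToolGuardrails:
    # get_tool_schema, validate_schema, contains_sensitive_data, check_rate_limit,
    # sanitize_output, and contains_harmful_content are left for the implementer.
    def validate_input(self, tool_name, params):
        # Schema validation
        schema = self.get_tool_schema(tool_name)
        if not self.validate_schema(params, schema):
            raise ValueError(f"Invalid parameters for {tool_name}")
        # Security checks
        if self.contains_sensitive_data(params):
            raise SecurityError("Sensitive data detected")
        # Rate limiting
        if not self.check_rate_limit(tool_name):
            raise RateLimitError("Rate limit exceeded")

    def validate_output(self, tool_name, result):
        # Output sanitization
        sanitized = self.sanitize_output(result)
        # Content filtering
        if self.contains_harmful_content(sanitized):
            raise ContentError("Harmful content detected")
        return sanitized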
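The schema validation step above is left abstract. One possible shape for it is sketched below, using a deliberately simple schema format (field name mapped to an expected type and a required flag). The format, `SEARCH_SCHEMA`, and `validate_params` are assumptions for illustration; a production system would more likely use JSON Schema or a validation library.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema format: field name -> (expected type, required flag).
SEARCH_SCHEMA: Dict[str, Tuple[type, bool]] = {
    "query": (str, True),
    "max_results": (int, False),
}


def validate_params(params: Dict[str, Any],
                    schema: Dict[str, Tuple[type, bool]]) -> List[str]:
    """Return a list of readable problems; an empty list means the input is valid."""
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in params:
            if required:
                problems.append(f"missing required parameter '{name}'")
            continue
        value = params[name]
        # bool is a subclass of int, so reject it explicitly for integer fields.
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(
                f"'{name}' should be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    for name in params:
        if name not in schema:
            problems.append(f"unexpected parameter '{name}'")
    return problems


print(validate_params({"query": "retry patterns", "max_results": 5}, SEARCH_SCHEMA))
print(validate_params({"max_results": "5"}, SEARCH_SCHEMA))
```

Returning every problem at once, rather than raising on the first, gives a self-correction prompt more to work with in a single round trip.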
2. Timeout Management
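import asyncio

class ToolTimeoutError(Exception):
    """Raised when a tool call exceeds its time budget."""

async def run_with_timeout(operation, seconds):
    """Await operation() and raise ToolTimeoutError if it exceeds `seconds`."""
    # A context manager cannot wrap asyncio.wait_for around its own body,
    # so the coroutine function is passed in and awaited here instead.
    try:
        return await asyncio.wait_for(operation(), timeout=seconds)
    except asyncio.TimeoutError:
        raise ToolTimeoutError(f"Operation timed out after {seconds}s") from None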
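A short, runnable illustration of timeout handling with `asyncio.wait_for` follows. `slow_tool` and `call_with_fallback` are hypothetical names, and the sleep stands in for a slow upstream call. The sketch returns a fallback string on timeout rather than raising, which is one option among several.

```python
import asyncio


async def slow_tool() -> str:
    """Simulate a tool that takes 0.2 seconds to respond."""
    await asyncio.sleep(0.2)
    return "done"


async def call_with_fallback(seconds: float) -> str:
    """Return the tool result, or a fallback value if the time budget is exceeded."""
    try:
        # wait_for cancels the pending call when the timeout expires.
        return await asyncio.wait_for(slow_tool(), timeout=seconds)
    except asyncio.TimeoutError:
        return "timed out"


print(asyncio.run(call_with_fallback(0.05)))  # timed out
print(asyncio.run(call_with_fallback(1.0)))   # done
```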
Tool Feedback Loops
Implement feedback mechanisms to improve tool reliability over time:
class ToolFeedbackSystem:
    # classify_error and get_tools_for_task are assumed to be defined elsewhere
    def __init__(self):
        self.success_metrics = {}
        self.error_patterns = {}
        self.performance_history = {}

    def record_tool_usage(self, tool_name, success, latency, error=None):
        # Track success rates
        if tool_name not in self.success_metrics:
            self.success_metrics[tool_name] = {'total': 0, 'success': 0}
        self.success_metrics[tool_name]['total'] += 1
        if success:
            self.success_metrics[tool_name]['success'] += 1
        # Track latency so slow tools can be spotted later
        self.performance_history.setdefault(tool_name, []).append(latency)
        # Track error patterns
        if error is not None:
            error_type = classify_error(error)
            if tool_name not in self.error_patterns:
                self.error_patterns[tool_name] = {}
            if error_type not in self.error_patterns[tool_name]:
                self.error_patterns[tool_name][error_type] = 0
            self.error_patterns[tool_name][error_type] += 1

    def get_tool_reliability(self, tool_name):
        if tool_name not in self.success_metrics:
            return 0.0
        metrics = self.success_metrics[tool_name]
        return metrics['success'] / metrics['total']

    def suggest_best_tool(self, task_type):
        # Recommend tools based on historical performance
        eligible_tools = self.get_tools_for_task(task_type)
        # default=None avoids a ValueError when no tool is eligible
        return max(eligible_tools, key=self.get_tool_reliability, default=None)
Best Practices Summary
Retry Logic
- Use exponential backoff with jitter
- Implement maximum retry limits
- Distinguish between retryable and non-retryable errors
- Log retry attempts for debugging
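The points above can be combined in a retry helper that only retries transient failures and logs each attempt. This is a sketch under stated assumptions: `RetryableError`, `PermanentError`, and `retry_transient` are hypothetical names, and which exceptions count as transient depends on the tool being called.

```python
import logging
import time
from typing import Any, Callable, Tuple, Type

logger = logging.getLogger("tool_retry")


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, 429, 503)."""


class PermanentError(Exception):
    """Hypothetical marker for failures a retry cannot fix (bad input, 401, 404)."""


def retry_transient(
    func: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 0.5,
    retryable: Tuple[Type[BaseException], ...] = (
        RetryableError, TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry func on transient errors only; anything else propagates at once."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable as error:
            if attempt == max_retries:
                logger.error("giving up after %d attempts: %s", attempt + 1, error)
                raise
            delay = base_delay * 2 ** attempt
            logger.warning("attempt %d failed (%s); retrying in %.2fs",
                           attempt + 1, error, delay)
            sleep(delay)
```

Because the `except` clause names only the retryable types, a `PermanentError` (or any other exception) is raised on the first attempt without any delay. The injectable `sleep` argument exists so tests can run without waiting.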
Error Classification
- Categorize errors by type and severity
- Create error hierarchies for handling
- Use structured error objects
- Maintain error code standards
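One way to apply these points is a small exception hierarchy plus a structured report object that handlers, logs, and prompts can all consume. The class names, error codes, and fields below are illustrative assumptions, not an established standard.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


class ToolError(Exception):
    """Base class for a hypothetical tool error hierarchy."""
    code = "tool_error"
    retryable = False


class ToolRateLimitError(ToolError):
    code = "rate_limit"
    retryable = True


class ToolValidationError(ToolError):
    code = "validation"


@dataclass
class ErrorReport:
    """Structured, serializable description of a failed tool call."""
    tool_name: str
    code: str
    message: str
    retryable: bool
    details: Dict[str, Any] = field(default_factory=dict)


def to_report(tool_name: str, error: Exception) -> ErrorReport:
    """Convert any exception into an ErrorReport with a stable error code."""
    if isinstance(error, ToolError):
        return ErrorReport(tool_name, error.code, str(error), error.retryable)
    # Errors outside the hierarchy are reported as unknown and not retried.
    return ErrorReport(tool_name, "unknown", str(error), False)


report = to_report("search", ToolRateLimitError("slow down"))
print(asdict(report))
```

Keeping `code` and `retryable` on the exception class means a new error type declares its handling policy in one place.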
Monitoring & Alerting
- Track error rates and patterns
- Set up automated alerts for failures
- Monitor tool performance metrics
- Create error dashboards
User Experience
- Provide clear error messages
- Suggest alternative approaches
- Show progress during retries
- Allow user intervention when needed