5.3 Tool Error Handling & Retries

Guardrails, tool feedback loops, self-correction prompts

Overview

When AI agents call external tools and APIs, failures are inevitable: network timeouts, API rate limits, malformed requests, and unexpected responses all occur routinely. This section covers retry strategies, self-correction prompts, guardrails, and feedback loops for building resilient tool-calling systems.

Key Error Handling Principles

  • Graceful Degradation: Fail gracefully with meaningful feedback
  • Intelligent Retries: Use exponential backoff and circuit breakers
  • Self-Correction: Enable agents to learn from errors and adapt
  • User Transparency: Communicate errors clearly to users

Common Tool Error Categories

Network & Connectivity
  • Connection timeouts
  • DNS resolution failures
  • SSL/TLS certificate issues
  • Proxy/firewall blocks
API Rate Limiting
  • Requests-per-minute limits
  • Concurrent connection limits
  • Quota exhaustion
  • Token bucket depletion
Input Validation
  • Malformed JSON/XML
  • Missing required parameters
  • Invalid data types
  • Schema violations
Service Errors
  • Internal server errors (500)
  • Authentication failures
  • Resource not found (404)
  • Service unavailable (503)
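
Later examples in this section call a `classify_error` helper without defining it. One way it might look is sketched below, mapping an exception (and an optional HTTP status code) onto the four categories above. The category strings, the `status_code` parameter, and the `is_retryable` helper are assumptions made for illustration, not part of any particular framework.

```python
from typing import Optional

# Hypothetical category names mirroring the four groups listed above.
NETWORK = "network"
RATE_LIMIT = "rate_limit"
VALIDATION = "validation"
SERVICE = "service"
UNKNOWN = "unknown"


def classify_error(error: Exception, status_code: Optional[int] = None) -> str:
    """Map an exception (and optional HTTP status) to a coarse category."""
    if status_code is not None:
        if status_code == 429:
            return RATE_LIMIT
        if status_code in (400, 422):
            return VALIDATION
        if status_code in (401, 403, 404) or 500 <= status_code < 600:
            return SERVICE
    # TimeoutError and ConnectionError are built-in OSError subclasses.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return NETWORK
    # json.JSONDecodeError is a ValueError, so malformed JSON lands here.
    if isinstance(error, (ValueError, TypeError, KeyError)):
        return VALIDATION
    return UNKNOWN


def is_retryable(category: str, status_code: Optional[int] = None) -> bool:
    """Transient failures are worth retrying; validation errors are not."""
    if category in (NETWORK, RATE_LIMIT):
        return True
    # Among service errors, only 5xx responses are plausibly transient.
    return category == SERVICE and status_code is not None and status_code >= 500
```

The split matters because the right response differs by category: network and rate-limit errors call for waiting and retrying, while validation errors call for changing the request.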

Retry Strategies & Patterns

1. Exponential Backoff with Jitter

```python
import random
import time
from typing import Any, Callable


def exponential_backoff_retry(
    func: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
) -> Any:
    """Retry function with exponential backoff and optional jitter"""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            # Out of retries: re-raise with the original traceback
            if attempt == max_retries:
                raise

            # Calculate delay with exponential backoff, capped at max_delay
            delay = min(base_delay * (exponential_base ** attempt), max_delay)

            # Add jitter (50-100% of the delay) to prevent thundering herd
            if jitter:
                delay *= 0.5 + random.random() * 0.5

            time.sleep(delay)
```
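
The function above retries on every exception, including ones that cannot succeed on a second attempt, such as a malformed request. The variant below is a minimal sketch that retries only a caller-supplied set of exception types and logs each attempt. The name `retry_transient`, the `retryable` tuple, the injectable `sleep` parameter, and the `flaky_tool` stub are hypothetical.

```python
import logging
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("tool_retry")


def retry_transient(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only exceptions listed in `retryable`; others propagate at once."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable as exc:
            if attempt == max_retries:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= 0.5 + random.random() * 0.5  # jitter: 50-100% of delay
            logger.warning(
                "attempt %d failed (%r); retrying in %.2fs", attempt + 1, exc, delay
            )
            sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises


# Usage: a stub tool that times out twice, then succeeds.
calls = {"count": 0}


def flaky_tool() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"


# Passing a no-op sleep keeps the demonstration instant.
result = retry_transient(flaky_tool, sleep=lambda _seconds: None)
```

Making `sleep` a parameter is a testing convenience: production code keeps the default `time.sleep`, while tests can pass a no-op and still exercise the retry path.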

2. Circuit Breaker Pattern

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func):
        if self.state == "open":
            # After the recovery timeout, let one trial call through
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("Circuit breaker is open")

        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed half-open trial also lands here: the count is still
            # at or above the threshold, so the circuit reopens.
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise

        # Success: a half-open trial closes the circuit and resets the count.
        # Note that failures in the closed state accumulate until then.
        if self.state == "half-open":
            self.state = "closed"
            self.failure_count = 0
        return result
```

Self-Correction Prompts

When tools fail, agents can often self-correct by analyzing the error and adjusting their approach:

```python
SELF_CORRECTION_PROMPT = """
You attempted to call a tool but received an error. Here's what happened:

Tool Called: {tool_name}
Parameters: {parameters}
Error: {error_message}

Please analyze this error and either:
1. Correct the parameters and try again
2. Use an alternative tool or approach
3. Explain why the task cannot be completed

Guidelines:
- Check parameter types and formats
- Verify required fields are present
- Consider alternative tools if this one is unavailable
- Be specific about what went wrong and how you're fixing it

Your response:
"""
```

Contextual Error Recovery

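```python
def handle_tool_error(tool_name, params, error, context):
    """Generate context-aware error recovery prompt"""
    # classify_error, get_required_fields, get_expected_types,
    # get_format_examples and suggest_alternative_tools are helpers
    # assumed to be defined elsewhere.
    error_type = classify_error(error)

    if error_type == "validation":
        prompt = f"""
        Parameter validation failed for {tool_name}.
        Error: {error}

        Please check these common issues:
        - Required fields: {get_required_fields(tool_name)}
        - Data types: {get_expected_types(tool_name)}
        - Format examples: {get_format_examples(tool_name)}

        Correct the parameters and try again.
        """
    elif error_type == "rate_limit":
        prompt = f"""
        Rate limit exceeded for {tool_name}.
        Wait before retrying or use an alternative approach.

        Alternatives: {suggest_alternative_tools(tool_name, context)}
        """
    else:
        # Fall back to the generic prompt so that prompt is always defined
        prompt = SELF_CORRECTION_PROMPT.format(
            tool_name=tool_name, parameters=params, error_message=error
        )

    return prompt
```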
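
A recovery prompt is only useful inside a loop that sends it to the model and bounds the number of correction attempts. The following sketch shows one possible shape for that loop. `run_with_self_correction`, `execute_tool`, and `ask_model` are hypothetical names; in a real agent, `ask_model` would wrap an LLM call and parse its reply into a new tool call.

```python
import json
from typing import Any, Callable, Dict, Optional, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool_name, parameters)


def run_with_self_correction(
    call: ToolCall,
    execute_tool: Callable[[str, Dict[str, Any]], Any],
    ask_model: Callable[[str], Optional[ToolCall]],
    max_corrections: int = 2,
) -> Any:
    """Run a tool call; on failure, ask the model for a corrected call.

    `ask_model` returns a new (tool_name, parameters) pair, or None when
    the model decides the task cannot be completed.
    """
    tool_name, params = call
    for corrections_used in range(max_corrections + 1):
        try:
            return execute_tool(tool_name, params)
        except Exception as error:
            if corrections_used == max_corrections:
                raise  # correction budget spent: surface the last error
            prompt = (
                "You attempted to call a tool but received an error.\n"
                f"Tool Called: {tool_name}\n"
                f"Parameters: {json.dumps(params)}\n"
                f"Error: {error}\n"
                "Correct the parameters, pick another tool, or give up."
            )
            revised = ask_model(prompt)
            if revised is None:
                raise  # the model declined to retry
            tool_name, params = revised
    raise AssertionError("unreachable")  # the loop always returns or raises


# Usage with stand-ins for the tool runtime and the model.
def execute_tool(name: str, params: Dict[str, Any]) -> Dict[str, Any]:
    if not isinstance(params.get("limit"), int):
        raise ValueError("limit must be an integer")
    return {"tool": name, "rows": params["limit"]}


def ask_model(prompt: str) -> Optional[ToolCall]:
    # A real implementation would send `prompt` to the model.
    return ("search", {"limit": 10})


outcome = run_with_self_correction(
    ("search", {"limit": "ten"}), execute_tool, ask_model
)
```

Capping corrections keeps a confused model from looping indefinitely on a call that cannot be fixed.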

Implementing Guardrails

1. Input Validation Guardrails

```python
class ToolGuardrails:
    # SecurityError, RateLimitError and ContentError, as well as the helper
    # methods called below (get_tool_schema, validate_schema, and so on),
    # are assumed to be defined elsewhere.

    def validate_input(self, tool_name, params):
        # Schema validation
        schema = self.get_tool_schema(tool_name)
        if not self.validate_schema(params, schema):
            raise ValueError(f"Invalid parameters for {tool_name}")

        # Security checks
        if self.contains_sensitive_data(params):
            raise SecurityError("Sensitive data detected")

        # Rate limiting
        if not self.check_rate_limit(tool_name):
            raise RateLimitError("Rate limit exceeded")

    def validate_output(self, tool_name, result):
        # Output sanitization
        sanitized = self.sanitize_output(result)

        # Content filtering
        if self.contains_harmful_content(sanitized):
            raise ContentError("Harmful content detected")

        return sanitized
```
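
The guardrail class raises `SecurityError`, `RateLimitError`, and `ContentError` without showing their definitions. A small hierarchy along the following lines would satisfy those calls while carrying structured fields. This is a sketch: the `ToolError` base class, the `tool_name`, `retryable`, and `retry_after` attributes, and the `to_dict` method are assumptions.

```python
from typing import Any, Dict, Optional


class ToolError(Exception):
    """Hypothetical base class for all tool-related failures."""

    retryable = False  # subclasses override when a retry may succeed

    def __init__(self, message: str, tool_name: Optional[str] = None):
        super().__init__(message)
        self.tool_name = tool_name

    def to_dict(self) -> Dict[str, Any]:
        """Structured form for logs or for feeding back to the agent."""
        return {
            "type": type(self).__name__,
            "message": str(self),
            "tool_name": self.tool_name,
            "retryable": self.retryable,
        }


class SecurityError(ToolError):
    """Input contained data that must not be sent to the tool."""


class ContentError(ToolError):
    """Tool output failed content filtering."""


class RateLimitError(ToolError):
    """Too many calls; waiting and retrying may succeed."""

    retryable = True

    def __init__(
        self,
        message: str,
        tool_name: Optional[str] = None,
        retry_after: Optional[float] = None,
    ):
        super().__init__(message, tool_name)
        self.retry_after = retry_after  # seconds to wait, if known


# Usage: catch the base class, then branch on the structured fields.
try:
    raise RateLimitError("Rate limit exceeded", tool_name="search", retry_after=2.0)
except ToolError as err:
    report = err.to_dict()
```

Catching `ToolError` in one place lets the caller handle every guardrail failure uniformly, while the `retryable` flag keeps the retry decision next to the error definition.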

2. Timeout Management

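```python
import asyncio


class ToolTimeoutError(Exception):
    """Raised when a tool call exceeds its time budget."""


async def run_with_timeout(operation, seconds):
    """Await operation() and raise ToolTimeoutError if it takes too long.

    operation is a zero-argument callable returning an awaitable. A plain
    coroutine replaces the original context-manager form, which called an
    undefined operation() and could not wrap arbitrary code.
    """
    try:
        return await asyncio.wait_for(operation(), timeout=seconds)
    except asyncio.TimeoutError:
        raise ToolTimeoutError(f"Operation timed out after {seconds}s") from None
```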
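
Timeouts combine naturally with graceful degradation: rather than letting one slow tool abort the whole step, each call can be given its own deadline and turned into a structured result. The sketch below uses `asyncio.wait_for` and `asyncio.gather` for this. `call_with_deadline`, the result dictionary shape, and the two stub tools are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def call_with_deadline(
    name: str,
    operation: Callable[[], Awaitable[Any]],
    seconds: float,
) -> Dict[str, Any]:
    """Run one tool call under a deadline and report the outcome as data."""
    try:
        value = await asyncio.wait_for(operation(), timeout=seconds)
        return {"tool": name, "ok": True, "value": value}
    except asyncio.TimeoutError:
        # wait_for cancels the underlying operation before raising.
        return {"tool": name, "ok": False, "error": f"timed out after {seconds}s"}


# Stub tools standing in for real network calls.
async def fast_tool() -> str:
    await asyncio.sleep(0.01)
    return "fast result"


async def slow_tool() -> str:
    await asyncio.sleep(1.0)
    return "slow result"


async def main() -> List[Dict[str, Any]]:
    # Both calls run concurrently; each has its own time budget.
    return await asyncio.gather(
        call_with_deadline("fast", fast_tool, 0.5),
        call_with_deadline("slow", slow_tool, 0.05),
    )


results = asyncio.run(main())
```

Here the fast call returns its value and the slow call is reported as a failure, so the agent can continue with partial results or pick an alternative tool.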

Tool Feedback Loops

Implement feedback mechanisms to improve tool reliability over time:

```python
class ToolFeedbackSystem:
    def __init__(self):
        self.success_metrics = {}
        self.error_patterns = {}
        self.performance_history = {}

    def record_tool_usage(self, tool_name, success, latency, error=None):
        # Track success rates
        if tool_name not in self.success_metrics:
            self.success_metrics[tool_name] = {'total': 0, 'success': 0}
        self.success_metrics[tool_name]['total'] += 1
        if success:
            self.success_metrics[tool_name]['success'] += 1

        # Track latency history (the original accepted latency but never stored it)
        self.performance_history.setdefault(tool_name, []).append(latency)

        # Track error patterns (classify_error is assumed to be defined elsewhere)
        if error is not None:
            error_type = classify_error(error)
            if tool_name not in self.error_patterns:
                self.error_patterns[tool_name] = {}
            if error_type not in self.error_patterns[tool_name]:
                self.error_patterns[tool_name][error_type] = 0
            self.error_patterns[tool_name][error_type] += 1

    def get_tool_reliability(self, tool_name):
        if tool_name not in self.success_metrics:
            return 0.0
        metrics = self.success_metrics[tool_name]
        return metrics['success'] / metrics['total']

    def suggest_best_tool(self, task_type):
        # Recommend tools based on historical performance
        # (get_tools_for_task is assumed to be implemented elsewhere)
        eligible_tools = self.get_tools_for_task(task_type)
        # default=None avoids a ValueError when no tool is eligible
        return max(eligible_tools, key=self.get_tool_reliability, default=None)
```

Best Practices Summary

Retry Logic
  • Use exponential backoff with jitter
  • Implement maximum retry limits
  • Distinguish between retryable and non-retryable errors
  • Log retry attempts for debugging
Error Classification
  • Categorize errors by type and severity
  • Create error hierarchies for handling
  • Use structured error objects
  • Maintain error code standards
Monitoring & Alerting
  • Track error rates and patterns
  • Set up automated alerts for failures
  • Monitor tool performance metrics
  • Create error dashboards
User Experience
  • Provide clear error messages
  • Suggest alternative approaches
  • Show progress during retries
  • Allow user intervention when needed