Tool Error Handling & Retries

Overview

When AI agents call external tools and APIs, failures are inevitable: network timeouts, API rate limits, malformed requests, and unexpected responses all occur routinely. This section covers retry strategies, self-correction prompts, guardrails, and feedback loops for building resilient tool-calling systems.

Key Error Handling Principles

Graceful Degradation: Fail gracefully with meaningful feedback
Intelligent Retries: Use exponential backoff and circuit breakers
Self-Correction: Enable agents to learn from errors and adapt
User Transparency: Communicate errors clearly to users

Common Tool Error Categories

Network & Connectivity

Connection timeouts
DNS resolution failures
SSL/TLS certificate issues
Proxy/firewall blocks

API Rate Limiting

Requests-per-minute limits
Concurrent connection limits
Quota exhaustion
Token bucket depletion

Input Validation

Malformed JSON/XML
Missing required parameters
Invalid data types
Schema violations

Service Errors

Internal server errors (5xx)
Authentication failures (401/403)
Resource not found (404)
Service unavailable (503)
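
Later snippets in this section call a classify_error helper without defining it. One possible shape for it, mapping the four categories above onto labels, is sketched below. The category labels, the status_code attribute convention, and the choice of built-in exception types are all assumptions made for illustration, not part of any particular client library.

```python
import json

# Category labels are assumptions chosen to mirror the four headings above.
NETWORK = "network"
RATE_LIMIT = "rate_limit"
VALIDATION = "validation"
SERVICE = "service"
UNKNOWN = "unknown"


def classify_error(error: Exception) -> str:
    """Map an exception onto one of the error categories listed above."""
    # Hypothetical convention: HTTP client errors expose a `status_code` attribute.
    status = getattr(error, "status_code", None)
    if status == 429:
        return RATE_LIMIT
    if status in (400, 422):
        return VALIDATION
    if status is not None and (status in (401, 403, 404) or 500 <= status < 600):
        return SERVICE

    # Fall back to built-in exception types when no status code is present.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return NETWORK
    if isinstance(error, (json.JSONDecodeError, TypeError, ValueError, KeyError)):
        return VALIDATION
    return UNKNOWN
```

A real system would likely extend the mapping with the exception types raised by its own HTTP client.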

Retry Strategies & Patterns

1. Exponential Backoff with Jitter

import time
import random
from typing import Any, Callable

def exponential_backoff_retry(
    func: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True
) -> Any:
    """Retry func with exponential backoff and optional jitter."""

    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            # Note: this catches every Exception; narrow it to retryable
            # errors where the distinction matters.
            if attempt == max_retries:
                # Out of retries: re-raise with the original traceback
                raise

            # Calculate delay with exponential backoff, capped at max_delay
            delay = min(base_delay * (exponential_base ** attempt), max_delay)

            # Scale to 50-100% of the delay so clients do not retry in
            # lockstep (thundering herd)
            if jitter:
                delay *= (0.5 + random.random() * 0.5)

            time.sleep(delay)
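
To make the arithmetic concrete, the sketch below computes the delay schedule the retry function would follow before jitter is applied, using the same formula and defaults. The helper name backoff_delays is hypothetical and exists only for this illustration.

```python
from typing import List


def backoff_delays(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
) -> List[float]:
    """Return the un-jittered delay (in seconds) before each retry."""
    return [
        min(base_delay * (exponential_base ** attempt), max_delay)
        for attempt in range(max_retries)
    ]


print(backoff_delays())               # [1.0, 2.0, 4.0]
print(backoff_delays(max_retries=8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With jitter enabled, each of these values is multiplied by a random factor between 0.5 and 1.0, so the actual wait falls between half the listed delay and the full delay.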

2. Circuit Breaker Pattern

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open before probing
        self.failure_count = 0  # consecutive failures
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func):
        if self.state == "open":
            # After the recovery timeout, let one trial call through
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("Circuit breaker is open")

        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call re-opens the circuit immediately
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise

        # Any success closes the circuit and resets the consecutive-failure count
        self.state = "closed"
        self.failure_count = 0
        return result

Self-Correction Prompts

When tools fail, agents can often self-correct by analyzing the error and adjusting their approach:

SELF_CORRECTION_PROMPT = """
You attempted to call a tool but received an error. Here's what happened:

Tool Called: {tool_name}
Parameters: {parameters}
Error: {error_message}

Please analyze this error and either:
1. Correct the parameters and try again
2. Use an alternative tool or approach
3. Explain why the task cannot be completed

Guidelines:
- Check parameter types and formats
- Verify required fields are present
- Consider alternative tools if this one is unavailable
- Be specific about what went wrong and how you're fixing it

Your response:
"""

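One way the prompt above might be wired into a tool-calling loop is sketched here. The call_tool and ask_model callables are hypothetical stand-ins for the application's tool dispatcher and model client, and the convention that ask_model returns corrected parameters (or None to give up) is an assumption made for this example.

```python
import json
from typing import Any, Callable, Dict, Optional


def run_with_self_correction(
    call_tool: Callable[[str, Dict[str, Any]], Any],
    ask_model: Callable[[str], Optional[Dict[str, Any]]],
    prompt_template: str,
    tool_name: str,
    parameters: Dict[str, Any],
    max_corrections: int = 2,
) -> Any:
    """Call a tool; on failure, show the error to the model and retry
    with the parameters it proposes."""
    for attempt in range(max_corrections + 1):
        try:
            return call_tool(tool_name, parameters)
        except Exception as error:
            if attempt == max_corrections:
                raise  # correction budget exhausted
            prompt = prompt_template.format(
                tool_name=tool_name,
                parameters=json.dumps(parameters, default=str),
                error_message=str(error),
            )
            corrected = ask_model(prompt)
            if corrected is None:
                # The model chose to stop (option 3 in the prompt).
                raise
            parameters = corrected
```

Bounding the number of corrections keeps a confused model from looping indefinitely on the same failing call.
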
Contextual Error Recovery

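def handle_tool_error(tool_name, params, error, context):
    """Generate a context-aware error recovery prompt.

    classify_error, get_required_fields, get_expected_types,
    get_format_examples, and suggest_alternative_tools are helpers
    assumed to be defined elsewhere in the application.
    """
    error_type = classify_error(error)

    if error_type == "validation":
        prompt = f"""
        Parameter validation failed for {tool_name}.
        Error: {error}

        Please check these common issues:
        - Required fields: {get_required_fields(tool_name)}
        - Data types: {get_expected_types(tool_name)}
        - Format examples: {get_format_examples(tool_name)}

        Correct the parameters and try again.
        """

    elif error_type == "rate_limit":
        prompt = f"""
        Rate limit exceeded for {tool_name}.
        Wait before retrying or use an alternative approach.

        Alternatives:
        {suggest_alternative_tools(tool_name, context)}
        """

    else:
        # Fallback so that every error type yields a prompt
        prompt = f"""
        {tool_name} failed with parameters {params}.
        Error: {error}

        Retry with adjusted parameters, use an alternative tool,
        or explain why the task cannot be completed.
        """

    return prompt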

Implementing Guardrails

1. Input Validation Guardrails

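class SecurityError(Exception):
    """Raised when parameters contain sensitive data."""

class RateLimitError(Exception):
    """Raised when a tool's rate limit is exceeded."""

class ContentError(Exception):
    """Raised when tool output contains harmful content."""

class ToolGuardrails:
    # get_tool_schema, validate_schema, contains_sensitive_data,
    # check_rate_limit, sanitize_output, and contains_harmful_content
    # are assumed to be implemented by a subclass or elsewhere.

    def validate_input(self, tool_name, params):
        # Schema validation
        schema = self.get_tool_schema(tool_name)
        if not self.validate_schema(params, schema):
            raise ValueError(f"Invalid parameters for {tool_name}")

        # Security checks
        if self.contains_sensitive_data(params):
            raise SecurityError("Sensitive data detected")

        # Rate limiting
        if not self.check_rate_limit(tool_name):
            raise RateLimitError("Rate limit exceeded")

    def validate_output(self, tool_name, result):
        # Output sanitization
        sanitized = self.sanitize_output(result)

        # Content filtering
        if self.contains_harmful_content(sanitized):
            raise ContentError("Harmful content detected")

        return sanitized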

2. Timeout Management

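import asyncio

class ToolTimeoutError(Exception):
    """Raised when a tool call exceeds its time budget."""

async def run_with_timeout(awaitable, seconds):
    """Await a tool call and raise ToolTimeoutError if it exceeds the limit."""
    try:
        # wait_for cancels the underlying call when the timeout expires
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        raise ToolTimeoutError(f"Operation timed out after {seconds}s") from None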

Tool Feedback Loops

Implement feedback mechanisms to improve tool reliability over time:

class ToolFeedbackSystem:
    # classify_error and get_tools_for_task are assumed to be defined
    # elsewhere in the application.

    def __init__(self):
        self.success_metrics = {}
        self.error_patterns = {}
        self.performance_history = {}

    def record_tool_usage(self, tool_name, success,
                          latency, error=None):
        # Track success rates
        if tool_name not in self.success_metrics:
            self.success_metrics[tool_name] = {'total': 0, 'success': 0}

        self.success_metrics[tool_name]['total'] += 1
        if success:
            self.success_metrics[tool_name]['success'] += 1

        # Track latency per call
        self.performance_history.setdefault(tool_name, []).append(latency)

        # Track error patterns
        if error:
            error_type = classify_error(error)
            if tool_name not in self.error_patterns:
                self.error_patterns[tool_name] = {}
            if error_type not in self.error_patterns[tool_name]:
                self.error_patterns[tool_name][error_type] = 0
            self.error_patterns[tool_name][error_type] += 1

    def get_tool_reliability(self, tool_name):
        # Tools with no recorded usage are treated as unreliable
        if tool_name not in self.success_metrics:
            return 0.0

        metrics = self.success_metrics[tool_name]
        return metrics['success'] / metrics['total']

    def suggest_best_tool(self, task_type):
        # Recommend tools based on historical performance;
        # returns None when no tool is eligible for the task
        eligible_tools = self.get_tools_for_task(task_type)
        return max(eligible_tools, key=self.get_tool_reliability, default=None)

Best Practices Summary

Retry Logic

Use exponential backoff with jitter
Implement maximum retry limits
Distinguish between retryable and non-retryable errors
Log retry attempts for debugging
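
The retry practices above can be combined in a few lines. The sketch below retries only exceptions considered transient, lets everything else propagate on the first failure, and logs each retry. Treating TimeoutError and ConnectionError as the retryable set is an assumption for illustration, and the function name retry_transient and its injectable sleep parameter are hypothetical. Jitter is left out here only so the delays stay deterministic.

```python
import logging
import time
from typing import Any, Callable, Tuple, Type

logger = logging.getLogger(__name__)

# Assumption: these built-in exceptions stand in for transient failures.
RETRYABLE_ERRORS: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError)


def retry_transient(
    func: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retryable: Tuple[Type[Exception], ...] = RETRYABLE_ERRORS,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry func on retryable errors only; other errors propagate at once."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable as error:
            if attempt == max_retries:
                raise  # maximum retry limit reached
            delay = base_delay * (2 ** attempt)
            logger.warning(
                "Attempt %d failed (%r); retrying in %.1fs",
                attempt + 1, error, delay,
            )
            sleep(delay)
```

A validation error such as ValueError is never caught here, so a request that cannot succeed is not repeated.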

Error Classification

Categorize errors by type and severity
Create error hierarchies for handling
Use structured error objects
Maintain error code standards
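
As an illustration of an error hierarchy with structured error objects, the sketch below defines a small family of exceptions that carry a stable code, a retryable flag, and a serializable payload. The class names and code strings are hypothetical and chosen only for this example.

```python
from typing import Any, Dict, Optional


class ToolError(Exception):
    """Base class for tool failures, carrying structured fields."""

    code = "TOOL_ERROR"
    retryable = False

    def __init__(self, message: str, *, tool_name: str,
                 details: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.message = message
        self.tool_name = tool_name
        self.details = details or {}

    def to_dict(self) -> Dict[str, Any]:
        """Serialize for logs, dashboards, or a self-correction prompt."""
        return {
            "code": self.code,
            "tool": self.tool_name,
            "message": self.message,
            "retryable": self.retryable,
            "details": self.details,
        }


class ToolValidationError(ToolError):
    code = "TOOL_VALIDATION"


class ToolRateLimitError(ToolError):
    code = "TOOL_RATE_LIMIT"
    retryable = True


class ToolUnavailableError(ToolError):
    code = "TOOL_UNAVAILABLE"
    retryable = True
```

Handlers can then catch ToolError broadly, or branch on the retryable flag instead of parsing message strings.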

Monitoring & Alerting

Track error rates and patterns
Set up automated alerts for failures
Monitor tool performance metrics
Create error dashboards
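
A minimal sketch of error-rate tracking with an alert condition follows. It keeps a sliding window of recent outcomes and reports when the failure share exceeds a threshold. The class name, window size, threshold, and minimum sample count are all assumptions; a production system would more likely feed these counts into an existing metrics and alerting stack.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent call outcomes and flag an elevated error rate."""

    def __init__(self, window_size: int = 50, alert_threshold: float = 0.2,
                 min_samples: int = 10) -> None:
        # deque(maxlen=...) discards the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples

    def error_rate(self) -> float:
        """Share of failures among the outcomes currently in the window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, success: bool) -> bool:
        """Record one outcome; return True when an alert should fire."""
        self.outcomes.append(success)
        enough_data = len(self.outcomes) >= self.min_samples
        return enough_data and self.error_rate() > self.alert_threshold
```

The min_samples guard avoids alerting on the first one or two calls, where a single failure would otherwise look like a very high error rate.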

User Experience

Provide clear error messages
Suggest alternative approaches
Show progress during retries
Allow user intervention when needed
