Error Handling

This guide covers how Tako VM reports and classifies execution errors, the built-in resilience features, and client-side patterns for handling failures.

Error Response Format

When execution fails, the response includes error details:

{
  "success": false,
  "output": null,
  "stdout": "",
  "stderr": "Traceback (most recent call last):\n...",
  "exit_code": 1,
  "error": "ZeroDivisionError: division by zero",
  "execution_time": 0.15
}

For async jobs, the ExecutionRecord includes structured error information:

{
  "execution_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "failed",
  "error": {
    "type": "division_error",
    "message": "Division by zero"
  }
}
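
Because the `error` field is a plain string in the synchronous response and an object in the ExecutionRecord, client code may want to normalize both shapes before branching on them. The following is a minimal sketch; `normalize_error` is a hypothetical helper name, not part of Tako VM:

```python
def normalize_error(payload):
    """Return (error_type, message) for either error shape shown above.

    Hypothetical helper: handles the plain-string form of the synchronous
    response and the {"type", "message"} form of the ExecutionRecord.
    """
    error = payload.get("error")
    if error is None:
        return None, None
    if isinstance(error, dict):
        return error.get("type"), error.get("message")
    # Plain string: no structured type is available.
    return None, str(error)


sync_response = {"success": False, "error": "ZeroDivisionError: division by zero"}
async_record = {
    "status": "failed",
    "error": {"type": "division_error", "message": "Division by zero"},
}

print(normalize_error(sync_response))
print(normalize_error(async_record))
```

Returning a tuple keeps the calling code identical for synchronous and async results; a missing type simply comes back as `None`.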

Error Classification

Tako VM automatically classifies errors into specific types for easier handling:

Signal-Based Errors (Exit Codes)

Exit Code	Error Type	Description
137	`oom`	Out of memory (SIGKILL)
143	`cancelled`	Process terminated (SIGTERM)
139	`segfault`	Segmentation fault (SIGSEGV)
134	`abort`	Process aborted (SIGABRT)
136	`arithmetic_error`	Floating point exception (SIGFPE)
135	`bus_error`	Bus error (SIGBUS)
141	`pipe_error`	Broken pipe (SIGPIPE)
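
The exit codes in this table follow the common shell and container convention of 128 plus the signal number (for example, 137 = 128 + 9 for SIGKILL). The sketch below shows how a client might map an exit code back to the error type listed above. `EXIT_CODE_TYPES`, `signal_number`, and `classify_exit_code` are illustrative names, not part of Tako VM's API, and the signal numbers assume Linux:

```python
# Exit code -> error type, copied from the table above.
EXIT_CODE_TYPES = {
    137: "oom",               # 128 + 9  (SIGKILL)
    143: "cancelled",         # 128 + 15 (SIGTERM)
    139: "segfault",          # 128 + 11 (SIGSEGV)
    134: "abort",             # 128 + 6  (SIGABRT)
    136: "arithmetic_error",  # 128 + 8  (SIGFPE)
    135: "bus_error",         # 128 + 7  (SIGBUS on Linux)
    141: "pipe_error",        # 128 + 13 (SIGPIPE)
}


def signal_number(exit_code):
    """Return the signal number encoded in an exit code, or None."""
    return exit_code - 128 if exit_code > 128 else None


def classify_exit_code(exit_code):
    """Return the error type for a signal-based exit code, or None."""
    return EXIT_CODE_TYPES.get(exit_code)


print(classify_exit_code(137), signal_number(137))
```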

Python Error Types

Error Type	Python Exception	Description
`syntax_error`	SyntaxError, IndentationError	Invalid Python syntax
`import_error`	ImportError, ModuleNotFoundError	Missing module
`type_error`	TypeError	Type mismatch
`value_error`	ValueError	Invalid value
`key_error`	KeyError	Missing dictionary key
`index_error`	IndexError	List index out of range
`attribute_error`	AttributeError	Missing attribute
`name_error`	NameError	Undefined variable
`file_not_found`	FileNotFoundError	File doesn't exist
`recursion_error`	RecursionError	Stack overflow
`division_error`	ZeroDivisionError	Division by zero
`encoding_error`	UnicodeError	Encoding/decoding failure
`json_error`	JSONDecodeError	Invalid JSON
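
Several exceptions in this table are subclasses of others (`JSONDecodeError` and `UnicodeError` derive from `ValueError`, `ModuleNotFoundError` from `ImportError`), so a classifier has to test the more specific class first. How Tako VM does this internally is not described here; the following is only a sketch of the ordering concern, with `EXCEPTION_TYPES` and `classify_exception` as hypothetical names:

```python
import json

# Ordered so that subclasses are tested before their base classes.
EXCEPTION_TYPES = [
    (json.JSONDecodeError, "json_error"),   # subclass of ValueError
    (UnicodeError, "encoding_error"),       # subclass of ValueError
    (ValueError, "value_error"),
    (SyntaxError, "syntax_error"),          # also covers IndentationError
    (ImportError, "import_error"),          # also covers ModuleNotFoundError
    (TypeError, "type_error"),
    (KeyError, "key_error"),
    (IndexError, "index_error"),
    (AttributeError, "attribute_error"),
    (NameError, "name_error"),
    (FileNotFoundError, "file_not_found"),
    (RecursionError, "recursion_error"),
    (ZeroDivisionError, "division_error"),
]


def classify_exception(exc):
    """Return the error type for an exception instance, or None if unlisted."""
    for exc_class, error_type in EXCEPTION_TYPES:
        if isinstance(exc, exc_class):
            return error_type
    return None


try:
    json.loads("{not valid")
except Exception as exc:
    print(classify_exception(exc))
```

If `ValueError` were listed first, invalid JSON would be reported as `value_error` rather than `json_error`.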

System Errors

Error Type	Description
`permission`	Permission denied
`os_error`	OS/IO error
`network_error`	Connection failed
`network_timeout`	Network request timed out
`timeout`	Execution exceeded time limit
`killed`	Process killed by system
`service_unavailable`	Docker circuit breaker open
`internal_error`	Tako VM internal error

Built-in Resilience Features

Circuit Breaker

Tako VM includes a circuit breaker that prevents cascading failures when Docker is unavailable:

GET /health

{
  "status": "healthy",
  "docker_available": true,
  "circuit_breaker": {
    "state": "closed",
    "failure_count": 0,
    "success_count": 5,
    "last_failure": null
  }
}

Circuit States:

State	Description
`closed`	Normal operation, requests pass through
`open`	Docker failing, requests rejected immediately
`half_open`	Testing recovery, limited requests allowed

When the circuit is open, jobs return immediately with:

{
  "error": {
    "type": "service_unavailable",
    "message": "Service temporarily unavailable"
  }
}
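
To make the three states concrete, here is a minimal sketch of a circuit breaker state machine. It is not Tako VM's implementation: the class name, the failure threshold, and the recovery timeout are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Minimal three-state circuit breaker (illustrative, not Tako VM's code)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.recovery_timeout = recovery_timeout    # assumed value, in seconds
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = None

    def allow_request(self):
        """Return True if a request may pass through."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                # Enough time has passed: let trial requests through.
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self):
        """A success closes the circuit and resets the failure count."""
        self.state = "closed"
        self.failure_count = 0

    def record_failure(self):
        """Open the circuit after repeated failures or a failed trial."""
        self.failure_count += 1
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


breaker = CircuitBreaker(failure_threshold=2)
breaker.record_failure()
breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

This sketch lets every request through while half-open until a result is recorded; a fuller version would cap the number of trial requests, as the `half_open` description suggests.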

Automatic Retry

Tako VM automatically retries transient Docker failures with exponential backoff:

Max attempts: 2
Base delay: 1 second
Exponential backoff with jitter

Transient errors that trigger retry:

- Circuit breaker open
- Docker daemon connection issues
- Temporary resource unavailability
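
The retry policy above can be sketched as follows. The guide does not say which jitter strategy Tako VM uses, so this example assumes "full jitter" (a uniform random delay up to the exponential ceiling). `backoff_delay` and `call_with_retry` are hypothetical names, and the built-in `ConnectionError` stands in for a transient Docker failure:

```python
import random
import time


def backoff_delay(attempt, base_delay=1.0):
    """Delay in seconds before retry number `attempt` (0-based).

    Full jitter: uniform between 0 and base_delay * 2**attempt.
    """
    ceiling = base_delay * (2 ** attempt)
    return random.uniform(0, ceiling)


def call_with_retry(operation, max_attempts=2, base_delay=1.0, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure
            sleep(backoff_delay(attempt, base_delay))


attempts = []


def flaky_operation():
    """Fail once, then succeed (simulates a transient failure)."""
    attempts.append(1)
    if len(attempts) < 2:
        raise ConnectionError("daemon not reachable")
    return "ok"


print(call_with_retry(flaky_operation, sleep=lambda seconds: None))
```

Jitter spreads retries from many clients over time, so they do not all hit a recovering Docker daemon at the same instant.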

Dead Letter Queue (DLQ)

Jobs that fail with internal errors are automatically added to a dead letter queue for investigation:

# View DLQ statistics
curl http://localhost:8000/dlq/stats

{
  "total": 3,
  "by_error_type": {
    "internal_error": 2,
    "service_unavailable": 1
  }
}

# List DLQ entries
curl http://localhost:8000/dlq

# Remove processed entry
curl -X DELETE http://localhost:8000/dlq/1

Correlation IDs

All requests are assigned a correlation ID for distributed tracing:

# Pass your own correlation ID
curl -H "X-Correlation-ID: my-trace-123" http://localhost:8000/execute/async ...

The correlation ID appears in:

- Response headers (X-Correlation-ID)
- Log messages
- DLQ entries
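
From Python, the same header can be attached to each request. The helper below is a small sketch; `correlation_headers` is a hypothetical name, and generating a UUID on the client when no ID is supplied is a design choice of this example rather than something Tako VM requires:

```python
import uuid


def correlation_headers(correlation_id=None):
    """Build request headers carrying a correlation ID.

    Generates a UUID4 when the caller does not supply one, so the same
    value can be logged client-side and matched against server logs.
    """
    if correlation_id is None:
        correlation_id = str(uuid.uuid4())
    return {"X-Correlation-ID": correlation_id}


headers = correlation_headers("my-trace-123")
print(headers)
```

The resulting dict can be passed as `headers=` to `requests.post`, and the value compared with the `X-Correlation-ID` header on the response.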

Common Errors

Syntax Errors

import requests

response = requests.post("http://localhost:8000/execute", json={
    "code": "def broken(",  # Invalid syntax
    "input_data": {}
})

result = response.json()
# success: false
# stderr contains a SyntaxError; the exact message depends on the Python
# version (e.g. "unexpected EOF while parsing" or "'(' was never closed")

Runtime Errors

code = """
import json
with open('/input/data.json') as f:
    data = json.load(f)
result = 100 / data['value']  # Division by zero if value=0
"""

response = requests.post("http://localhost:8000/execute", json={
    "code": code,
    "input_data": {"value": 0}
})

result = response.json()
# success: false
# stderr contains: "ZeroDivisionError: division by zero"

Timeout Errors

response = requests.post("http://localhost:8000/execute", json={
    "code": "import time; time.sleep(60)",
    "input_data": {},
    "timeout": 5
})

result = response.json()
# success: false
# error: "Execution timeout exceeded (5s)"

Import Errors

code = """
import pandas  # Not available in 'default' environment
"""

response = requests.post("http://localhost:8000/execute", json={
    "code": code,
    "input_data": {},
    "job_type": "default"
})

result = response.json()
# success: false
# stderr: "ModuleNotFoundError: No module named 'pandas'"

Solution: Use the correct environment:

response = requests.post("http://localhost:8000/execute", json={
    "code": code,
    "input_data": {},
    "job_type": "data-processing"  # Has pandas
})

Handling Errors in Python

import requests

def execute_safely(code, input_data, **kwargs):
    """Execute code with error handling."""
    try:
        response = requests.post(
            "http://localhost:8000/execute",
            json={"code": code, "input_data": input_data, **kwargs},
            # HTTP timeout: execution timeout plus a margin for overhead
            timeout=kwargs.get("timeout", 30) + 10
        )
        response.raise_for_status()
        result = response.json()

    except requests.exceptions.Timeout:
        return {"success": False, "error": "HTTP request timed out"}
    except requests.exceptions.ConnectionError:
        return {"success": False, "error": "Cannot connect to Tako VM"}
    except requests.exceptions.HTTPError as e:
        return {"success": False, "error": f"HTTP error: {e}"}
    except ValueError:
        # Response body was not valid JSON
        return {"success": False, "error": "Invalid JSON in response"}

    if not result.get("success"):
        # Log or handle the error
        print(f"Execution failed: {result.get('error')}")
        print(f"Stderr: {result.get('stderr')}")

    return result

Handling Errors by Type

class ResourceError(Exception):
    """Execution exceeded a resource limit (memory or time)."""

class RetryableError(Exception):
    """Transient failure that is safe to retry."""

def handle_execution_result(result):
    """Handle execution result based on error type."""
    if result.get("success"):
        return result["output"]

    # The error may be a plain string or a {"type", "message"} object
    error = result.get("error", {})
    error_type = error.get("type") if isinstance(error, dict) else None

    # Permanent errors - don't retry
    if error_type in ["syntax_error", "import_error", "name_error"]:
        raise ValueError(f"Code error: {error}")

    # Resource errors - might need adjustment
    if error_type in ["oom", "timeout"]:
        raise ResourceError(f"Resource limit exceeded: {error}")

    # Transient errors - safe to retry
    if error_type in ["service_unavailable", "network_error"]:
        raise RetryableError(f"Temporary failure: {error}")

    # Unknown error
    raise RuntimeError(f"Execution failed: {error}")

Memory Errors (OOM)

When code exceeds memory limits:

{
  "success": false,
  "exit_code": 137,
  "error": {
    "type": "oom",
    "message": "Execution exceeded memory limit"
  }
}

Exit code 137 is 128 + 9 (SIGKILL). When code exceeds the memory limit, the kernel's OOM killer terminates the process with this signal.

Validating Input

Prevent errors by validating before execution:

import json

def validate_request(code, input_data, timeout=None):
    """Validate execution request. Returns a list of error messages."""
    errors = []

    # Check code
    if not code or not code.strip():
        errors.append("Code cannot be empty")
    elif len(code.encode("utf-8")) > 100_000:  # 100KB limit, measured in bytes
        errors.append("Code exceeds size limit")

    # Check input
    try:
        json.dumps(input_data)
    except (TypeError, ValueError) as e:
        errors.append(f"Input not JSON serializable: {e}")

    # Check timeout
    if timeout is not None:
        if timeout < 1 or timeout > 300:
            errors.append("Timeout must be between 1 and 300 seconds")

    return errors

# Usage (execute_safely is defined under "Handling Errors in Python")
errors = validate_request(code, input_data)
if errors:
    print(f"Validation failed: {errors}")
else:
    result = execute_safely(code, input_data)

Retrying Failed Jobs

import time

def execute_with_retry(code, input_data, max_retries=3, **kwargs):
    """Execute with automatic retry on transient failures."""

    # Error types that are safe to retry
    RETRYABLE_ERRORS = {"service_unavailable", "network_error", "network_timeout"}
    # Error types that should not be retried
    PERMANENT_ERRORS = {"syntax_error", "import_error", "name_error", "type_error"}

    for attempt in range(max_retries):
        result = execute_safely(code, input_data, **kwargs)

        if result["success"]:
            return result

        # Check error type. HTTP-level failures from execute_safely carry a
        # plain string error, so they have no type and are not retried here.
        error = result.get("error", {})
        error_type = error.get("type") if isinstance(error, dict) else None

        # Don't retry permanent failures
        if error_type in PERMANENT_ERRORS:
            return result

        # Only retry transient failures
        if error_type not in RETRYABLE_ERRORS:
            return result

        # Retry with exponential backoff
        if attempt < max_retries - 1:
            wait = 2 ** attempt  # 1s, then 2s with the default max_retries=3
            print(f"Retrying in {wait}s (attempt {attempt + 2}/{max_retries})...")
            time.sleep(wait)

    return result

Monitoring Error Rates

Use the health and DLQ endpoints to monitor system health:

import requests

def check_system_health():
    """Check Tako VM system health. Returns a list of issue descriptions."""
    # Set an HTTP timeout so a hung server cannot block the check
    health = requests.get("http://localhost:8000/health", timeout=5).json()
    dlq = requests.get("http://localhost:8000/dlq/stats", timeout=5).json()

    issues = []

    if health["status"] != "healthy":
        issues.append(f"System degraded: {health['circuit_breaker']['state']}")

    if health["circuit_breaker"]["state"] == "open":
        issues.append("Circuit breaker open - Docker issues")

    if dlq["total"] > 10:
        issues.append(f"High DLQ count: {dlq['total']} failed jobs")

    return issues
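
The health check above reports system state rather than a rate. For an actual error rate, a client can aggregate the responses it has collected. The sketch below assumes responses in the formats shown in this guide; `error_rates` is a hypothetical helper, and string-valued errors are counted as "unclassified":

```python
from collections import Counter


def error_rates(results):
    """Return (failure_rate, failures_by_type) for a list of response dicts."""
    failures = [r for r in results if not r.get("success")]
    by_type = Counter()
    for r in failures:
        error = r.get("error")
        # Structured errors carry a type; plain strings do not.
        error_type = error.get("type") if isinstance(error, dict) else None
        by_type[error_type or "unclassified"] += 1
    rate = len(failures) / len(results) if results else 0.0
    return rate, dict(by_type)


sample = [
    {"success": True},
    {"success": False, "error": {"type": "oom", "message": "Execution exceeded memory limit"}},
    {"success": False, "error": "ZeroDivisionError: division by zero"},
    {"success": True},
]
print(error_rates(sample))
```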

Debugging Tips

Check stdout/stderr - Often contains stack traces
Use print statements - Debug output is captured
Test locally first - Run code outside Tako VM
Check environment - Ensure packages are available
Review limits - Memory, CPU, timeout settings
Check correlation ID - Trace requests through logs
Monitor circuit breaker - Check Docker health
Review DLQ - Investigate recurring failures

Next Steps

Async Jobs - Long-running task patterns
Environments - Configure job types with dependencies
Deployment - Deploy to production