Resilience Module

The resilience module (src/lib/core/resilience.ts) provides retry with exponential backoff and per-key circuit breakers for external API calls. It protects the gateway from cascading failures when upstream providers are slow or unavailable.

Why

  • Automatic retry: Transient failures (timeouts, 5xx errors) are retried with exponential backoff.
  • Circuit breaker: Repeated failures trip the circuit, and requests are rejected immediately until the reset timeout elapses and a probe request succeeds.
  • Per-key isolation: Each provider/tenant combination has its own circuit breaker state.
  • Smart error classification: Authentication errors (401, 403) and other client errors (400, 404, 409, 422) are never retried.

Usage

typescript
import { withResilience } from '@/lib/core/resilience';

const result = await withResilience(
  () => providerRuntime.chat(messages),
  { key: 'openai:tenant_acme' }
);

With Configuration Overrides

typescript
const result = await withResilience(
  () => embeddingProvider.embed(text),
  {
    key: 'voyage:tenant_acme',
    retry: { maxAttempts: 5, initialDelayMs: 500 },
    circuitBreaker: { threshold: 10 },
  }
);

API Reference

withResilience<T>(operation, options): Promise<T>

Execute an async operation with retry and circuit breaker protection.

typescript
interface ResilienceOptions {
  key: string;                              // Circuit breaker key
  retry?: Partial<RetryConfig>;             // Override retry settings
  circuitBreaker?: Partial<CircuitBreakerConfig>; // Override CB settings
}

getCircuitState(key): CircuitBreakerState | undefined

Get the current circuit breaker state for monitoring.

getAllCircuitStates(): Map<string, CircuitBreakerState>

Get all circuit breaker states (for health/metrics endpoints).

resetCircuit(key): void

Manually reset a specific circuit breaker (admin recovery action).

resetAllCircuits(): void

Reset all circuit breakers.
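
Because getAllCircuitStates() returns a Map, passing it straight to JSON.stringify yields "{}". The sketch below shows one way a health endpoint might flatten the map first. The SketchCircuitState shape is a stand-in invented for illustration (the real CircuitBreakerState fields are not shown here), and toHealthPayload is a hypothetical helper, not part of the module.

```typescript
// Stand-in for the module's CircuitBreakerState; the real fields may differ.
interface SketchCircuitState {
  state: 'closed' | 'open' | 'half-open';
  failures: number;
}

// Hypothetical helper: turn the Map into a plain, JSON-friendly object
// and list the keys whose circuit is currently open.
function toHealthPayload(states: Map<string, SketchCircuitState>) {
  const circuits: Record<string, SketchCircuitState> = {};
  const open: string[] = [];
  states.forEach((value, key) => {
    circuits[key] = value;
    if (value.state === 'open') open.push(key);
  });
  return { healthy: open.length === 0, open, circuits };
}
```

In a real handler the argument would come from getAllCircuitStates(); whether an open circuit should mark the whole gateway unhealthy is a policy choice this sketch does not settle.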

Retry Behavior

Attempt 1 → fails → wait 200ms (± jitter)
Attempt 2 → fails → wait 400ms (± jitter)
Attempt 3 → fails → throw last error + record circuit failure

Retry Configuration

| Setting | Env Variable | Default |
| --- | --- | --- |
| Enabled | GATEWAY_RETRY_ENABLED | true |
| Max attempts | GATEWAY_RETRY_MAX_ATTEMPTS | 3 |
| Initial delay | GATEWAY_RETRY_INITIAL_DELAY_MS | 200 |
| Max delay cap | (none) | 5000ms |
| Jitter factor | (none) | 0.25 (±25%) |
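
To make the schedule concrete, here is a minimal sketch of how a delay could be derived from these settings. The names (BackoffConfig, backoffDelayMs) are hypothetical, and applying the cap before the jitter is an assumption; the module's actual arithmetic may differ.

```typescript
// Hypothetical config shape; defaults mirror the table above.
interface BackoffConfig {
  initialDelayMs: number; // wait after the first failed attempt
  maxDelayMs: number;     // cap on the un-jittered delay
  jitterFactor: number;   // 0.25 means ±25%
}

const sketchDefaults: BackoffConfig = {
  initialDelayMs: 200,
  maxDelayMs: 5000,
  jitterFactor: 0.25,
};

// Delay to wait after the given failed attempt (1-based).
// `random` is injectable so the jitter can be pinned in tests.
function backoffDelayMs(
  failedAttempt: number,
  config: BackoffConfig = sketchDefaults,
  random: () => number = Math.random,
): number {
  // Double the delay for each additional failed attempt: 200, 400, 800, ...
  const exponential = config.initialDelayMs * 2 ** (failedAttempt - 1);
  // Assumption: the cap is applied before jitter.
  const capped = Math.min(exponential, config.maxDelayMs);
  // Map random() in [0, 1) to a multiplier in [1 - jitter, 1 + jitter).
  const multiplier = 1 + (random() * 2 - 1) * config.jitterFactor;
  return Math.round(capped * multiplier);
}
```

With the defaults, the un-jittered delay reaches the 5000ms cap after the sixth failed attempt (200 × 2⁵ = 6400), which only matters when maxAttempts is raised above its default of 3.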

Non-Retryable Errors

These HTTP status codes are never retried:

| Status | Reason |
| --- | --- |
| 400 | Bad request; fix the input |
| 401 | Unauthorized; credentials are wrong |
| 403 | Forbidden; access denied |
| 404 | Not found; resource doesn't exist |
| 409 | Conflict; duplicate resource |
| 422 | Unprocessable; validation failure |

Error messages containing unauthorized, forbidden, or api key also skip retry.
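
The classification rules above could be expressed roughly as follows. This is an illustrative sketch: the MaybeHttpError shape, the isRetryable name, and the case-insensitive message match are all assumptions rather than the module's confirmed behavior.

```typescript
// Status codes listed above as never retried.
const NON_RETRYABLE_STATUSES = new Set([400, 401, 403, 404, 409, 422]);
// Message fragments listed above as skipping retry.
const NON_RETRYABLE_FRAGMENTS = ['unauthorized', 'forbidden', 'api key'];

// Hypothetical error shape: many HTTP clients attach a numeric `status`.
interface MaybeHttpError {
  status?: number;
  message?: string;
}

function isRetryable(error: MaybeHttpError): boolean {
  if (error.status !== undefined && NON_RETRYABLE_STATUSES.has(error.status)) {
    return false;
  }
  // Assumption: the message check ignores case.
  const message = (error.message ?? '').toLowerCase();
  return !NON_RETRYABLE_FRAGMENTS.some((fragment) => message.includes(fragment));
}
```

Under these rules a 503 or a network timeout is retried, while a 401, or any error whose message mentions an API key, fails immediately.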

Circuit Breaker

The circuit breaker follows the standard three-state pattern:

            failure threshold reached
  CLOSED ─────────────────────────────► OPEN
    ▲                                  │  ▲
    │ probe succeeds     reset timeout │  │ probe fails
    │                                  ▼  │
    └────────────────────────────── HALF-OPEN

States

| State | Behavior |
| --- | --- |
| Closed | Normal operation. Failures are counted. |
| Open | All requests are rejected immediately with CircuitOpenError. Timer is running. |
| Half-open | After reset timeout, one probe request is allowed. Success → Closed, Failure → Open. |
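
The three states can be sketched as a small state machine. SketchCircuitBreaker is a hypothetical class written for illustration; it assumes that failures are counted consecutively (a success resets the count) and that the clock is injectable, neither of which is confirmed for the real module.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

class SketchCircuitBreaker {
  private state: CircuitState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly resetMs: number;
  private readonly now: () => number;

  constructor(threshold = 5, resetMs = 30000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.now = now;
  }

  // True if a request may proceed. After the reset timeout, the first
  // caller moves the circuit to half-open and becomes the single probe.
  canRequest(): boolean {
    if (this.state === 'closed') return true;
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetMs) {
      this.state = 'half-open';
      return true;
    }
    return false; // open with timer running, or a probe is already in flight
  }

  recordSuccess(): void {
    this.state = 'closed';
    this.failures = 0; // assumption: only consecutive failures count
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): CircuitState {
    return this.state;
  }
}
```

A caller would check canRequest() before each operation, throw something like CircuitOpenError when it returns false, and report the outcome through recordSuccess() or recordFailure().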

Circuit Breaker Configuration

| Setting | Env Variable | Default |
| --- | --- | --- |
| Enabled | GATEWAY_CIRCUIT_BREAKER_ENABLED | true |
| Failure threshold | GATEWAY_CIRCUIT_BREAKER_THRESHOLD | 5 |
| Reset timeout | GATEWAY_CIRCUIT_BREAKER_RESET_MS | 30000 (30s) |
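
As an illustration of how these variables might be read with their defaults, consider the sketch below. The loader and its parsing rules (only the literal "false" disables the breaker; non-positive or unparsable numbers fall back to the default) are assumptions, not the module's documented behavior.

```typescript
interface CircuitBreakerSettings {
  enabled: boolean;
  threshold: number;
  resetMs: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer, falling back when the value is missing or invalid.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical loader; pass `process.env` in a Node.js runtime.
function loadCircuitBreakerSettings(env: Env): CircuitBreakerSettings {
  const enabledRaw = env['GATEWAY_CIRCUIT_BREAKER_ENABLED'] ?? 'true';
  return {
    // Assumption: anything other than "false" leaves the breaker enabled.
    enabled: enabledRaw.toLowerCase() !== 'false',
    threshold: readPositiveInt(env, 'GATEWAY_CIRCUIT_BREAKER_THRESHOLD', 5),
    resetMs: readPositiveInt(env, 'GATEWAY_CIRCUIT_BREAKER_RESET_MS', 30000),
  };
}
```

Per-call overrides passed through the circuitBreaker option of withResilience would then presumably take precedence over these process-wide values.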

Error Handling

typescript
import { NextResponse } from 'next/server';
import { withResilience, CircuitOpenError } from '@/lib/core/resilience';

// Inside a route handler; `fn` is the async provider call being protected.
try {
  const result = await withResilience(fn, { key: 'provider:tenant' });
  return NextResponse.json(result);
} catch (error) {
  if (error instanceof CircuitOpenError) {
    // The circuit is open: the provider is temporarily unavailable.
    return NextResponse.json({ error: 'Service temporarily unavailable' }, { status: 503 });
  }
  // Any other error: non-retryable, or all retries were exhausted.
  throw error;
}

Where It's Used

The gateway wraps these external calls with resilience:

  • LLM chat completions (per provider + tenant)
  • Embedding requests
  • Vector store operations (upsert, query, delete)
  • File storage operations

Rules

Mandatory: Wrap all external provider calls with withResilience(). Use a descriptive key that includes the provider and/or tenant for proper circuit isolation.
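
One way to keep keys consistent across call sites is a small builder. The circuitKey helper below is hypothetical; the provider:tenant format simply follows the examples earlier on this page.

```typescript
// Hypothetical helper: build a circuit key from a provider name and an
// optional tenant identifier, joined with ':' as in the usage examples.
function circuitKey(provider: string, tenant?: string): string {
  return tenant ? `${provider}:${tenant}` : provider;
}
```

For example, circuitKey('openai', 'tenant_acme') produces the key used in the first usage example, so a failing tenant-specific credential would trip only that tenant's circuit.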

Community edition is AGPL-3.0. Commercial licensing and support are available separately.