Retries & Error Handling

Automatic retry policies with exponential backoff

Steps fail. APIs return 500s, LLM providers have outages, rate limits kick in. Stevora handles this with automatic retries using exponential backoff and jitter, so transient failures resolve without manual intervention.

The RetryPolicy Interface

Every step can declare a retry policy. The engine resolves it to this interface:

interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}
| Field | Description | Default |
| --- | --- | --- |
| maxAttempts | Total number of attempts (including the first). A value of 3 means 1 initial try + 2 retries. | 3 |
| backoffMs | Base delay before the first retry, in milliseconds. | 1000 |
| backoffMultiplier | Multiplier applied to the delay after each attempt. | 2 |
| maxBackoffMs | Upper bound on the delay. No retry will wait longer than this. | 300000 (5 min) |

Default Policy

If a step does not specify a retry policy, the engine applies a sensible default:

const DEFAULT_RETRY_POLICY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  backoffMultiplier: 2,
  maxBackoffMs: 300_000,
};

Configuring Retries Per Step

Set the retry field on any step definition:

{
  "type": "task",
  "name": "call-external-api",
  "handler": "callExternalApi",
  "retry": {
    "maxAttempts": 5,
    "backoffMs": 2000,
    "backoffMultiplier": 3
  }
}

Any field you omit falls back to the default. The schema validates the bounds:

const retryPolicySchema = z.object({
  maxAttempts: z.number().int().min(1).max(20).default(3),
  backoffMs: z.number().int().min(100).default(1000),
  backoffMultiplier: z.number().min(1).max(10).default(2),
}).optional();
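The fallback can be pictured as a shallow merge over the defaults. `resolveRetryPolicy` below is an illustrative name, not the engine's actual API:

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

const DEFAULT_RETRY_POLICY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  backoffMultiplier: 2,
  maxBackoffMs: 300_000,
};

// Hypothetical helper illustrating the fallback: a step's partial
// policy is spread over the defaults, so omitted fields keep them.
function resolveRetryPolicy(partial?: Partial<RetryPolicy>): RetryPolicy {
  return { ...DEFAULT_RETRY_POLICY, ...partial };
}

// A step that only sets maxAttempts keeps every other default.
const resolved = resolveRetryPolicy({ maxAttempts: 5 });
```

In production the schema's `.default()` values do this merge during validation; the spread above is just the same idea in miniature.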

Exponential Backoff with Jitter

The delay between retries is computed with exponential backoff, capped at maxBackoffMs, and randomized with jitter to avoid thundering-herd problems when many workflows retry simultaneously.

function computeNextRetryDelay(policy: RetryPolicy, attemptCount: number): number {
  const raw = policy.backoffMs * Math.pow(policy.backoffMultiplier, attemptCount - 1);
  const capped = Math.min(raw, policy.maxBackoffMs);
  return Math.floor(capped * (0.75 + Math.random() * 0.25));
}

The jitter factor (0.75 + Math.random() * 0.25) means the actual delay is between 75% and 100% of the computed value. This spreads out retries without straying too far from the target.

Example: Default Policy Delays

With the default policy (backoffMs: 1000, backoffMultiplier: 2):

| Attempt | Base Delay | Range (with jitter) |
| --- | --- | --- |
| 1st retry (attempt 2) | 1,000 ms | 750–1,000 ms |
| 2nd retry (attempt 3) | 2,000 ms | 1,500–2,000 ms |
| 3rd retry (attempt 4) | 4,000 ms | 3,000–4,000 ms |
| 4th retry (attempt 5) | 8,000 ms | 6,000–8,000 ms |
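One way to sanity-check these ranges is to pin the jitter factor to its extremes. `retryDelayRange` is a hypothetical helper, mirroring computeNextRetryDelay with `Math.random()` replaced by 0 and 1:

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

// Hypothetical helper: the [min, max] delay window for the nth retry,
// i.e. computeNextRetryDelay with the jitter factor pinned to 0.75 and 1.0.
function retryDelayRange(policy: RetryPolicy, retryNumber: number): [number, number] {
  const raw = policy.backoffMs * Math.pow(policy.backoffMultiplier, retryNumber - 1);
  const capped = Math.min(raw, policy.maxBackoffMs);
  return [Math.floor(capped * 0.75), capped];
}

const policy: RetryPolicy = {
  maxAttempts: 5,
  backoffMs: 1000,
  backoffMultiplier: 2,
  maxBackoffMs: 300_000,
};
```

`retryDelayRange(policy, 1)` yields `[750, 1000]`, matching the first row of the table.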

The shouldRetry Decision

Before scheduling a retry, the engine checks whether any attempts remain:

function shouldRetry(policy: RetryPolicy, attemptCount: number): boolean {
  return attemptCount < policy.maxAttempts;
}

If the current attempt count has reached maxAttempts, no retry is scheduled and the workflow transitions to FAILED.

Step Failure Flow

When a step returns a failed result, the engine follows this decision tree:

Step returns { status: 'failed' }
              │
              v
  shouldRetry(policy, attemptCount)?
         ┌────┴────┐
        Yes        No
         │          │
         v          v
  Schedule        Mark step FAILED
  delayed         Mark workflow FAILED
  retry job       Emit STEP_FAILED +
         │        WORKFLOW_FAILED events
         v        Dispatch webhook
  Emit STEP_FAILED +
  RETRY_SCHEDULED events
  Workflow → WAITING
         │
         v
  After delay, job fires
  Step re-executes from
  attempt N+1
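The decision tree condenses into a single branch on shouldRetry. `handleStepFailure` and its return labels are illustrative stand-ins for the engine internals, not its real API:

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

function shouldRetry(policy: RetryPolicy, attemptCount: number): boolean {
  return attemptCount < policy.maxAttempts;
}

// Illustrative sketch of the failure branch.
function handleStepFailure(
  policy: RetryPolicy,
  attemptCount: number,
): 'retry_scheduled' | 'workflow_failed' {
  if (shouldRetry(policy, attemptCount)) {
    // Enqueue a delayed retry job, emit STEP_FAILED + RETRY_SCHEDULED,
    // and move the workflow to WAITING.
    return 'retry_scheduled';
  }
  // Out of attempts: mark step and workflow FAILED, emit events,
  // and dispatch the workflow.failed webhook.
  return 'workflow_failed';
}

const policy: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  backoffMultiplier: 2,
  maxBackoffMs: 300_000,
};
```

With the default policy, a failure on attempt 1 or 2 schedules a retry; a failure on attempt 3 is terminal.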

Retry Scheduling

When a retry is scheduled, the engine:

  1. Marks the step as FAILED with the error details.
  2. Computes the next retry delay and stores nextRetryAt on the step run.
  3. Transitions the workflow to WAITING.
  4. Emits STEP_FAILED and RETRY_SCHEDULED events.
  5. Enqueues a delayed job via BullMQ:
await enqueueWorkflowJob(
  {
    workflowRunId: run.id,
    workspaceId: run.workspaceId,
    action: 'retry_step',
    stepName: stepDef.name,
  },
  { delay: delayMs, jobId: `retry-${run.id}-${stepDef.name}-${currentAttempt}` },
);

The unique jobId ensures the same retry is not enqueued twice.
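That guarantee comes from BullMQ treating a second add with an already-known jobId as a no-op. The effect can be sketched with a plain set; `enqueueOnce` is a hypothetical illustration, not engine code:

```typescript
const scheduledJobIds = new Set<string>();

// Hypothetical sketch of enqueue-by-jobId idempotency:
// a second call with the same jobId is silently ignored.
function enqueueOnce(jobId: string, enqueue: () => void): boolean {
  if (scheduledJobIds.has(jobId)) return false;
  scheduledJobIds.add(jobId);
  enqueue();
  return true;
}

// Same run, step, and attempt number → same jobId → one job.
const jobId = `retry-run_123-call-external-api-2`;
```

Because the attempt number is part of the key, a later attempt of the same step still gets its own fresh jobId.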

Terminal Failure

When all retries are exhausted, the engine calls transitionWorkflowToFailed:

await transitionWorkflowToFailed(
  run,
  `Step '${stepDef.name}' failed after ${currentAttempt} attempts`,
  stepRun.id,
  result.error,
);

This marks the workflow as FAILED, records the error, emits events, and dispatches a workflow.failed webhook.

Manual Retry from FAILED

A failed workflow can be retried through the API. This resets the failed step and re-enqueues it:

POST /api/v1/workflow-runs/:id/retry

The engine finds the failed step, resets its status to PENDING and its attempt count to 0, transitions the workflow back to PENDING, and enqueues the step for execution. The retry policy starts fresh.
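The reset can be sketched as a pure function over the run state. The field names below are illustrative, not the actual schema:

```typescript
type Status = 'PENDING' | 'RUNNING' | 'SUCCEEDED' | 'FAILED' | 'WAITING';

interface StepRun {
  name: string;
  status: Status;
  attemptCount: number;
}

interface WorkflowRun {
  status: Status;
  steps: StepRun[];
}

// Hypothetical sketch of the manual-retry reset: the failed step goes
// back to PENDING with a fresh attempt count, and so does the workflow.
function resetForManualRetry(run: WorkflowRun): WorkflowRun {
  return {
    ...run,
    status: 'PENDING',
    steps: run.steps.map((s) =>
      s.status === 'FAILED' ? { ...s, status: 'PENDING', attemptCount: 0 } : s,
    ),
  };
}

const failedRun: WorkflowRun = {
  status: 'FAILED',
  steps: [{ name: 'call-external-api', status: 'FAILED', attemptCount: 3 }],
};
const reset = resetForManualRetry(failedRun);
```

Resetting `attemptCount` to 0 is what makes the retry policy start fresh: the next failure is again "attempt 1" as far as shouldRetry is concerned.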