# Retries & Error Handling

Automatic retry policies with exponential backoff
Steps fail. APIs return 500s, LLM providers have outages, rate limits kick in. Stevora handles this with automatic retries using exponential backoff and jitter, so transient failures resolve without manual intervention.
## The `RetryPolicy` Interface
Every step can declare a retry policy. The engine resolves it to this interface:
```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}
```

| Field | Description | Default |
|---|---|---|
| `maxAttempts` | Total number of attempts (including the first). A value of 3 means 1 initial try + 2 retries. | 3 |
| `backoffMs` | Base delay before the first retry, in milliseconds. | 1000 |
| `backoffMultiplier` | Multiplier applied to the delay after each attempt. | 2 |
| `maxBackoffMs` | Upper bound on the delay. No retry will wait longer than this. | 300000 (5 min) |
## Default Policy
If a step does not specify a retry policy, the engine applies a sensible default:
```typescript
const DEFAULT_RETRY_POLICY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  backoffMultiplier: 2,
  maxBackoffMs: 300_000,
};
```

## Configuring Retries Per Step
Set the `retry` field on any step definition:
```json
{
  "type": "task",
  "name": "call-external-api",
  "handler": "callExternalApi",
  "retry": {
    "maxAttempts": 5,
    "backoffMs": 2000,
    "backoffMultiplier": 3
  }
}
```

Any field you omit falls back to the default. The schema validates the bounds:
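Merging a step's partial policy over the defaults amounts to an object spread. Here is a minimal sketch, assuming a hypothetical `resolveRetryPolicy` helper (not necessarily the engine's actual function name):

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

const DEFAULT_RETRY_POLICY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  backoffMultiplier: 2,
  maxBackoffMs: 300_000,
};

// Hypothetical helper: a step's partial policy wins, omitted fields fall back.
function resolveRetryPolicy(partial?: Partial<RetryPolicy>): RetryPolicy {
  return { ...DEFAULT_RETRY_POLICY, ...partial };
}

const resolved = resolveRetryPolicy({ maxAttempts: 5, backoffMs: 2000, backoffMultiplier: 3 });
console.log(resolved.maxAttempts);  // 5 (from the step)
console.log(resolved.maxBackoffMs); // 300000 (default)
```

A step with no `retry` field at all resolves to `DEFAULT_RETRY_POLICY` unchanged.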
```typescript
const retryPolicySchema = z.object({
  maxAttempts: z.number().int().min(1).max(20).default(3),
  backoffMs: z.number().int().min(100).default(1000),
  backoffMultiplier: z.number().min(1).max(10).default(2),
}).optional();
```

## Exponential Backoff with Jitter
The delay between retries is computed with exponential backoff, capped at maxBackoffMs, and randomized with jitter to avoid thundering-herd problems when many workflows retry simultaneously.
```typescript
function computeNextRetryDelay(policy: RetryPolicy, attemptCount: number): number {
  const raw = policy.backoffMs * Math.pow(policy.backoffMultiplier, attemptCount - 1);
  const capped = Math.min(raw, policy.maxBackoffMs);
  return Math.floor(capped * (0.75 + Math.random() * 0.25));
}
```

The jitter factor `(0.75 + Math.random() * 0.25)` means the actual delay is between 75% and 100% of the computed value. This spreads out retries without straying too far from the target.
### Example: Default Policy Delays

With the default policy (`backoffMs: 1000`, `backoffMultiplier: 2`):
| Attempt | Base Delay | Range (with jitter) |
|---|---|---|
| 1st retry (attempt 2) | 1,000 ms | 750 ms -- 1,000 ms |
| 2nd retry (attempt 3) | 2,000 ms | 1,500 ms -- 2,000 ms |
| 3rd retry (attempt 4) | 4,000 ms | 3,000 ms -- 4,000 ms |
| 4th retry (attempt 5) | 8,000 ms | 6,000 ms -- 8,000 ms |
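Those ranges follow directly from `computeNextRetryDelay`. The sketch below reproduces the function against the default backoff values, assuming `attemptCount` counts the attempts that have already failed (so the 1st retry passes `attemptCount = 1`, which matches the table):

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

const policy: RetryPolicy = { maxAttempts: 5, backoffMs: 1000, backoffMultiplier: 2, maxBackoffMs: 300_000 };

function computeNextRetryDelay(policy: RetryPolicy, attemptCount: number): number {
  const raw = policy.backoffMs * Math.pow(policy.backoffMultiplier, attemptCount - 1);
  const capped = Math.min(raw, policy.maxBackoffMs);
  return Math.floor(capped * (0.75 + Math.random() * 0.25));
}

// Base delays double each retry: 1000, 2000, 4000, 8000 ms.
// Jitter shaves up to 25% off each, giving the ranges in the table.
for (let attemptCount = 1; attemptCount <= 4; attemptCount++) {
  console.log(`retry ${attemptCount}: ${computeNextRetryDelay(policy, attemptCount)} ms`);
}
```

Note that the cap matters for long runs: with these defaults, the 10th retry's uncapped base delay would be 512,000 ms, so `maxBackoffMs` clamps it to 300,000 ms before jitter is applied.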
## The `shouldRetry` Decision
Before scheduling a retry, the engine checks whether any attempts remain:
```typescript
function shouldRetry(policy: RetryPolicy, attemptCount: number): boolean {
  return attemptCount < policy.maxAttempts;
}
```

If the current attempt count has reached `maxAttempts`, no retry is scheduled and the workflow transitions to `FAILED`.
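Putting `shouldRetry` and `computeNextRetryDelay` together, the engine's attempt loop can be approximated as follows. `executeWithRetry` is an illustrative name, not part of the engine's API, and the real engine schedules delayed jobs rather than sleeping in-process:

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

function shouldRetry(policy: RetryPolicy, attemptCount: number): boolean {
  return attemptCount < policy.maxAttempts;
}

function computeNextRetryDelay(policy: RetryPolicy, attemptCount: number): number {
  const raw = policy.backoffMs * Math.pow(policy.backoffMultiplier, attemptCount - 1);
  const capped = Math.min(raw, policy.maxBackoffMs);
  return Math.floor(capped * (0.75 + Math.random() * 0.25));
}

// Run a step, waiting between attempts, until it succeeds or retries run out.
async function executeWithRetry<T>(policy: RetryPolicy, step: () => Promise<T>): Promise<T> {
  let attemptCount = 0;
  for (;;) {
    attemptCount++;
    try {
      return await step();
    } catch (err) {
      // No attempts left: surface the error (the engine would mark FAILED here).
      if (!shouldRetry(policy, attemptCount)) throw err;
      const delay = computeNextRetryDelay(policy, attemptCount);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Under `maxAttempts: 3`, a step that fails twice succeeds on its third attempt; a step that fails three times exhausts the policy and the last error propagates.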
## Step Failure Flow
When a step returns a failed result, the engine follows this decision tree:
```text
Step returns { status: 'failed' }
              │
              v
shouldRetry(policy, attemptCount)?
              │
         ┌────┴────┐
        Yes        No
         │          │
         v          v
   Schedule      Mark step FAILED
   delayed       Mark workflow FAILED
   retry job     Emit STEP_FAILED +
         │       WORKFLOW_FAILED events
         v       Dispatch webhook
   Emit STEP_FAILED +
   RETRY_SCHEDULED events
   Workflow → WAITING
         │
         v
   After delay, job fires
   Step re-executes from attempt N+1
```

## Retry Scheduling
When a retry is scheduled, the engine:
- Marks the step as `FAILED` with the error details.
- Computes the next retry delay and stores `nextRetryAt` on the step run.
- Transitions the workflow to `WAITING`.
- Emits `STEP_FAILED` and `RETRY_SCHEDULED` events.
- Enqueues a delayed job via BullMQ:
```typescript
await enqueueWorkflowJob(
  {
    workflowRunId: run.id,
    workspaceId: run.workspaceId,
    action: 'retry_step',
    stepName: stepDef.name,
  },
  { delay: delayMs, jobId: `retry-${run.id}-${stepDef.name}-${currentAttempt}` },
);
```

The unique `jobId` ensures the same retry is not enqueued twice.
## Terminal Failure

When all retries are exhausted, the engine calls `transitionWorkflowToFailed`:
```typescript
await transitionWorkflowToFailed(
  run,
  `Step '${stepDef.name}' failed after ${currentAttempt} attempts`,
  stepRun.id,
  result.error,
);
```

This marks the workflow as `FAILED`, records the error, emits events, and dispatches a `workflow.failed` webhook.
## Manual Retry from `FAILED`
A failed workflow can be retried through the API. This resets the failed step and re-enqueues it:
```text
POST /api/v1/workflow-runs/:id/retry
```

The engine finds the failed step, resets its status to `PENDING` and its attempt count to 0, transitions the workflow back to `PENDING`, and enqueues the step for execution. The retry policy starts fresh.
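The reset can be sketched as a pure function over a step-run record. The field names below are illustrative, not the engine's actual schema:

```typescript
type StepStatus = "PENDING" | "RUNNING" | "FAILED" | "COMPLETED";

interface StepRunRecord {
  stepName: string;
  status: StepStatus;
  attemptCount: number;
  error?: string;
}

// Reset a failed step so the retry policy starts fresh from attempt 1.
function resetFailedStep(step: StepRunRecord): StepRunRecord {
  if (step.status !== "FAILED") {
    throw new Error(`Cannot retry step '${step.stepName}' in status ${step.status}`);
  }
  return { ...step, status: "PENDING", attemptCount: 0, error: undefined };
}

const failed: StepRunRecord = {
  stepName: "call-external-api",
  status: "FAILED",
  attemptCount: 3,
  error: "HTTP 500",
};

const reset = resetFailedStep(failed);
console.log(reset.status, reset.attemptCount); // PENDING 0
```

Guarding on `status === "FAILED"` keeps the endpoint idempotent-ish: retrying a run whose step is already `PENDING` or `RUNNING` is rejected rather than double-enqueued.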