
Building Durable Workflows: Surviving Crashes and Deploys

A durable workflow is a sequence of steps that completes exactly once, even if the process executing it crashes, the server restarts, or a deployment rolls out mid-execution. This sounds like a niche concern until you inventory the workflows running in any production system: payment processing, order fulfillment, user onboarding sequences, data pipeline orchestration, email drip campaigns. Every one of these breaks catastrophically when a step executes twice or fails to execute at all because the process died between steps. At Harbor Software, we migrated our critical workflows to durable execution over the past year, and the reliability improvement was dramatic — our “stuck workflow” incidents dropped from 12 per month to zero.

HARBOR SOFTWARE · Engineering Insights

The Problem with Regular Code

Consider a simple order fulfillment workflow: charge the customer’s card, reserve inventory, send a confirmation email, and schedule shipping. In regular code, this looks straightforward:

async function fulfillOrder(orderId: string) {
  const order = await getOrder(orderId);
  const charge = await chargeCard(order.customerId, order.total);
  await reserveInventory(order.items);
  await sendConfirmationEmail(order.customerId, order.id);
  await scheduleShipping(order.id, order.items);
  await markOrderComplete(order.id);
}

This code has four failure modes that will burn you in production:

  1. Process crash after charging but before reserving inventory. The customer is charged, but their order is not fulfilled. Manual intervention required.
  2. Deploy during inventory reservation. The old process is killed. Did reserveInventory complete? You do not know. Running the workflow again might double-reserve.
  3. Email service timeout. sendConfirmationEmail times out after 30 seconds. Did the email actually send? The SMTP server might have accepted it before your client timed out. Retrying might send a duplicate email.
  4. Shipping API returns 500. You retry the entire workflow, which charges the customer again because nothing recorded that the first run’s charge already succeeded.

Every team encounters these problems eventually. The common workaround is status columns in the database: payment_status, inventory_status, email_sent, shipping_scheduled. The workflow checks each status before executing each step. This works — sort of — but it requires every workflow to implement its own state tracking, retry logic, and recovery procedures. After building this pattern manually for the fifth time, we looked for a better solution.

The manual approach also has a subtle correctness problem: the status update and the step execution are not atomic. If the process crashes after reserveInventory succeeds but before UPDATE orders SET inventory_status = 'reserved' commits, the status column says inventory is not reserved, but it actually is. When the workflow retries, it reserves inventory again. You can solve this with database transactions, but only if the step’s side effect (the API call) and the status update target the same database. If they do not — and they usually do not — you are back to the same race condition.
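To make that window concrete, here is a minimal sketch of the status-column pattern with a simulated crash between the side effect and the status write. The in-memory `inventoryService` and `ordersTable` objects and the `reserveStep` function are hypothetical stand-ins for an external service and a database table, not code from our system.

```typescript
// Hypothetical in-memory stand-ins: an external inventory service and our own orders table.
type OrderRow = { inventoryStatus: 'pending' | 'reserved' };

const inventoryService = { reserved: 0 };
const ordersTable: Record<string, OrderRow> = {
  'order-1': { inventoryStatus: 'pending' },
};

function reserveStep(orderId: string, qty: number, crashBeforeStatusWrite: boolean): void {
  const row = ordersTable[orderId];
  if (!row) throw new Error(`unknown order ${orderId}`);
  if (row.inventoryStatus === 'reserved') return; // status check: step already done

  inventoryService.reserved += qty;               // 1. side effect in the external system
  if (crashBeforeStatusWrite) {
    throw new Error('simulated crash');           // process dies between the two writes
  }
  row.inventoryStatus = 'reserved';               // 2. status update in our database
}

try {
  reserveStep('order-1', 2, true);  // first attempt: reserved, then crashed
} catch {
  // the workflow is restarted by whatever supervises it
}
reserveStep('order-1', 2, false);   // the retry still sees 'pending' and reserves again

console.log(inventoryService.reserved); // 4: the same two units were reserved twice
```

The status check is correct in isolation; the problem is that nothing ties write 1 and write 2 together when they live in different systems.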

Temporal: Durable Execution That Actually Works

We evaluated three durable workflow engines: Temporal, Inngest, and AWS Step Functions. We chose Temporal for our core workflows because it offers the most control over execution semantics and runs self-hosted (important for our data residency requirements). Inngest is excellent for simpler event-driven workflows and has a gentler learning curve. Step Functions is solid if you are all-in on AWS, but its JSON-based state machine definitions become unwieldy for complex workflows with branching and error handling.

Temporal’s model is elegant: your workflow code looks like regular TypeScript (or Go, Python, Java), but the Temporal server records every step’s input and output in an event history. If the process crashes, Temporal replays the workflow from its recorded history, skipping steps that already completed by returning their recorded results. The workflow function is deterministic — given the same history, it produces the same execution path.

// workflows.ts — Temporal workflow
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';
import type { Order } from './types'; // assumed location of the Order type

const {
  chargeCard,
  reserveInventory,
  sendConfirmationEmail,
  scheduleShipping,
  markOrderComplete,
  refundCharge
} = proxyActivities<typeof activities>({
  startToCloseTimeout: '30s',
  retry: {
    maximumAttempts: 3,
    initialInterval: '1s',
    backoffCoefficient: 2
  }
});

export async function fulfillOrder(orderId: string, order: Order) {
  // Step 1: Charge the card
  // If this succeeds and the process crashes, Temporal will NOT re-run it
  const charge = await chargeCard(order.customerId, order.total);

  try {
    // Step 2: Reserve inventory
    await reserveInventory(order.items);

    // Step 3: Send confirmation email
    await sendConfirmationEmail(order.customerId, orderId);

    // Step 4: Schedule shipping
    await scheduleShipping(orderId, order.items);

    // Step 5: Mark complete
    await markOrderComplete(orderId);
  } catch (err) {
    // Compensating action: refund the charge if any step after payment fails
    await refundCharge(charge.id);
    throw err;
  }
}

The critical difference: if the process crashes after chargeCard succeeds and its result has been recorded, Temporal replays the workflow and returns the recorded result of chargeCard without calling it again. A crash in the middle of an activity, before its result is recorded, still causes that activity to be retried, which is why the activity itself must be idempotent. The try/catch block implements the saga pattern: if inventory reservation, email, or shipping fails after all retries are exhausted, we refund the charge as a compensating action. This is more than error handling; it is a coordinated sequence across multiple services that drives the system back toward a consistent state when a step fails. (This listing only refunds the charge; the saga section below covers compensating every completed step.)
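The replay idea can be sketched in a few lines of plain TypeScript. This is a simplified model for illustration, not Temporal's implementation: the `history` object, `makeStep`, and the simulated crash flag are all assumptions made for the example.

```typescript
// Simplified model of an event history: step name -> recorded result.
type EventHistory = Record<string, unknown>;

let chargeCalls = 0; // counts real executions of the side effect

function makeStep(history: EventHistory) {
  return function step<T>(name: string, fn: () => T): T {
    if (name in history) {
      return history[name] as T; // replay: reuse the recorded result
    }
    const result = fn();         // first execution: run the side effect
    history[name] = result;      // and record its result
    return result;
  };
}

function fulfillOrder(history: EventHistory, crashAfterCharge: boolean): string {
  const step = makeStep(history);
  const chargeId = step('charge-card', () => {
    chargeCalls += 1;
    return 'ch_123';
  });
  if (crashAfterCharge) throw new Error('simulated worker crash');
  step('reserve-inventory', () => 'reserved');
  return `order fulfilled with charge ${chargeId}`;
}

const history: EventHistory = {};
try {
  fulfillOrder(history, true);  // first run crashes right after the charge
} catch {
  // a new worker picks the workflow up
}
const outcome = fulfillOrder(history, false); // second run replays from the same history

console.log(outcome);     // order fulfilled with charge ch_123
console.log(chargeCalls); // 1
```

The second run executes the workflow function from the top, but the charge step is answered from history, so the side effect runs only once.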

Activities: The Boundary Between Deterministic and Non-Deterministic

Temporal distinguishes between workflow code (deterministic, replayed on recovery) and activities (non-deterministic side effects like API calls, database writes, and file operations). This distinction is the core concept you need to internalize. Workflow code is the orchestration logic — it decides what happens next. Activities are the things that actually happen — the API calls, the database writes, the emails sent.

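// activities.ts: a completed activity's result is recorded and reused on replay.
// An activity that fails or times out may run again, so each one should be idempotent.
import Stripe from 'stripe';
import { Context } from '@temporalio/activity';
import { sendgrid } from './email-client';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function chargeCard(
  customerId: string,
  amount: number
): Promise<{ id: string; status: string }> {
  // Replay never re-runs a completed activity, but a retry after a timeout or
  // worker crash can. The key must be identical on every attempt: workflow ID
  // plus activity ID is stable across retries, whereas Date.now() is not.
  const { workflowExecution, activityId } = Context.current().info;
  const charge = await stripe.charges.create(
    {
      amount: Math.round(amount * 100), // dollars to cents
      currency: 'usd',
      customer: customerId
    },
    // Stripe takes the idempotency key as a request option, not a charge field
    { idempotencyKey: `charge-${workflowExecution.workflowId}-${activityId}` }
  );
  return { id: charge.id, status: charge.status };
}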

export async function sendConfirmationEmail(
  customerId: string,
  orderId: string
): Promise<void> {
  await sendgrid.send({
    to: await getCustomerEmail(customerId),
    template: 'order-confirmation',
    data: { orderId }
  });
}

export async function reserveInventory(
  items: OrderItem[]
): Promise<void> {
  for (const item of items) {
    const result = await db.execute(
      'UPDATE inventory SET reserved = reserved + $1 WHERE sku = $2 AND available >= $1',
      [item.quantity, item.sku]
    );
    if (result.rowCount === 0) {
      throw new InsufficientInventoryError(item.sku, item.quantity);
    }
  }
}

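Each activity can have its own retry policy. The workflow listing above applies one shared policy for brevity; in production we configure chargeCard with 1 retry (payment APIs are designed to be retried safely with idempotency keys, and retrying more than once risks hitting rate limits), sendConfirmationEmail with 3 retries (email delivery is inherently unreliable and services recover quickly from transient failures), and reserveInventory with 2 retries (database operations usually fail transiently from connection pool exhaustion or brief network blips). Note that Temporal’s maximumAttempts counts the first attempt, so 1 retry corresponds to maximumAttempts: 2. Retrying reserveInventory is only safe if a partially applied reservation cannot be applied twice, for example by wrapping the per-item updates in one transaction and recording the order ID so that a repeated call is a no-op.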

The retry policy is configured at the proxyActivities level in the workflow, not in the activity itself. This is important because it separates the retry policy (an orchestration concern) from the implementation (a business logic concern). Different workflows can use the same activity with different retry policies depending on their reliability requirements.
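To see what a policy translates to in practice, the sketch below computes the waits between attempts from the three fields used in the workflow listing. The `RetryPolicy` interface and `retryDelays` function are illustrative names, intervals are expressed in milliseconds rather than Temporal's duration strings, and any cap on the maximum interval is ignored.

```typescript
interface RetryPolicy {
  maximumAttempts: number;    // total attempts, including the first one
  initialIntervalMs: number;  // wait before the first retry
  backoffCoefficient: number; // multiplier applied to each subsequent wait
}

// Waits (in ms) before attempt 2, attempt 3, and so on.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    delays.push(policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry));
  }
  return delays;
}

// Per-activity policies matching the retry counts described above.
const chargeCardPolicy: RetryPolicy = { maximumAttempts: 2, initialIntervalMs: 1000, backoffCoefficient: 2 };
const reserveInventoryPolicy: RetryPolicy = { maximumAttempts: 3, initialIntervalMs: 1000, backoffCoefficient: 2 };
const sendEmailPolicy: RetryPolicy = { maximumAttempts: 4, initialIntervalMs: 1000, backoffCoefficient: 2 };

console.log(retryDelays(chargeCardPolicy).join(', '));       // 1000
console.log(retryDelays(reserveInventoryPolicy).join(', ')); // 1000, 2000
console.log(retryDelays(sendEmailPolicy).join(', '));        // 1000, 2000, 4000
```

In a workflow, each of these would typically become its own proxyActivities call with the matching retry block, so that one workflow can hold several policies side by side.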

The Determinism Constraint: What You Cannot Do in Workflow Code

Because Temporal replays workflow code to recover state, the workflow function must be deterministic. This means you cannot use non-deterministic constructs inside workflow code. This constraint catches every new Temporal developer at least once:

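// WRONG: wall-clock time, unseeded randomness, and direct I/O in workflow code.
// (Temporal's TypeScript sandbox swaps Date and Math.random for deterministic
// versions, so those two are safe there; other SDKs handle this differently,
// so the habit of avoiding all four is worth keeping.)
export async function myWorkflow() {
  const now = new Date();           // Wall-clock time differs on replay
  const id = crypto.randomUUID();   // Different on replay!
  const data = await fetch('...');  // Side effect, not recorded!
  if (Math.random() > 0.5) { ... }  // Unseeded randomness differs on replay
}

// RIGHT: use Temporal's deterministic alternatives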
import { sleep, uuid4 } from '@temporalio/workflow';

export async function myWorkflow() {
  // For UUIDs, use Temporal's deterministic UUID generator
  const id = uuid4();

  // For external data, use an activity
  const data = await fetchData(); // Activity — result recorded and replayed

  // For delays, use Temporal's sleep (survives process restarts)
  await sleep('24 hours'); // Process can crash; timer persists in Temporal server

  // For time-based decisions, read the clock through the SDK: in the TypeScript
  // sandbox, Date.now() returns workflow time, which is the same on every replay
}

The sleep('24 hours') is one of the most powerful features. In regular code, a 24-hour sleep is destroyed when the process restarts. In Temporal, the timer is stored on the Temporal server. The workflow function resumes 24 hours later, even if the worker process has been restarted, redeployed, or moved to a different machine. We use this for drip email campaigns (send welcome email, wait 3 days, send onboarding guide, wait 7 days, send feature highlight), trial expiration workflows (start trial, wait 14 days, check conversion, send expiration warning), and scheduled report generation.
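The reason a durable timer survives restarts is that the wake-up time is stored outside the worker. A minimal sketch of that idea follows; the `TimerStore` class is a hypothetical in-memory stand-in for server-side timer storage, not how Temporal implements timers internally.

```typescript
interface DurableTimer {
  workflowId: string;
  fireAtMs: number;
}

// Stands in for the server-side timer store; it outlives any single worker process.
class TimerStore {
  private timers: DurableTimer[] = [];

  schedule(workflowId: string, nowMs: number, delayMs: number): void {
    this.timers.push({ workflowId, fireAtMs: nowMs + delayMs });
  }

  // Returns the workflows whose timers are due and removes those timers.
  pollDue(nowMs: number): string[] {
    const due = this.timers.filter((t) => t.fireAtMs <= nowMs);
    this.timers = this.timers.filter((t) => t.fireAtMs > nowMs);
    return due.map((t) => t.workflowId);
  }
}

const HOUR_MS = 60 * 60 * 1000;
const store = new TimerStore();
store.schedule('trial-42', 0, 24 * HOUR_MS); // the workflow sleeps for 24 hours at t = 0

// The worker that started the workflow can exit here; the timer never lived in it.
const dueAt23h = store.pollDue(23 * HOUR_MS); // a fresh worker polls too early
const dueAt24h = store.pollDue(24 * HOUR_MS); // and again once the timer is due

console.log(dueAt23h.length); // 0
console.log(dueAt24h[0]);     // trial-42
```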

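The determinism constraint also means workflow code cannot depend on wall-clock time, unseeded randomness, or any function that produces different output on different invocations. Temporal’s TypeScript SDK helps here by running workflow code in a sandbox that replaces Date and Math.random with deterministic versions (Date.now() returns workflow time, and Math.random() is seeded per workflow), and imports of Node.js built-in modules are rejected when the workflow code is bundled. SDKs for other languages expose explicit workflow-time and random APIs instead. When workflow code does diverge from its recorded history, replay fails with a non-determinism error rather than silently taking a different path.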
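A small sketch shows why divergence is detectable at all: on replay, the steps the code requests are compared against the steps history recorded. The `replay` function and `notifyWorkflow` below are hypothetical, and the coin flip is passed in as a parameter purely so the example is repeatable.

```typescript
type StepFn = (name: string) => void;

// Re-runs workflow code against the step names recorded on its first execution.
function replay(recorded: string[], workflow: (step: StepFn) => void): void {
  let cursor = 0;
  workflow((name) => {
    const expected = recorded[cursor];
    cursor += 1;
    if (expected !== undefined && expected !== name) {
      throw new Error(`non-determinism: history has "${expected}", code requested "${name}"`);
    }
  });
}

// In real workflow code the coin flip would be something like an unseeded
// random number or a wall-clock comparison.
function notifyWorkflow(coinFlip: boolean): (step: StepFn) => void {
  return (step) => {
    step('load-user');
    step(coinFlip ? 'send-sms' : 'send-email');
  };
}

const recorded = ['load-user', 'send-email']; // the first execution flipped false

replay(recorded, notifyWorkflow(false)); // same decisions: replay succeeds

let replayError = '';
try {
  replay(recorded, notifyWorkflow(true)); // different decision on replay
} catch (err) {
  replayError = (err as Error).message;
}

console.log(replayError);
// non-determinism: history has "send-email", code requested "send-sms"
```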

Inngest: Durable Workflows Without Infrastructure

Temporal is powerful but requires running the Temporal server (either self-hosted or Temporal Cloud at $200+/month). For simpler workflows where you do not want to manage additional infrastructure, Inngest provides durable execution as a service with a serverless model:

// Inngest — durable workflow without managing infrastructure
import { Inngest } from 'inngest';
import Stripe from 'stripe';

const inngest = new Inngest({ id: 'harbor-app' });
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export const fulfillOrder = inngest.createFunction(
  { id: 'fulfill-order', retries: 3 },
  { event: 'order/placed' },
  async ({ event, step }) => {
    const order = event.data;

    // Each step.run() call is individually retried and recorded
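    const charge = await step.run('charge-card', async () => {
      return await stripe.charges.create(
        {
          amount: Math.round(order.total * 100), // dollars to cents
          currency: 'usd',
          customer: order.customerId
        },
        // A retried step must not create a second charge
        { idempotencyKey: `charge-${order.id}` }
      );
    });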

    await step.run('reserve-inventory', async () => {
      await reserveInventory(order.items);
    });

    // Wait 2 minutes before sending confirmation (order review period)
    await step.sleep('order-review-delay', '2 minutes');

    await step.run('send-confirmation', async () => {
      await sendConfirmationEmail(order.customerId, order.id);
    });

    await step.run('schedule-shipping', async () => {
      await scheduleShipping(order.id, order.items);
    });
  }
);

Inngest’s step.run() and step.sleep() provide the same durable execution guarantees as Temporal’s activities and sleep, but without running any additional infrastructure. Inngest calls your serverless function (Vercel, AWS Lambda, Netlify, etc.) once per step, and the function returns after each step completes. The next step is executed in a new function invocation. This model works naturally with serverless platforms where functions have short execution limits.
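The one-invocation-per-step model can be sketched as follows. This is a simplified illustration under stated assumptions, not Inngest's actual protocol: `invokeOnce`, the `state` object, and the `STEP_FINISHED` sentinel are invented for the example, and the steps are synchronous for brevity.

```typescript
type StepState = Record<string, unknown>;
type RunStep = <T>(id: string, body: () => T) => T;

// Thrown to end an invocation as soon as one new step has finished.
const STEP_FINISHED = { kind: 'step-finished' };

// One invocation: answers finished steps from state and runs at most one new step.
// Returns true once the function body has run to the end.
function invokeOnce(state: StepState, fn: (run: RunStep) => void): boolean {
  function run<T>(id: string, body: () => T): T {
    if (id in state) return state[id] as T; // finished in an earlier invocation
    state[id] = body();                     // run exactly one new step...
    throw STEP_FINISHED;                    // ...then hand control back to the platform
  }
  try {
    fn(run);
    return true;
  } catch (err) {
    if (err === STEP_FINISHED) return false;
    throw err;
  }
}

const state: StepState = {};
const executed: string[] = [];
let invocations = 0;
let finished = false;

// The platform keeps calling the function until it completes.
while (!finished) {
  invocations += 1;
  finished = invokeOnce(state, (run) => {
    const chargeId = run('charge-card', () => { executed.push('charge-card'); return 'ch_123'; });
    run('reserve-inventory', () => { executed.push('reserve-inventory'); return true; });
    run('send-confirmation', () => { executed.push(`send-confirmation:${chargeId}`); return true; });
  });
}

console.log(invocations);         // 4
console.log(executed.join(', ')); // charge-card, reserve-inventory, send-confirmation:ch_123
```

Three steps take four invocations here: one per step, plus a final pass in which every step is answered from recorded state and the function body reaches its end.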

The tradeoff is latency between steps. Each step involves an HTTP call from Inngest’s platform to your function, which adds 100-500ms of overhead per step. For a workflow with 5 steps, that is 500ms-2.5s of added overhead. Temporal activities also round-trip through the Temporal server for scheduling and result recording, but long-lived workers polling a task queue typically keep that per-step overhead much lower. For most business workflows (order processing, onboarding, notifications), the Inngest overhead is imperceptible. For high-frequency, latency-sensitive workflows (real-time data processing, interactive applications), it matters.

Choosing Between Temporal and Inngest

After running both in production for over a year, our decision framework is straightforward:

Use Temporal when:

  • Workflows have complex branching, loops, or child workflows
  • You need low per-step latency (Inngest adds ~100-500ms of overhead per step)
  • You need to query workflow state in real time (Temporal has a powerful query API)
  • Your team can manage the Temporal server infrastructure (or budget $200+/month for Temporal Cloud)
  • Workflows run for hours or days with many steps (50+ steps per workflow)
  • You need to signal running workflows to change behavior (Temporal signals are unique and powerful)

Use Inngest when:

  • Workflows are linear or have simple branching
  • You want zero infrastructure management
  • You deploy on serverless platforms (Vercel, Lambda, etc.)
  • Step latency of 100-500ms is acceptable
  • You want to get started in an afternoon rather than a week
  • Your workflows have fewer than 20 steps

We use Temporal for our payment processing, model training orchestration, and data pipeline workflows (complex, many steps, latency-sensitive). We use Inngest for user onboarding sequences, notification workflows, and scheduled report generation (linear, few steps, latency-tolerant). The dividing line is complexity and latency sensitivity.

Error Handling: Sagas and Compensating Actions

Durable workflows do not have the luxury of database transactions that span multiple services. When step 3 of 5 fails, you cannot roll back steps 1 and 2 with a ROLLBACK statement because those steps executed against external systems (payment processor, inventory service, email provider) that do not participate in your database transaction. Instead, you implement compensating actions — reverse operations that undo the effects of previously completed steps.

This is the saga pattern, and it is the standard approach for maintaining consistency across distributed systems. Each step in the workflow has a corresponding compensating action:

// Saga pattern — each step has a compensating action
const sagaSteps = [
  {
    name: 'charge-card',
    execute: () => chargeCard(order.customerId, order.total),
    compensate: (result) => refundCharge(result.id) // chargeCard returns { id, status }
  },
  {
    name: 'reserve-inventory',
    execute: () => reserveInventory(order.items),
    compensate: () => releaseInventory(order.items)
  },
  {
    name: 'send-confirmation',
    execute: () => sendConfirmationEmail(order.customerId, order.id),
    compensate: () => sendCancellationEmail(order.customerId, order.id)
  },
  {
    name: 'schedule-shipping',
    execute: () => scheduleShipping(order.id, order.items),
    compensate: (result) => cancelShipment(result.shipmentId)
  }
];

When a step fails after all retries are exhausted, the workflow executes compensating actions for all previously completed steps in reverse order. This is not a perfect rollback — there is a window where the system is in an inconsistent state (card charged but order not fulfilled), and some compensating actions are imperfect (you cannot un-send an email, only send a cancellation email). But it is dramatically better than the alternative of manual intervention, which is what happens without durable workflows.
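A minimal executor for this pattern might look like the sketch below. It is synchronous for brevity (real activities would be awaited), and the names `prepare`, `runSaga`, and the in-memory `events` log are assumptions for the example rather than part of any workflow SDK.

```typescript
interface SagaStep<R> {
  name: string;
  execute: () => R;
  compensate: (result: R) => void;
}

// Type-erased form: running a step returns the thunk that undoes it.
interface PreparedStep {
  name: string;
  run: () => () => void;
}

function prepare<R>(step: SagaStep<R>): PreparedStep {
  return {
    name: step.name,
    run: () => {
      const result = step.execute();
      return () => step.compensate(result);
    },
  };
}

// Runs steps in order; on failure, undoes the completed ones in reverse and rethrows.
function runSaga(steps: PreparedStep[]): void {
  const undoStack: Array<() => void> = [];
  try {
    for (const step of steps) {
      undoStack.push(step.run());
    }
  } catch (err) {
    while (undoStack.length > 0) {
      const undo = undoStack.pop();
      if (undo) undo();
    }
    throw err;
  }
}

const events: string[] = [];
const steps: PreparedStep[] = [
  prepare({
    name: 'charge-card',
    execute: () => { events.push('charge'); return { id: 'ch_123' }; },
    compensate: (charge) => { events.push(`refund ${charge.id}`); },
  }),
  prepare({
    name: 'reserve-inventory',
    execute: () => { events.push('reserve'); },
    compensate: () => { events.push('release'); },
  }),
  prepare({
    name: 'send-confirmation',
    execute: (): void => { throw new Error('email provider down'); },
    compensate: () => { events.push('send cancellation'); },
  }),
];

let failure = '';
try {
  runSaga(steps);
} catch (err) {
  failure = (err as Error).message;
}

console.log(events.join(' -> ')); // charge -> reserve -> release -> refund ch_123
console.log(failure);             // email provider down
```

The failed step itself is not compensated, because it never completed; only the steps that pushed an undo thunk are unwound.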

The order of compensation matters. If the charge and inventory reservation both succeeded but the email failed, release the inventory before refunding the charge. Later steps usually depend on earlier ones (inventory was reserved because payment succeeded), so unwinding in reverse order undoes dependents before the things they depend on and never leaves, say, stock reserved against an order that has already been refunded. Compensating actions run under the same retry machinery as forward steps, so each one should be idempotent: releasing already-released inventory should be a no-op, and a refund should carry an idempotency key so that a retried refund cannot be issued twice.
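Because a compensation can itself be retried, it helps to make the non-idempotent ones safe to repeat. The sketch below guards a refund with a key derived from the charge ID; the in-memory `refundsByKey` ledger is a hypothetical stand-in for whatever the payment provider or your own database records.

```typescript
interface Refund {
  chargeId: string;
  amountCents: number;
}

// Hypothetical refund ledger keyed by idempotency key.
const refundsByKey: Record<string, Refund> = {};
let refundsIssued = 0; // counts refunds that actually moved money

function refundCharge(chargeId: string, amountCents: number): Refund {
  const key = `refund-${chargeId}`; // stable across retries of the same compensation
  const existing = refundsByKey[key];
  if (existing) {
    return existing;                // retry: return the earlier outcome, move no money
  }
  refundsIssued += 1;               // first call: actually issue the refund
  const refund: Refund = { chargeId, amountCents };
  refundsByKey[key] = refund;
  return refund;
}

refundCharge('ch_123', 4999);
refundCharge('ch_123', 4999); // retried compensation after a timeout

console.log(refundsIssued); // 1
```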

Some compensating actions are impossible to implement perfectly. You cannot un-send a notification, un-publish a social media post (well, you can delete it, but people may have already seen it), or un-trigger a physical process (a warehouse worker may have already picked the items). For these cases, the compensating action is a best-effort approximation: send a correction notification, delete the post, or create a return shipment. The key insight is that imperfect compensation is still vastly better than no compensation, which leaves the system in a silently inconsistent state that requires manual investigation to resolve.

Monitoring Durable Workflows

Durable workflows introduce a new category of operational concern: workflow health. You need to monitor not just “is the service up” but “are workflows progressing, completing, and not getting stuck.” We track four metrics and alert on specific thresholds:

  • Workflow completion rate: Percentage of started workflows that reach a terminal state (completed or failed) within their expected duration. Target: >99%. Alert at <98%.
  • Step failure rate: Percentage of individual step executions that fail (before retries succeed). A rising step failure rate usually indicates a downstream service degradation. Alert at >5% over a 15-minute window.
  • Workflow duration p95: How long workflows take to complete. A spike here means a step is slow or a retry loop is burning time. Alert at >2x the historical baseline.
  • Active workflow count: How many workflows are currently running. A growing count with a stable start rate means workflows are not completing. Alert when count exceeds 2x the normal baseline for more than 10 minutes.
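The thresholds above translate directly into an alert check. The following is a sketch only: the `WorkflowMetrics` shape and field names are assumptions for the example, and in practice these values would come from your metrics backend rather than a literal object.

```typescript
interface WorkflowMetrics {
  completionRate: number;             // fraction (0 to 1) finishing within expected duration
  stepFailureRate15m: number;         // fraction of step executions failing, 15-minute window
  durationP95Ms: number;
  baselineP95Ms: number;
  activeCount: number;
  baselineActiveCount: number;
  minutesAboveActiveBaseline: number; // how long activeCount has exceeded 2x baseline
}

// Returns the names of the alerts that should fire for this snapshot.
function evaluateAlerts(m: WorkflowMetrics): string[] {
  const alerts: string[] = [];
  if (m.completionRate < 0.98) alerts.push('completion-rate');
  if (m.stepFailureRate15m > 0.05) alerts.push('step-failure-rate');
  if (m.durationP95Ms > 2 * m.baselineP95Ms) alerts.push('duration-p95');
  if (m.activeCount > 2 * m.baselineActiveCount && m.minutesAboveActiveBaseline > 10) {
    alerts.push('active-count');
  }
  return alerts;
}

const snapshot: WorkflowMetrics = {
  completionRate: 0.995,         // healthy
  stepFailureRate15m: 0.08,      // above the 5% threshold
  durationP95Ms: 9000,           // more than 2x the 4000ms baseline
  baselineP95Ms: 4000,
  activeCount: 150,              // below 2x the baseline of 100
  baselineActiveCount: 100,
  minutesAboveActiveBaseline: 0,
};

console.log(evaluateAlerts(snapshot).join(', ')); // step-failure-rate, duration-p95
```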

Both Temporal and Inngest provide built-in dashboards for these metrics. Temporal’s Web UI shows workflow execution history with step-by-step timing, which is invaluable for debugging stuck workflows — you can see exactly which step is pending, how many retries it has attempted, and what error it received. Inngest’s dashboard shows function run history with per-step logs and automatic alerting on failure rates. For production alerting beyond the built-in dashboards, we export metrics to Datadog and set up PagerDuty integrations for the critical alerts.

Durable workflows transform unreliable multi-step processes into reliable, observable, recoverable systems. The initial learning curve is real — understanding determinism constraints, designing idempotent activities, and structuring compensating actions takes practice. But the alternative — manually implementing state tracking, retry logic, and recovery procedures for every multi-step process — is more expensive in engineering time and far more expensive in production incidents. Start with one critical workflow, prove the model, and expand from there.
