Durable Workflows in Kubernetes

When you enable State Persistence, Quarkus Flow stores enough information in the database to restore paused or abruptly ended workflows, resuming execution from the last successfully completed task.

However, to safely resume a workflow, the engine needs a deterministic "worker identity" (WorkflowApplication ID) so it knows which instance owns the execution.

In Kubernetes, Pod names and IPs are ephemeral. If a WorkflowApplication ID is derived from a Pod name, a rolling update or node drain will permanently destroy that ID, leaving the paused workflows "orphaned" in the database because no surviving worker claims that exact ID.

This guide explains how to solve this using Kubernetes Lease-based Coordination to maintain stable worker identities across pod disruptions.

1. The Architecture: Lease-Based Coordination

To decouple the workflow engine’s identity from the ephemeral Pod identity, Quarkus Flow binds the WorkflowApplication ID to a Kubernetes Lease name (e.g., flow-pool-member-durable-flow-00).

Unlike Pods, Leases are stable Kubernetes resources (coordination.k8s.io/v1).
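The example Lease name above suggests a fixed prefix, the pool name, and a zero-padded slot index. The sketch below illustrates that mapping. The exact format (two-digit padding in particular) is inferred from the single example in this guide, and the `LeaseNames` class is hypothetical, not part of the extension.

```java
import java.util.Locale;

// Hypothetical helper: not part of Quarkus Flow. The naming pattern is
// inferred from the example "flow-pool-member-durable-flow-00".
final class LeaseNames {

    private static final String MEMBER_PREFIX = "flow-pool-member-";

    private LeaseNames() {
    }

    /** Builds a member Lease name from a pool name and a slot index. */
    static String memberLeaseName(String poolName, int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        // %02d pads the index to at least two digits: 0 -> "00", 7 -> "07".
        return String.format(Locale.ROOT, "%s%s-%02d", MEMBER_PREFIX, poolName, index);
    }

    public static void main(String[] args) {
        // A Pod holding this Lease would use the same string as its
        // WorkflowApplication ID.
        System.out.println(memberLeaseName("durable-flow", 0));
    }
}
```

Because the name depends only on the pool name and the slot index, it stays the same no matter which Pod currently holds the Lease.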

The architecture works as follows:

The Leader: One Pod in the deployment acts as the leader. It monitors the Deployment replica count and creates an exact matching number of empty "Member Leases".
The Members: Every Pod in the deployment attempts to acquire exactly one Member Lease.
The Identity Binding: Once a Pod acquires a Lease, it sets its internal WorkflowApplication ID to that Lease’s name. It continuously sends heartbeats to renew the lease.
The Failover: If a Pod crashes, its heartbeat stops. The Lease expires, and when a new Pod spins up to replace the crashed one, it claims the abandoned Lease. The new Pod adopts the exact same WorkflowApplication ID, allowing it to seamlessly resume the orphaned workflows from the database.
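The four roles above can be modelled in a few lines of plain Java. This is a minimal in-memory sketch, not the extension's implementation: the real Leases live in the Kubernetes API server, and the class name, method names, lease naming format, and the 15-second lease duration are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// In-memory model of the member Lease pool (illustration only; none of
// these names come from Quarkus Flow).
final class LeasePoolSketch {

    /** One Lease record: who holds it and when it was last renewed. */
    private static final class Lease {
        String holder;     // Pod name, or null if never claimed
        long renewedAtMs;  // time of the last heartbeat
    }

    private final Map<String, Lease> leases = new LinkedHashMap<>();
    private final long leaseDurationMs;

    /** Leader role: create one empty member Lease per replica. */
    LeasePoolSketch(String poolName, int replicas, long leaseDurationMs) {
        this.leaseDurationMs = leaseDurationMs;
        for (int i = 0; i < replicas; i++) {
            String name = String.format(Locale.ROOT, "flow-pool-member-%s-%02d", poolName, i);
            leases.put(name, new Lease());
        }
    }

    /** Member role: claim the first free or expired Lease, if any. */
    Optional<String> acquire(String podName, long nowMs) {
        for (Map.Entry<String, Lease> entry : leases.entrySet()) {
            Lease lease = entry.getValue();
            boolean expired = nowMs - lease.renewedAtMs > leaseDurationMs;
            if (lease.holder == null || expired) {
                lease.holder = podName;
                lease.renewedAtMs = nowMs;
                // The Lease name becomes the Pod's WorkflowApplication ID.
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    /** Heartbeat: only the current holder may renew. */
    boolean renew(String leaseName, String podName, long nowMs) {
        Lease lease = leases.get(leaseName);
        if (lease == null || !podName.equals(lease.holder)) {
            return false;
        }
        lease.renewedAtMs = nowMs;
        return true;
    }

    public static void main(String[] args) {
        LeasePoolSketch pool = new LeasePoolSketch("durable-flow", 1, 15_000);
        String original = pool.acquire("pod-a", 0).orElseThrow(IllegalStateException::new);
        // pod-a crashes and stops renewing; pod-b starts 20 seconds later.
        String adopted = pool.acquire("pod-b", 20_000).orElseThrow(IllegalStateException::new);
        System.out.println(original.equals(adopted)); // prints true
    }
}
```

The replacement Pod ends up with the same Lease name, and therefore the same identity, as the Pod it replaced. That is what lets it pick up the workflows recorded under that ID.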

2. Add the Durable Kubernetes Extension

To enable this architecture, add the following dependency to your pom.xml:

<dependency>
  <groupId>io.quarkiverse.flow</groupId>
  <artifactId>quarkus-flow-durable-kubernetes</artifactId>
</dependency>

(You will likely also want quarkus-kubernetes to auto-generate your deployment manifests.)

3. Configure the Pool and Readiness Probes

Define the name of your Lease pool in application.properties.

Crucially, you should gate your Kubernetes Readiness Probe on the lease acquisition. If a Pod hasn’t acquired a lease, it doesn’t have an identity, and therefore shouldn’t receive network traffic or pull workflows from the database.

# The pool name used to generate the Lease resources
quarkus.flow.durable.kube.pool.name=durable-flow

# Gate the Pod's Readiness status on successfully acquiring a lease
quarkus.flow.durable.kube.health.readiness.require-lease=true

4. Configure Kubernetes Manifests

4.1 Expose Pod Identity (Downward API)

The lease mechanism uses the Pod name as the holderIdentity inside the Lease resource. You must expose the Pod name and namespace to the application using the Kubernetes Downward API.

If you write your own YAML manifests, ensure your container environment variables include:

env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
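Once these entries are in place, the values are ordinary environment variables inside the container. The sketch below only illustrates what the JVM sees and how a missing variable could be detected early. The `PodIdentitySketch` class is hypothetical, and how the extension itself consumes the variables is not shown here.

```java
import java.util.Map;

// Hypothetical sketch: illustrates what the Downward API makes visible
// to the JVM. It is not how the extension is implemented.
final class PodIdentitySketch {

    private PodIdentitySketch() {
    }

    /** Returns "namespace/podName", failing fast if a variable is missing. */
    static String describe(Map<String, String> env) {
        String podName = require(env, "POD_NAME");
        String namespace = require(env, "POD_NAMESPACE");
        return namespace + "/" + podName;
    }

    private static String require(Map<String, String> env, String key) {
        String value = env.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException(key + " is not set; check the Downward API env entries");
        }
        return value;
    }

    public static void main(String[] args) {
        // System.getenv() returns the process environment as a Map.
        // Outside a Pod configured as above, this throws IllegalStateException.
        System.out.println(describe(System.getenv()));
    }
}
```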

4.2 Configure RBAC Permissions

Your Pods need permission to manage Leases and to read Pods, Deployments, and ReplicaSets (the leader reads the Deployment to count replicas).

If you use Quarkus to auto-generate your Kubernetes manifests, you can automatically generate the correct ServiceAccount, Role, and RoleBinding by adding these properties:

# Create and use a custom ServiceAccount
quarkus.kubernetes.rbac.service-accounts.durable-flow-sa.use-as-default=true

# Allow managing Leases
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.0.api-groups=coordination.k8s.io
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.0.resources=leases
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.0.verbs=get,list,watch,create,update,patch,delete

# Allow reading Pods (Pods belong to the core API group, so api-groups is left empty)
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.1.api-groups=
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.1.resources=pods
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.1.verbs=get,list,watch

# Allow reading Deployments and ReplicaSets
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.2.api-groups=apps
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.2.resources=deployments,replicasets
quarkus.kubernetes.rbac.roles.durable-flow-sa.policy-rules.2.verbs=get,list,watch

# Bind the Role to the ServiceAccount
quarkus.kubernetes.rbac.role-bindings.durable-flow-sa.subjects.durable-flow-sa.kind=ServiceAccount
quarkus.kubernetes.rbac.role-bindings.durable-flow-sa.role-name=durable-flow-sa

5. Configure Deployment Strategy (Deadlock Prevention)

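Because Quarkus Flow relies on exclusive Leases, it conflicts with the default RollingUpdate behavior of Kubernetes. With the default settings (maxSurge and maxUnavailable of 25%, which resolve to 1 and 0 for deployments of up to three replicas), Kubernetes starts a new Pod (V2) and waits for it to become Ready before killing an old Pod (V1). The number of Member Leases matches the replica count, so every Lease is still held by a V1 Pod. Because V2 requires a Lease to become Ready, and V1 is not terminated until V2 is Ready, the deployment deadlocks.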

To prevent this, you must configure your deployment to terminate old Pods before starting new ones.
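The arithmetic behind the deadlock can be sketched as follows. Kubernetes defaults both maxSurge and maxUnavailable to 25%, rounding maxSurge up and maxUnavailable down. The model below is deliberately simplified: it assumes every old Pod holds a Member Lease and that a new Pod can only become Ready after an old Pod has been terminated. The `RolloutSketch` class and its method names are hypothetical.

```java
// Simplified model of a RollingUpdate, for illustration only. It assumes
// every old Pod holds a member Lease and that a new Pod cannot become
// Ready until an old Pod has released one.
final class RolloutSketch {

    private RolloutSketch() {
    }

    /** Kubernetes rounds a percentage maxUnavailable down. */
    static int maxUnavailableFromPercent(int replicas, int percent) {
        return (replicas * percent) / 100;
    }

    /** Kubernetes rounds a percentage maxSurge up. */
    static int maxSurgeFromPercent(int replicas, int percent) {
        return (replicas * percent + 99) / 100;
    }

    /**
     * With maxUnavailable == 0, Kubernetes keeps every old Pod until a new
     * Pod is Ready. The new Pod cannot become Ready without a Lease, so the
     * rollout never progresses.
     */
    static boolean wouldDeadlock(int maxUnavailable) {
        return maxUnavailable == 0;
    }

    public static void main(String[] args) {
        for (int replicas : new int[] {1, 3, 4}) {
            int unavailable = maxUnavailableFromPercent(replicas, 25);
            System.out.printf("replicas=%d default maxUnavailable=%d deadlock=%b%n",
                    replicas, unavailable, wouldDeadlock(unavailable));
        }
    }
}
```

Under this model, deployments with one to three replicas deadlock with the default settings, while four replicas resolve to maxUnavailable=1 and can progress. Setting the value explicitly avoids depending on that rounding.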

Single Replica Deployments (`replicas: 1`)

If you are running a single instance, you must change the deployment strategy to Recreate. With this strategy, Kubernetes terminates the old Pod before creating the new one, freeing the lease so the new Pod can acquire it on startup.

Note: This will result in a brief period of downtime during deployments while the old pod shuts down and the new one boots up.

If you use the Quarkus Kubernetes extension, set this in your application.properties:

# Changes the strategy from RollingUpdate to Recreate
quarkus.kubernetes.strategy=recreate

If you write your own YAML:

spec:
  replicas: 1
  strategy:
    type: Recreate

Multi-Replica Deployments (`replicas: > 1`)

If you are running multiple replicas, you can keep the RollingUpdate strategy, but you must set maxUnavailable to at least 1. This allows Kubernetes to terminate an old Pod without waiting for a new Pod to become Ready, releasing its lease for the new Pod to claim.

If you use the Quarkus Kubernetes extension, set this in your application.properties:

# Allows one old Pod to be terminated up front, releasing its lease
quarkus.kubernetes.rolling-update.max-unavailable=1
quarkus.kubernetes.rolling-update.max-surge=1

If you write your own YAML:

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

6. Verify the Architecture in Production

Once deployed, you can use kubectl to verify that leader election and lease assignment are working correctly.

View the Leases

kubectl get lease -l io.quarkiverse.flow.durable.k8s/pool=durable-flow -o wide

You should see one leader lease, and exactly as many member leases as you have Pod replicas.

Check Readiness

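Execute a health check directly against a running Pod (this assumes curl is available in the container image; replace my-flow-app with the name of your Deployment):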

kubectl exec -it deploy/my-flow-app -- curl -s localhost:8080/q/health/ready

If configured correctly, the JSON output will include "leaseAcquired": true and "leaseName": "flow-pool-member-durable-flow-00".
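If you want to automate this check in a script or test, a small helper can look for those two fields in the response body. The sketch below is hypothetical: only the field names `leaseAcquired` and `leaseName` come from this guide. The rest of the response layout is not assumed, so the helper uses pattern matching rather than a full JSON parser, and the sample payload in `main` is made up.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scripts or tests. Only the two field names come
// from this guide; the surrounding JSON structure is not assumed.
final class ReadinessCheckSketch {

    private static final Pattern LEASE_ACQUIRED =
            Pattern.compile("\"leaseAcquired\"\\s*:\\s*true");
    private static final Pattern LEASE_NAME =
            Pattern.compile("\"leaseName\"\\s*:\\s*\"([^\"]+)\"");

    private ReadinessCheckSketch() {
    }

    /** Returns the lease name if the body reports an acquired lease. */
    static Optional<String> acquiredLease(String healthJson) {
        if (!LEASE_ACQUIRED.matcher(healthJson).find()) {
            return Optional.empty();
        }
        Matcher name = LEASE_NAME.matcher(healthJson);
        return name.find() ? Optional.of(name.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up payload fragment for demonstration.
        String body = "{\"leaseAcquired\": true, "
                + "\"leaseName\": \"flow-pool-member-durable-flow-00\"}";
        System.out.println(acquiredLease(body).orElse("no lease"));
    }
}
```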

Test Failover

To see the durability in action, manually delete a Pod holding a lease:

# Find which Pod holds lease '00'
POD=$(kubectl get lease flow-pool-member-durable-flow-00 -o jsonpath='{.spec.holderIdentity}')

# Delete that Pod
kubectl delete pod "$POD"

# Watch the lease seamlessly transfer to the new replacement Pod
kubectl get lease flow-pool-member-durable-flow-00 -w