Scheduler resilience

How Leoflow keeps the scheduler honest when something goes wrong: a process dies, an agent goes silent, a dispatch is lost in flight. The control plane ships three reapers: small, single-purpose loops that turn stuck state back into observable terminal state, so the dashboard never lies about what's actually running.

This applies to both editions: Lite (single-process) and Pro (multi-replica with leader election). The reapers run only on the leader, because reaping writes state and we want one writer across the fleet.

Recovery SLAs

| Failure mode | Detected by | Default SLA | What happens |
|---|---|---|---|
| Task code wedged past its declared execution_timeout_seconds | Agent itself (#194) | execution_timeout_seconds (per-task) | TI failed with execution_timeout: task exceeded N. Retries kick in if budget remains. |
| Agent process crashed mid-task (TI in running, no heartbeat) | TI heartbeat reaper (#128) | 90 s | TI failed with agent_lost. Retries kick in if budget remains. |
| Scheduler crashed before dispatching (TI stuck in queued) | Dispatch-lost reaper (#202) | 3 min | TI failed with dispatch_lost. Frees the run for the orphan reaper on the next tick. |
| Run stuck running with no active TIs (post-crash limbo) | Orphan-run reaper (#120) | 5 min | Run failed with orphaned; any remaining active TIs flipped to failed. |

Worst case end-to-end: a mid-tick scheduler crash that leaves TIs queued is fully reaped within max(3 min, 5 min) = 5 min. The dispatch-lost reaper runs first, then the orphan-run reaper picks up the run, which by then has no active TIs, on the next tick.

Tuning the thresholds

Defaults are conservative. For tighter recovery on a fast-failing workload, override via the scheduler interfaces:

sched.SetAgentLostThreshold(30 * time.Second)
sched.SetDispatchLostThreshold(1 * time.Minute)
sched.SetOrphanThreshold(2 * time.Minute)

For most users the defaults are correct: too-tight thresholds risk reaping a legitimately slow dispatch (Kubernetes pod-pull latency under contention) or a busy agent.

The "do no harm" rule

Each reaper requires a positive observable signal before failing anything:

  • TI heartbeat reaper: only fires on TIs that did heartbeat at least once and then went silent. A TI that never heartbeated (e.g. an inline http_api task with no agent) is left alone.
  • Dispatch-lost reaper: requires a non-zero queued_at older than the threshold. A TI without that stamp is too poorly observed to reap.
  • Orphan-run reaper: requires state = 'running' AND no active TI on the run. A run with any TI in scheduled/queued/running is left alone (the dispatch-lost reaper unblocks this case by failing the stuck queued TIs first, so the next tick sees no active TIs).

This rule is the load-bearing invariant: the reapers can never kill a live execution (ADR 0031). The cost is that recovery is bounded by the slowest reaper that applies, not the fastest. That is usually fine and sometimes worth tuning.

Observability

Each reap action is metered as a scheduler decision. Watch these labels in your Prometheus dashboard:

| Metric label | Meaning |
|---|---|
| agent_lost | TI failed by the heartbeat reaper |
| dispatch_lost | TI failed by the dispatch-lost reaper |
| orphan_reaped | Run failed by the orphan-run reaper |
| agent_lost_list_error, dispatch_lost_list_error, orphan_list_error | Reaper's list query failed; next tick will retry |

A sustained non-zero rate on any of these is worth investigating. Reapers are backstops, not the primary path; if they fire often, something upstream is broken.
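One way to make "sustained" concrete is to require that a label's counter grew in every one of the last n scrape intervals. The sketch below spells that out over raw counter readings; the function name and the definition of "sustained" are assumptions for illustration, and in practice a Prometheus alert over the counter's rate would normally do this job.

```go
package main

// sustainedNonZero reports whether a counter grew in each of the last n
// sampling intervals. samples holds counter readings, oldest first.
// A flat or decreasing step (for example after a counter reset) breaks the
// streak and yields false.
func sustainedNonZero(samples []float64, n int) bool {
	if n <= 0 || len(samples) < n+1 {
		return false // not enough history to judge
	}
	recent := samples[len(samples)-n-1:]
	for i := 1; i < len(recent); i++ {
		if recent[i] <= recent[i-1] {
			return false
		}
	}
	return true
}
```

A single reap shows up as one isolated increment and does not trip this check; a reaper firing on every interval does.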

What's NOT a scheduler concern

  • Postgres unreachable: the scheduler's Heartbeat() goes unhealthy; the /monitor/health endpoint surfaces it; runs queue up and resume when the DB returns.
  • Agent's task container OOM-killed: surfaces as a non-zero exit code through the agent (if it survived) or as agent_lost (if the agent went with it).
  • K8s API outage: pods stay where they are; new dispatches fail at the executor layer (visible as the dispatch_failed metric on the BufferedDispatcher); the dispatch-lost reaper does NOT fire (queued_at is fresh, and the issue is the API, not the scheduler).

See also: ADR 0009 (leader election), ADR 0031 (scheduler architecture).