Scheduler resilience
How Leoflow keeps the scheduler honest when something goes wrong: a process dies, an agent goes silent, a dispatch is lost in flight. The control plane ships three reapers, small single-purpose loops that turn stuck state back into observable terminal state, so the dashboard never lies about what's actually running.
This applies to both editions: Lite (single-process) and Pro (multi-replica with leader election). The reapers run only on the leader: reaping writes state, and we want one writer across the fleet.
Recovery SLAs
| Failure mode | Detected by | Default SLA | What happens |
|---|---|---|---|
| Task code wedged past its declared `execution_timeout_seconds` | Agent itself (#194) | `execution_timeout_seconds` (per-task) | TI failed with `execution_timeout: task exceeded N`. Retries kick in if budget remains. |
| Agent process crashed mid-task (TI in `running`, no heartbeat) | TI heartbeat reaper (#128) | 90 s | TI failed with `agent_lost`. Retries kick in if budget remains. |
| Scheduler crashed before dispatching (TI stuck in `queued`) | Dispatch-lost reaper (#202) | 3 min | TI failed with `dispatch_lost`. Frees the run for the orphan reaper on the next tick. |
| Run stuck `running` with no active TIs (post-crash limbo) | Orphan-run reaper (#120) | 5 min | Run failed with `orphaned`; any remaining active TIs flipped to failed. |
Worst case end-to-end: a mid-tick scheduler crash that leaves TIs queued is fully reaped within max(3 min, 5 min) = 5 min. The dispatch-lost reaper runs first, then the orphan-run reaper picks up the run, which by then has no active TIs, on the next tick.
Tuning the thresholds
Defaults are conservative. For tighter recovery on a fast-failing workload, override via the scheduler interfaces:
```go
sched.SetAgentLostThreshold(30 * time.Second)
sched.SetDispatchLostThreshold(1 * time.Minute)
sched.SetOrphanThreshold(2 * time.Minute)
```
For most users the defaults are correct: thresholds that are too tight risk reaping a legitimately slow dispatch (for example, Kubernetes pod-pull latency under contention) or a busy agent.
The "do no harm" rule
Each reaper requires a positive observable signal before failing anything:
- TI heartbeat reaper: only fires on TIs that did heartbeat at least once and then went silent. A TI that never heartbeated (e.g. an inline `http_api` task with no agent) is left alone.
- Dispatch-lost reaper: requires a non-zero `queued_at` older than the threshold. A TI without that stamp is too poorly observed to reap.
- Orphan-run reaper: requires `state = 'running'` AND no active TI on the run. A run with any TI in `scheduled`/`queued`/`running` is left alone (the dispatch-lost reaper unblocks this case by failing the stuck queued TIs first, so the next tick sees no active TIs).
This rule is the load-bearing invariant: the reapers can never kill a live execution (ADR 0031). The cost is that recovery is bounded by the slowest reaper that applies, not the fastest. That is usually fine and sometimes worth tuning.
Observability
Each reap action is metered as a scheduler decision. Watch these labels in your Prometheus dashboard:
| Metric label | Meaning |
|---|---|
| `agent_lost` | TI failed by the heartbeat reaper |
| `dispatch_lost` | TI failed by the dispatch-lost reaper |
| `orphan_reaped` | Run failed by the orphan-run reaper |
| `agent_lost_list_error`, `dispatch_lost_list_error`, `orphan_list_error` | Reaper's list query failed; next tick will retry |
A sustained non-zero rate on any of these is worth investigating: reapers are backstops, not the primary path; if they fire often, something upstream is broken.
What's NOT a scheduler concern
- Postgres unreachable: the scheduler's `Heartbeat()` goes unhealthy; the `/monitor/health` endpoint surfaces it; runs queue up and resume when the DB returns.
- Agent's task container OOM-killed: surfaces as a non-zero exit code through the agent (if it survived) or as `agent_lost` (if the agent went with it).
- K8s API outage: pods stay where they are; new dispatches fail at the executor layer (visible as the `dispatch_failed` metric on the BufferedDispatcher); the dispatch-lost reaper does NOT fire (`queued_at` is fresh; the issue is the API, not the scheduler).
See also: ADR 0009 (leader election), ADR 0031 (scheduler architecture).