ADR 0022: Ephemeral Per-DAG-Run Staging Volume
Status: Accepted
Date: 2026-05-23
Deciders: Project founder
Context
XCom is for passing small typed values between tasks (≤256 KB, Redis-backed); it is not data storage. File-heavy pipelines need to stage large intermediate data between a run's tasks (extract writes GBs, transform reads them). The options considered (see issue #55 and the chat discussion):
- A global /tmp volume mounted on every pod. Rejected: shared mutable state across all runs/DAGs, no isolation, no GC story, contradicts the stateless-pod model (and ADR's "no shared filesystem").
- Object storage (S3/GCS) via a cloud connection. The right answer for durable / cross-run / cross-DAG data, but it needs a cloud account plus the Connections runtime wiring (ADR 0021 / #54), which we do not have yet. MinIO is no longer a comfortable bundled default.
- An ephemeral, per-DAG-run shared volume, managed by Leoflow. Fills the gap for in-run scratch with no cloud dependency.
How Airflow solves this in 3.x: it offers airflow.io.ObjectStoragePath (fsspec + a Connection) for object storage, and a custom XCom backend to offload large values, but it has no first-class ephemeral shared volume; PVCs are wired manually via pod templates with no managed lifecycle or isolation. So a managed per-run volume is a Leoflow value-add.
A hard requirement surfaced in design: a re-run must not lose the temporary data produced by already-successful upstream tasks, or the re-executed task fails. The volume's lifecycle must therefore be tied to the DAG run, not to a task pod, and survive retries and clear+re-run.
Decision
Add an opt-in, Leoflow-managed ReadWriteMany volume scoped to each DAG run.
- Opt-in, declared in leoflow.yaml through the staging.enabled setting (compiled into the immutable dag.json).
- One PVC per run, deterministically named leoflow-staging-<dag_id>-<run_id>, mounted at /staging in every task pod of that run, exposed as LEOFLOW_STAGING_DIR. The deterministic name is what lets a clear+re-run re-attach the same PVC.
- Lifecycle tied to the DAG run, not the pod:
  - Created when the run leaves queued.
  - Persists across task retries and clear+re-run: a re-run re-attaches the same PVC, so upstream outputs are still present and the task does not fail for missing data.
  - Garbage-collected only after the run reaches a terminal state and a 24-hour TTL elapses (so a re-run shortly after a failure still finds the data). A reconciler sweeps orphaned PVCs, mirroring the existing pod reconciler.
- Requires an RWX StorageClass. If staging.enabled is set and the cluster has no RWX class, fail fast with a clear error at dispatch; never silently degrade.
- Isolation, not transactions. "Atomic per DAG" means the volume is a single isolated unit per run; it is a shared filesystem, not ACID. Tasks are responsible for atomic writes (write-temp-then-rename). This is documented.
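The write-temp-then-rename pattern that tasks are responsible for can be sketched in Python as follows. This is a minimal illustration, not Leoflow's API: the helper names staging_dir and atomic_write_bytes are hypothetical, and the only things taken from this ADR are the LEOFLOW_STAGING_DIR variable and the /staging mount path. Whether a given RWX backend honours rename atomicity is worth verifying for the storage class in use.

```python
import os
import tempfile
from pathlib import Path


def staging_dir() -> Path:
    """Return the run's staging directory (hypothetical helper).

    Reads LEOFLOW_STAGING_DIR and falls back to the /staging mount path.
    """
    return Path(os.environ.get("LEOFLOW_STAGING_DIR", "/staging"))


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write data so that readers never observe a partially written file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target's directory so that the final rename
    # stays on one filesystem; a cross-filesystem rename is not atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=f".{target.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush bytes to storage before publishing
        os.replace(tmp_name, target)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

An extract task might call atomic_write_bytes(staging_dir() / "extract" / "part-0.bin", payload); a downstream transform task then sees either the complete file or no file at all.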
Default is off: XCom for small handoffs, object storage for durable data.
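As a rough illustration of the per-run PVC described in this decision, the claim could be built as a plain Kubernetes manifest dictionary. The function names, the storage_class argument, and the 10Gi default size are assumptions made for this sketch; the ADR specifies only the name pattern and the ReadWriteMany access mode. The sketch also assumes dag_id and run_id are already valid inside a Kubernetes object name and does not sanitize them.

```python
def staging_pvc_name(dag_id: str, run_id: str) -> str:
    """Deterministic PVC name: leoflow-staging-<dag_id>-<run_id>."""
    # Assumption: both ids are already legal in a Kubernetes object name;
    # this sketch neither sanitizes nor truncates them.
    return f"leoflow-staging-{dag_id}-{run_id}"


def staging_pvc_manifest(
    dag_id: str, run_id: str, storage_class: str, size: str = "10Gi"
) -> dict:
    """Build a PersistentVolumeClaim manifest for one DAG run (sketch)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": staging_pvc_name(dag_id, run_id)},
        "spec": {
            # RWX so every task pod of the run can mount the same volume.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,  # hypothetical input
            "resources": {"requests": {"storage": size}},  # size is assumed
        },
    }
```

Because the name depends only on dag_id and run_id, a clear+re-run of the same run computes the same name and can re-attach the existing claim instead of creating a new one.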
Consequences
- File-heavy pipelines get real in-run staging with no cloud account, isolated per run, with a clear managed lifecycle, closing the gap Airflow leaves to manual pod-template wiring.
- Re-runs are safe: the per-run PVC and the 24h post-terminal TTL guarantee upstream temp data survives retries and clears.
- New operational surface: PVC provisioning latency/cost, a GC reconciler, and an RWX StorageClass dependency (not universal). Co-locating a run's pods may constrain K8s scheduling.
- Two data paths coexist by design: ephemeral per-run (this ADR) and durable object storage (future, via Connections). The volume is not for cross-run or cross-DAG data.
Alternatives considered
- Global /tmp RWX volume. Rejected: shared mutable state, no isolation, no GC, contradicts stateless pods.
- Object storage only (S3/GCS via connection). The durable answer and still recommended for persistent/cross-run data, but requires a cloud account and the Connections runtime wiring (#54); does not cover the no-cloud, in-run scratch case. Tracked separately.
- Pod-template PVC (Airflow style). Rejected as the default: no managed lifecycle, isolation, or re-run safety, which are exactly the gaps this ADR closes.
- Delete the volume on run terminal (no TTL). Rejected: a clear+re-run right after a failure would lose the data; the 24h TTL is the safety margin.
Amendment (2026-05-24): success frees immediately; DB-tracked lifecycle
Two refinements after exercising staging in dev, agreed by the project founder. Both align the implementation with this ADR's stated rationale rather than reversing it.
- The TTL is the failure-recovery margin, so it applies only to failed runs. This ADR justified the 24h TTL entirely by the failure case ("so a re-run shortly after a failure still finds the data"; the "delete on terminal" alternative was rejected because a clear+re-run after a failure loses data). A successful run has no such need (re-running an already-successful run is rare), so its volume is now freed immediately at run end. A failed run keeps its volume for the 24h post-terminal TTL (measured from the run's terminal time, not PVC creation). Within-run retries are unaffected (the volume lives for the whole run). The accepted trade-off, documented for users: a clear+re-run of an already-succeeded run, or of a failed run after its TTL has expired, re-runs all tasks.
- The volume lifecycle is tracked in the metadatabase (staging_volumes table) as the source of truth, instead of being inferred from Kubernetes object state. Each volume has state (active → deleted), created_at, deleted_at, and delete_reason (run_succeeded | ttl_expired | orphaned), so the lifecycle is auditable: when and why each disk was reclaimed. The GC reads the active set joined with each run's state and reclaims accordingly (succeeded now, failed after TTL, orphaned on sight), recording the reason. The PVC is identified by its deterministic name (leoflow-staging-<dag_id>-<run_id>), unique per namespace.
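The reclaim rules in this amendment (succeeded now, failed after TTL, orphaned on sight) can be sketched as a pure decision function. This is an illustration under stated assumptions rather than the actual reconciler: the ActiveVolume record, the run-state strings "succeeded" and "failed", and the reading of "orphaned" as "no matching run row" are all hypothetical. Only the three delete_reason values and the 24h TTL measured from the run's terminal time come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

FAILED_RUN_TTL = timedelta(hours=24)  # post-terminal TTL for failed runs


@dataclass(frozen=True)
class ActiveVolume:
    """One active staging volume joined with its run (hypothetical shape)."""
    pvc_name: str
    run_state: Optional[str]              # None: no matching run (assumed orphan)
    run_terminal_at: Optional[datetime]   # when the run reached a terminal state


def delete_reason(vol: ActiveVolume, now: datetime) -> Optional[str]:
    """Return the delete_reason to record, or None to keep the volume."""
    if vol.run_state is None:
        return "orphaned"            # reclaimed on sight
    if vol.run_state == "succeeded":
        return "run_succeeded"       # freed immediately at run end
    if vol.run_state == "failed" and vol.run_terminal_at is not None:
        # TTL counts from the run's terminal time, not from PVC creation.
        if now - vol.run_terminal_at >= FAILED_RUN_TTL:
            return "ttl_expired"
    return None                      # still running, or failed within TTL


now = datetime(2026, 5, 24, 12, 0, tzinfo=timezone.utc)
recent_failure = ActiveVolume("leoflow-staging-etl-r1", "failed", now - timedelta(hours=2))
old_failure = ActiveVolume("leoflow-staging-etl-r2", "failed", now - timedelta(hours=25))
```

Here delete_reason(recent_failure, now) keeps the volume so a clear+re-run still finds upstream data, while delete_reason(old_failure, now) marks it for reclamation. Keeping the decision free of Kubernetes calls makes it straightforward to unit-test and to log alongside the recorded reason.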