ADR 0004: Thin Static Go Agent in the Worker Container¶
Status: Accepted Date: 2026-05-21
Context¶
The worker container must run two things:
- The user's task code (Python or Bash).
- A small piece of Leoflow code that communicates with the Control Plane, reports state and logs, and handles XCom.
The second piece is the agent. It needs a design.
Decision¶
The leoflow-agent is a statically linked Go binary, approximately 15 MB, that ships inside every official base image. It acts as the ENTRYPOINT of the container.
It does exactly this:
- Connects to the Control Plane over gRPC using credentials injected as environment variables.
- Receives the task spec (entrypoint, args, environment, XCom dependencies).
- Fetches required XCom values from Redis via the Control Plane.
- Spawns the user's process (
python -m,bash -c, etc.) with the injected environment. - Streams
stdoutandstderrback over gRPC (and optionally to a log sink). - Captures the return value and pushes it as XCom if declared.
- Reports final state and exits.
Rationale¶
- Static binary. No runtime dependencies. Drops into any base image. Works on glibc and musl alike.
- Tiny. 15 MB is negligible compared to a Python image. Does not affect pull time.
- Single-purpose. The agent is a runner, not a framework. There is no "leoflow framework" to import inside the user's Python.
- gRPC. Binary protocol, bidirectional streaming, much lower overhead than the HTTP+SQLAlchemy round-trips that Airflow does.
- Process model. The agent is
PID 1in the pod. It handles SIGTERM gracefully and propagates to the child process, allowing K8s to terminate cleanly.
Anti-Decisions¶
The agent does not:
- Parse DAG files. The DAG is already parsed; the agent receives a task spec.
- Have a CLI for users. Users never invoke
leoflow-agentdirectly. - Embed a Python interpreter. The base image provides Python; the agent spawns it.
- Cache anything beyond a single task execution. State lives in the Control Plane.
- Talk to the Postgres metadata database. All state goes through gRPC.
Consequences¶
- The gRPC protocol between Control Plane and Agent is a stable, versioned contract. Defined in
proto/agent.proto. - Authentication uses short-lived JWTs injected as environment variables at pod creation time. The agent presents this JWT on every gRPC call.
- The agent must handle Control Plane unavailability gracefully: retry with exponential backoff, then fail the task with a clear error after a deadline.
- The agent emits structured logs to
stdout, which the log shipper (Fluentbit sidecar or node-level agent) forwards. The agent itself does not push to S3/GCS โ that decoupling keeps it tiny.
Comparison to Airflow¶
Airflow's worker pod runs the full Airflow framework, which:
- Re-parses the DAG file on every task (CPU and memory cost).
- Imports every installed provider (potentially gigabytes of code).
- Maintains an SQLAlchemy session pool to Postgres.
- Uses Celery internals even in K8s mode.
Leoflow's agent does none of this. The container is the user's code plus a 15 MB sidecar, and nothing more.