ADR 0010: Observability Stack from Day One¶

Status: Accepted Date: 2026-05-21

Context¶

One of Leoflow's stated differentiators is native observability, which Airflow famously lacks. This must be backed by actual implementation from the first commit, not bolted on later.

Decision¶

Every Leoflow binary ships with three pillars of observability built in:

Metrics — Prometheus, exposed at /metrics.
Tracing — OpenTelemetry, exporting via OTLP to any compatible collector (Jaeger, Tempo, Honeycomb, Datadog).
Logs — Structured JSON via log/slog, with consistent fields (trace_id, span_id, dag_id, task_id, run_id).

All three are wired together via OpenTelemetry context propagation, so a log line, a trace span, and a metric label can be correlated.

Required Metrics for the MVP¶

The following metrics must exist before the MVP is considered complete. They are the contract Leoflow makes with operators.

Scheduler¶

Metric	Type	Labels
`leoflow_scheduler_loop_duration_seconds`	Histogram	—
`leoflow_scheduler_decisions_total`	Counter	`decision_type` (schedule/skip/defer)
`leoflow_scheduler_leader`	Gauge	`replica_id`
`leoflow_active_dag_runs`	Gauge	`dag_id`, `state`
`leoflow_queued_tasks`	Gauge	`dag_id`

Task Lifecycle¶

Metric	Type	Labels
`leoflow_task_state_transitions_total`	Counter	`from_state`, `to_state`, `dag_id`
`leoflow_task_duration_seconds`	Histogram	`dag_id`, `task_id`, `task_type`
`leoflow_task_retries_total`	Counter	`dag_id`, `task_id`
`leoflow_task_pod_creation_duration_seconds`	Histogram	—
`leoflow_task_cold_start_seconds`	Histogram	`dag_id`

XCom¶

Metric	Type	Labels
`leoflow_xcom_size_bytes`	Histogram	`dag_id`
`leoflow_xcom_push_total`	Counter	`dag_id`
`leoflow_xcom_pull_total`	Counter	`dag_id`
`leoflow_xcom_rejected_total`	Counter	`reason` (too_large/schema_mismatch/expired)

API¶

Metric	Type	Labels
`leoflow_http_requests_total`	Counter	`method`, `path`, `status`
`leoflow_http_request_duration_seconds`	Histogram	`method`, `path`
`leoflow_auth_failures_total`	Counter	`reason`

Executor (K8s)¶

Metric	Type	Labels
`leoflow_pods_created_total`	Counter	`dag_id`, `result` (success/error)
`leoflow_pods_running`	Gauge	—
`leoflow_pod_pending_duration_seconds`	Histogram	—
`leoflow_kubernetes_api_calls_total`	Counter	`operation`, `result`

Tracing¶

Every task instance gets a root span with these attributes:

leoflow.dag_id
leoflow.task_id
leoflow.run_id
leoflow.try_number

Child spans:

scheduler.decision
executor.create_pod
agent.fetch_xcom
agent.execute_user_code
agent.push_xcom

The trace continues across the gRPC boundary between the Control Plane and the Agent via standard OTel context propagation.

Logs¶

Every log line is JSON, written to stdout. Common fields:

{
  "time": "2026-05-21T14:23:11.482Z",
  "level": "INFO",
  "msg": "task transitioned to RUNNING",
  "trace_id": "abc123...",
  "span_id": "def456...",
  "dag_id": "etl_vendas",
  "task_id": "extract",
  "run_id": "scheduled__2026-05-21",
  "try_number": 1,
  "tenant_id": "default"
}

No human-readable formatters in production builds. JSON only. Operators use jq or log aggregators.

Health Checks¶

Every binary exposes:

/healthz — liveness. Returns 200 if the process is alive.
/readyz — readiness. Returns 200 only if dependencies (Postgres, Redis) are reachable.

K8s deployments use these for liveness and readiness probes.

Consequences¶

The dependency footprint grows. client_golang, go.opentelemetry.io/otel, and slog are mandatory.
Performance overhead is real but small. Metrics and traces are cheap when batched. Logs at INFO level are negligible.
The CI pipeline must validate that every new metric is registered with a description and that no metric explodes label cardinality.
Operators get a Grafana dashboard template shipped with the project (in helm/dashboards/).

Alternatives Rejected¶

Add observability later: rejected because retrofitting tracing across an existing codebase is enormously expensive.
Only logs, no metrics or traces: rejected because logs alone cannot answer "is the system slow right now?".
Custom metrics format: rejected because Prometheus is the industry standard.