ADR 0010: Observability Stack from Day One
Status: Accepted
Date: 2026-05-21
Context
One of Leoflow's stated differentiators is native observability, which Airflow famously lacks. That claim must be backed by a working implementation from the first commit, not bolted on later.
Decision
Every Leoflow binary ships with three pillars of observability built in:
- Metrics: Prometheus, exposed at `/metrics`.
- Tracing: OpenTelemetry, exporting via OTLP to any compatible collector (Jaeger, Tempo, Honeycomb, Datadog).
- Logs: structured JSON via `log/slog`, with consistent fields (`trace_id`, `span_id`, `dag_id`, `task_id`, `run_id`).
All three are wired together via OpenTelemetry context propagation, so a log line, a trace span, and a metric label can be correlated.
Required Metrics for the MVP
The following metrics must exist before the MVP is considered complete. They are the contract Leoflow makes with operators.
Scheduler
| Metric | Type | Labels |
|---|---|---|
| `leoflow_scheduler_loop_duration_seconds` | Histogram | (none) |
| `leoflow_scheduler_decisions_total` | Counter | `decision_type` (schedule/skip/defer) |
| `leoflow_scheduler_leader` | Gauge | `replica_id` |
| `leoflow_active_dag_runs` | Gauge | `dag_id`, `state` |
| `leoflow_queued_tasks` | Gauge | `dag_id` |
Task Lifecycle
| Metric | Type | Labels |
|---|---|---|
| `leoflow_task_state_transitions_total` | Counter | `from_state`, `to_state`, `dag_id` |
| `leoflow_task_duration_seconds` | Histogram | `dag_id`, `task_id`, `task_type` |
| `leoflow_task_retries_total` | Counter | `dag_id`, `task_id` |
| `leoflow_task_pod_creation_duration_seconds` | Histogram | (none) |
| `leoflow_task_cold_start_seconds` | Histogram | `dag_id` |
XCom
| Metric | Type | Labels |
|---|---|---|
| `leoflow_xcom_size_bytes` | Histogram | `dag_id` |
| `leoflow_xcom_push_total` | Counter | `dag_id` |
| `leoflow_xcom_pull_total` | Counter | `dag_id` |
| `leoflow_xcom_rejected_total` | Counter | `reason` (too_large/schema_mismatch/expired) |
API
| Metric | Type | Labels |
|---|---|---|
| `leoflow_http_requests_total` | Counter | `method`, `path`, `status` |
| `leoflow_http_request_duration_seconds` | Histogram | `method`, `path` |
| `leoflow_auth_failures_total` | Counter | `reason` |
Executor (K8s)
| Metric | Type | Labels |
|---|---|---|
| `leoflow_pods_created_total` | Counter | `dag_id`, `result` (success/error) |
| `leoflow_pods_running` | Gauge | (none) |
| `leoflow_pod_pending_duration_seconds` | Histogram | (none) |
| `leoflow_kubernetes_api_calls_total` | Counter | `operation`, `result` |
Tracing
Every task instance gets a root span with these attributes:
- `leoflow.dag_id`
- `leoflow.task_id`
- `leoflow.run_id`
- `leoflow.try_number`
Child spans:
- `scheduler.decision`
- `executor.create_pod`
- `agent.fetch_xcom`
- `agent.execute_user_code`
- `agent.push_xcom`
The trace continues across the gRPC boundary between the Control Plane and the Agent via standard OTel context propagation.
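To make the cross-boundary propagation concrete, the sketch below shows what travels in the request metadata, assuming the W3C Trace Context propagator is the one configured (the ADR does not name a propagator). It hand-rolls the `traceparent` header with the standard library purely for illustration; a real build would rely on the OpenTelemetry SDK and its gRPC instrumentation instead. `SpanContext`, `Traceparent`, and `ParseTraceparent` are hypothetical names, not Leoflow or OTel APIs.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// SpanContext is a hypothetical, minimal stand-in for the identifiers
// that must cross the Control Plane -> Agent boundary.
type SpanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Sampled bool
}

// Traceparent renders the W3C header value: version-traceid-parentid-flags.
func (sc SpanContext) Traceparent() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return "00-" + hex.EncodeToString(sc.TraceID[:]) + "-" +
		hex.EncodeToString(sc.SpanID[:]) + "-" + flags
}

// ParseTraceparent validates and decodes a version-00 traceparent value.
func ParseTraceparent(s string) (SpanContext, error) {
	var sc SpanContext
	parts := strings.Split(s, "-")
	if len(parts) != 4 || parts[0] != "00" || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 || s != strings.ToLower(s) {
		return sc, fmt.Errorf("malformed traceparent %q", s)
	}
	if _, err := hex.Decode(sc.TraceID[:], []byte(parts[1])); err != nil {
		return SpanContext{}, fmt.Errorf("bad trace id: %w", err)
	}
	if _, err := hex.Decode(sc.SpanID[:], []byte(parts[2])); err != nil {
		return SpanContext{}, fmt.Errorf("bad span id: %w", err)
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return SpanContext{}, fmt.Errorf("bad flags: %w", err)
	}
	// All-zero identifiers are invalid per the specification.
	if sc.TraceID == ([16]byte{}) || sc.SpanID == ([8]byte{}) {
		return SpanContext{}, fmt.Errorf("all-zero id in traceparent %q", s)
	}
	sc.Sampled = flags[0]&0x01 == 0x01
	return sc, nil
}

func main() {
	// Control Plane side: serialize the active span context into metadata.
	parent, err := ParseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	metadata := map[string]string{"traceparent": parent.Traceparent()}

	// Agent side: extract it, so child spans join the same trace.
	remote, err := ParseTraceparent(metadata["traceparent"])
	if err != nil {
		panic(err)
	}
	fmt.Println("same trace:", remote.TraceID == parent.TraceID)
}
```

Because the Agent reuses the trace ID it received, spans such as `agent.execute_user_code` end up in the same trace as `scheduler.decision`.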
Logs
Every log line is JSON, written to stdout. Common fields:
{
"time": "2026-05-21T14:23:11.482Z",
"level": "INFO",
"msg": "task transitioned to RUNNING",
"trace_id": "abc123...",
"span_id": "def456...",
"dag_id": "etl_vendas",
"task_id": "extract",
"run_id": "scheduled__2026-05-21",
"try_number": 1,
"tenant_id": "default"
}
No human-readable formatters in production builds. JSON only. Operators use jq or log aggregators.
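One way to get the shared fields onto every line is to decorate the `log/slog` JSON handler so that it reads them from the request context. The following is a minimal sketch under that assumption, not the project's actual logger: `TaskFields`, `WithTaskFields`, `fieldsHandler`, and `NewLogger` are hypothetical names, and in a real build `trace_id` and `span_id` would presumably come from the active OpenTelemetry span rather than a hand-filled struct.

```go
package main

import (
	"context"
	"io"
	"log/slog"
	"os"
)

// TaskFields is a hypothetical carrier for the correlation fields.
type TaskFields struct {
	TraceID, SpanID, DagID, TaskID, RunID string
}

type fieldsKey struct{} // unexported context key, avoids collisions

// WithTaskFields returns a context that carries f.
func WithTaskFields(ctx context.Context, f TaskFields) context.Context {
	return context.WithValue(ctx, fieldsKey{}, f)
}

// correlationAttrs maps the fields to the JSON keys used in the log schema.
func correlationAttrs(f TaskFields) []slog.Attr {
	return []slog.Attr{
		slog.String("trace_id", f.TraceID),
		slog.String("span_id", f.SpanID),
		slog.String("dag_id", f.DagID),
		slog.String("task_id", f.TaskID),
		slog.String("run_id", f.RunID),
	}
}

// fieldsHandler decorates any slog.Handler with context-derived attributes.
type fieldsHandler struct{ slog.Handler }

func (h fieldsHandler) Handle(ctx context.Context, r slog.Record) error {
	if f, ok := ctx.Value(fieldsKey{}).(TaskFields); ok {
		r = r.Clone() // avoid mutating state shared with other record copies
		r.AddAttrs(correlationAttrs(f)...)
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap so derived loggers keep the decoration.
func (h fieldsHandler) WithAttrs(a []slog.Attr) slog.Handler {
	return fieldsHandler{h.Handler.WithAttrs(a)}
}

func (h fieldsHandler) WithGroup(name string) slog.Handler {
	return fieldsHandler{h.Handler.WithGroup(name)}
}

// NewLogger builds a JSON-only logger writing to w.
func NewLogger(w io.Writer) *slog.Logger {
	return slog.New(fieldsHandler{slog.NewJSONHandler(w, nil)})
}

func main() {
	ctx := WithTaskFields(context.Background(), TaskFields{
		TraceID: "abc123", SpanID: "def456",
		DagID: "etl_vendas", TaskID: "extract", RunID: "scheduled__2026-05-21",
	})
	NewLogger(os.Stdout).InfoContext(ctx, "task transitioned to RUNNING", "try_number", 1)
}
```

Call sites only need to pass the context (`InfoContext`, `ErrorContext`, and so on); the handler supplies the correlation fields, which keeps them consistent across packages.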
Health Checks
Every binary exposes:
- `/healthz`: liveness. Returns 200 if the process is alive.
- `/readyz`: readiness. Returns 200 only if dependencies (Postgres, Redis) are reachable.
K8s deployments use these for liveness and readiness probes.
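The two endpoints could be wired roughly as follows. This is a sketch under stated assumptions: the ADR only says readiness returns 200 when dependencies are reachable, so the 503 status for the failure case, the 2-second default timeout, and the names `Check`, `Probes`, and `stub` are all illustrative choices. The stubs stand in for real Postgres and Redis pings.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Check reports whether one dependency is reachable before ctx expires.
type Check func(ctx context.Context) error

// Probes holds the readiness checks, keyed by dependency name.
type Probes struct {
	Timeout time.Duration // budget for all checks; defaults to 2s (assumed)
	Checks  map[string]Check
}

// ReadyStatus returns 200 when every check passes and 503 otherwise.
func (p Probes) ReadyStatus() int {
	timeout := p.Timeout
	if timeout <= 0 {
		timeout = 2 * time.Second
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for _, check := range p.Checks {
		if err := check(ctx); err != nil {
			return http.StatusServiceUnavailable
		}
	}
	return http.StatusOK
}

// Mux registers both endpoints on a fresh ServeMux.
func (p Probes) Mux() *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: answering at all proves the process is alive.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: depends on the outcome of the dependency checks.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(p.ReadyStatus())
	})
	return mux
}

var errDown = errors.New("dependency unreachable")

// stub returns a Check with a fixed outcome, standing in for a real ping.
func stub(err error) Check {
	return func(context.Context) error { return err }
}

func main() {
	p := Probes{Checks: map[string]Check{
		"postgres": stub(nil),
		"redis":    stub(errDown),
	}}
	_ = p.Mux() // a real binary would pass this to http.ListenAndServe
	fmt.Println(p.ReadyStatus()) // prints 503 because the redis stub fails
}
```

Keeping `/healthz` free of dependency checks matters: if liveness failed whenever Postgres was down, Kubernetes would restart pods that a restart cannot fix.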
Consequences
- The dependency footprint grows. `client_golang` and `go.opentelemetry.io/otel` become mandatory dependencies, and `log/slog` (standard library) is the only permitted logger.
- Performance overhead is real but small. Metrics and traces are cheap when batched. The cost of logging at INFO level is negligible.
- The CI pipeline must validate that every new metric is registered with a description and that no metric explodes label cardinality.
- Operators get a Grafana dashboard template shipped with the project (in `helm/dashboards/`).
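The CI validation mentioned above could take the shape of a small lint over the declared metrics. The sketch below is one hypothetical form of it: the `MetricDef` type, the deny-list of unbounded label names, and the budget of three labels per metric are assumptions chosen for illustration (three is the largest label count in the tables of this ADR), not project policy.

```go
package main

import (
	"fmt"
	"strings"
)

// MetricDef is a hypothetical declaration of one metric.
type MetricDef struct {
	Name   string
	Help   string
	Labels []string
}

// unboundedLabels lists label names assumed to take unbounded values.
var unboundedLabels = map[string]bool{
	"run_id": true, "trace_id": true, "span_id": true, "pod_name": true,
}

// maxLabels is an assumed per-metric label budget.
const maxLabels = 3

// Lint returns one message per violated rule; an empty result means pass.
func Lint(defs []MetricDef) []string {
	var problems []string
	for _, d := range defs {
		if !strings.HasPrefix(d.Name, "leoflow_") {
			problems = append(problems, d.Name+": missing leoflow_ prefix")
		}
		if strings.TrimSpace(d.Help) == "" {
			problems = append(problems, d.Name+": missing description")
		}
		if len(d.Labels) > maxLabels {
			problems = append(problems,
				fmt.Sprintf("%s: %d labels exceeds budget of %d", d.Name, len(d.Labels), maxLabels))
		}
		for _, l := range d.Labels {
			if unboundedLabels[l] {
				problems = append(problems, d.Name+": unbounded label "+l)
			}
		}
	}
	return problems
}

func main() {
	defs := []MetricDef{
		{Name: "leoflow_task_retries_total", Help: "Task retries.", Labels: []string{"dag_id", "task_id"}},
		{Name: "leoflow_pods_running", Labels: []string{"run_id"}}, // deliberately broken
	}
	// A CI job would exit non-zero when any problem is reported.
	for _, p := range Lint(defs) {
		fmt.Println(p)
	}
}
```

A name-based deny-list only catches known offenders; measuring the actual series count per metric in a staging environment would be a complementary check.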
Alternatives Rejected
- Add observability later: rejected because retrofitting tracing across an existing codebase is enormously expensive.
- Only logs, no metrics or traces: rejected because logs alone cannot answer "is the system slow right now?".
- Custom metrics format: rejected because Prometheus is the industry standard.