Run a real Dagger engine in the agentloop agent pods (drop the engine-less skip) #538

Closed
opened 2026-06-08 06:02:42 +00:00 by guettlibot · 1 comment
guettlibot commented 2026-06-08 06:02:42 +00:00 (Migrated from codeberg.org)

Goal

Run the Dagger-backed checks (dart-check → dagger call ... check-fast, and the other dagger call tasks) for real inside the agentloop agent pods, instead of skipping them. Today the agent commits in an engine-less pod, so the dart-check pre-commit hook either fails hard with:

start engine: driver for scheme "image" was not available

or is silently skipped by scripts/precommit_dart_check.sh. The skip is a fallback we want to remove — the agent should get the same validation a developer or CI gets, locally, before it pushes.

What is blocking (diagnosis)

Dagger needs an engine to talk to. The CLI finds one in exactly one of two ways: provision it from a local container runtime (docker/podman/nerdctl), or connect to an existing one via _EXPERIMENTAL_DAGGER_RUNNER_HOST. In the agentloop pod neither exists:

  • No container runtime in the pod. command -v docker / podman → nothing. The node runs containerd (k3s), but there is no Docker socket mounted and no rootless runtime in the image.
  • _EXPERIMENTAL_DAGGER_RUNNER_HOST is unset, so the CLI falls back to its default engine-image reference and aborts with the driver for scheme "image" error.
  • The pod is unprivileged. securityContext is just runAsUser: 1000 / fsGroup: 1000. The Dagger engine (buildkit) needs privileged: true (CAP_SYS_ADMIN, mounts), so it can't run inside the agent container — it has to be a separate, privileged container.
  • Version skew. The CLI is pinned to 0.21.4 (flake.nix override of the dagger/nix 0.20.8), but ci/dagger.json declares engineVersion: v0.20.8. Whatever engine we stand up must match the CLI's minor version (0.21); here that means v0.21.4.

So nothing is wrong with the Dagger code — the execution context simply has no engine and no way to reach one.
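The diagnosis above can be reproduced with a small check. This is a minimal sketch, not part of the repo: the function name diagnose_dagger_engine is hypothetical, and it only inspects PATH and the _EXPERIMENTAL_DAGGER_RUNNER_HOST variable named in this issue. It does not verify that a configured engine is actually reachable.

```shell
#!/usr/bin/env bash
# Sketch: report which of the two engine-discovery paths is available.
# diagnose_dagger_engine is a hypothetical helper name.

diagnose_dagger_engine() {
  local found_runtime=""
  local rt
  # Path 1: a local container runtime the CLI could provision an engine from.
  for rt in docker podman nerdctl; do
    if command -v "$rt" >/dev/null 2>&1; then
      found_runtime="$rt"
      break
    fi
  done

  # Path 2: an existing engine announced through the runner-host variable.
  if [ -n "${_EXPERIMENTAL_DAGGER_RUNNER_HOST:-}" ]; then
    echo "engine: remote via ${_EXPERIMENTAL_DAGGER_RUNNER_HOST}"
    return 0
  fi
  if [ -n "$found_runtime" ]; then
    echo "engine: provisionable via ${found_runtime}"
    return 0
  fi
  echo "engine: none (no container runtime, _EXPERIMENTAL_DAGGER_RUNNER_HOST unset)"
  return 1
}
```

Run inside the current agentloop pod, this would be expected to take the last branch and return 1, matching the two missing prerequisites listed above.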

Plan

Primary approach: Dagger engine sidecar in the agent pod

agentloop passes the operator-supplied PodSpec through verbatim (internal/k8s/job.go, dispatcher.go), and the inline daemon's spec lives in the gitops manifests. So we add the engine at the manifest level — no agentloop code change.

  1. Add a privileged dagger-engine sidecar to the agent execution pod (the agentloop Deployment today; the worker-Pod template once k8s worker mode is in use):
    • image registry.dagger.io/engine:v0.21.4 (match the CLI),
    • securityContext.privileged: true,
    • a shared emptyDir mounted at the engine socket dir in both the engine and agent containers,
    • a persistent cache volume for /var/lib/dagger (hostPath on this single node, or a PVC) so the buildkit cache survives pod restarts and check-fast isn't cold every time.
  2. Point the agent container at it: set _EXPERIMENTAL_DAGGER_RUNNER_HOST=unix:///run/dagger/engine.sock (the shared socket) on the agentloop/worker container env.
  3. Confirm the namespace allows privileged pods (no PodSecurity restricted enforcement on agentloop — verify before relying on it).
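To make the sidecar shape concrete, here is a hedged sketch that prints a pod-spec fragment for steps 1 and 2. The image, the runner-host URL, and /var/lib/dagger come from the plan above. Everything else is an assumption for illustration: the function name render_engine_sidecar, the agent container name agentloop, the volume names, and the hostPath /var/lib/dagger-cache. It also assumes the engine creates its socket under /run/dagger, as the runner-host URL implies; that, and the socket permissions for the non-root agent user, should be verified against the real engine image.

```shell
#!/usr/bin/env bash
# Sketch: emit a pod-spec fragment with a privileged Dagger engine sidecar.
# Container, volume, and hostPath names are hypothetical placeholders.

render_engine_sidecar() {
  # Engine tag to pin; defaults to the CLI version named in this issue.
  local engine_version="${1:-v0.21.4}"
  cat <<EOF
spec:
  containers:
    - name: agentloop
      env:
        - name: _EXPERIMENTAL_DAGGER_RUNNER_HOST
          value: unix:///run/dagger/engine.sock
      volumeMounts:
        - name: dagger-socket
          mountPath: /run/dagger
    - name: dagger-engine
      image: registry.dagger.io/engine:${engine_version}
      securityContext:
        privileged: true
      volumeMounts:
        - name: dagger-socket
          mountPath: /run/dagger
        - name: dagger-cache
          mountPath: /var/lib/dagger
  volumes:
    - name: dagger-socket
      emptyDir: {}
    - name: dagger-cache
      hostPath:
        path: /var/lib/dagger-cache
        type: DirectoryOrCreate
EOF
}
```

Calling render_engine_sidecar v0.21.4 prints the fragment to stdout. In a Deployment it would sit under spec.template rather than at the top level, and a PVC could replace the hostPath volume if the single-node assumption stops holding.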

Alternative (if we'd rather not run a privileged sidecar per pod): a single shared, privileged dagger-engine Deployment with a cache PVC, reached via _EXPERIMENTAL_DAGGER_RUNNER_HOST=kube-pod://... (requires kubectl + RBAC in the agent image, which we don't have today) or tcp:// (unauthenticated — must be locked down with a NetworkPolicy). The sidecar is preferred because it's self-contained and needs no extra RBAC/networking.

sharedinbox-side changes (this repo)

  4. Align the engine version: make ci/dagger.json's engineVersion compatible with the running engine + CLI 0.21.4 (bump to v0.21.4, or re-pin the CLI; pick one and keep flake.nix, the engine image, and dagger.json in lockstep).
  5. Remove the skip fallback in scripts/precommit_dart_check.sh. Once an engine is reliably present, drop the silent exit 0. If we want a guard at all, invert it so a missing engine is a hard, loud failure in the agent context (so an engine regression is caught immediately) rather than a silent skip.
  6. Make sure the dev shell / hook does not clobber the injected _EXPERIMENTAL_DAGGER_RUNNER_HOST (cf. the recent "drop dead DAGGER_HOST export" cleanup); the pod env must win.
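An inverted guard could look roughly like the following. This is a sketch under assumptions, not the current contents of scripts/precommit_dart_check.sh: the function name require_dagger_engine and the error message are hypothetical. It only reads _EXPERIMENTAL_DAGGER_RUNNER_HOST and never assigns it, so a value injected by the pod env is left untouched.

```shell
#!/usr/bin/env bash
# Sketch: fail loudly when no engine can be found, instead of exiting 0.
# require_dagger_engine is a hypothetical helper name.

require_dagger_engine() {
  local rt
  # A runner host injected by the pod env wins; it is read, never overwritten.
  if [ -n "${_EXPERIMENTAL_DAGGER_RUNNER_HOST:-}" ]; then
    return 0
  fi
  # Otherwise accept a local runtime the CLI can provision an engine from.
  for rt in docker podman nerdctl; do
    if command -v "$rt" >/dev/null 2>&1; then
      return 0
    fi
  done
  echo "dart-check: no Dagger engine reachable; refusing to skip" >&2
  return 1
}
```

A hook could call `require_dagger_engine || exit 1` before invoking dagger call, so a missing engine blocks the commit with a visible message rather than passing silently.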

Cross-repo dependency

Steps 1–3 are infra and land in the agentloop deployment manifests (guettli/gitops + the agentloop runtime image / worker PodSpec template), not in this repo. This issue tracks the goal and the sharedinbox-side changes (4–6); the manifest work should be linked from here.

Acceptance criteria

  • A git commit made by the agent inside an agentloop pod runs dart-check and the Dagger engine actually executes check-fast — passing or failing on real results, with no driver for scheme "image" error and no "skipping dart-check" warning.
  • The other dagger call tasks (analyze, format-write, test-*, …) can run in the same context.
  • The engine version, CLI version, and ci/dagger.json engineVersion are mutually compatible.
  • The engine-less skip path is removed (or inverted to a hard failure).
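The version criterion can be checked mechanically. The sketch below treats "mutually compatible" as exact equality after stripping a leading "v", which may be stricter than Dagger actually requires. The function names are hypothetical, and the sed expression assumes engineVersion appears as a plain string field in ci/dagger.json, as quoted in this issue.

```shell
#!/usr/bin/env bash
# Sketch: compare CLI, engine image, and dagger.json versions for lockstep.
# All function names here are hypothetical helpers.

normalize_version() {
  # Strip one leading "v" so "v0.21.4" and "0.21.4" compare equal.
  printf '%s\n' "${1#v}"
}

engine_version_from_json() {
  # Extract the engineVersion string from a dagger.json-style file.
  sed -n 's/.*"engineVersion"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

versions_in_lockstep() {
  # Succeed only if every argument equals the first after normalization.
  local first other
  first="$(normalize_version "$1")"
  shift
  for other in "$@"; do
    if [ "$(normalize_version "$other")" != "$first" ]; then
      echo "version mismatch: ${first} vs $(normalize_version "$other")" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `versions_in_lockstep 0.21.4 v0.21.4 "$(engine_version_from_json ci/dagger.json)"` would fail today, because dagger.json still declares v0.20.8.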

Context

Surfaced while fixing the agentloop plan-commit flow (the plan bookkeeping commit was wrongly running this project's pre-commit hooks; fixed in guettli/agentloop#173 with --no-verify). That unblocked planning, but the underlying inability to run Dagger in-pod remains and is what this issue addresses.

guettli commented 2026-06-08 07:11:02 +00:00 (Migrated from codeberg.org)

No, I will make remote Dagger available in pods of agents.

Reference: guettli/sharedinbox#538