Agent loop, Concurrent planning #321

Closed
opened 2026-05-28 03:58:54 +00:00 by guettli · 2 comments
guettli commented 2026-05-28 03:58:54 +00:00 (Migrated from codeberg.org)

On hold, think about doing agent loop from scratch, with containers and CRDs.

The current loop runs one agent at a time (including planners). Making planners truly concurrent (e.g., while an implementation CI is pending) requires tracking multiple concurrent agent PIDs.

Create plan with several alternatives.

On hold, think about doing agent loop from scratch, with containers and CRDs. The current loop runs one agent at a time (including planners). Making planners truly concurrent (e.g., while an implementation CI is pending) requires tracking multiple concurrent agent PIDs. Create plan with several alternatives.
guettlibot commented 2026-05-28 04:31:46 +00:00 (Migrated from codeberg.org)

Implementation Plan: Concurrent Agent Loop

Current Architecture

The sequential bottleneck lives in three layers:

  1. internal/loop/loop.go:processRepo() (line 106) — iterates issues with a plain for loop; each call to handleIssue blocks until the agent finishes.
  2. internal/loop/loop.go:handleIssue() (lines 121–197) — one sequential pipeline: set InProgress → clone → run agent → push → create PR → comment → remove InProgress.
  3. internal/agent/claudecode.go:Run() (line 35) — calls cmd.Run() which blocks until the claude process exits (can take minutes).

The existing label-based state machine (State/ReadyState/InProgress) already provides cross-poll deduplication: once an issue is InProgress, ListOpenIssues(Ready) won't return it again. Concurrency therefore only requires making handleIssue non-blocking from the perspective of the outer loop.


Alternative A — Goroutines + Semaphore (minimal change)

Files to change: internal/loop/loop.go, internal/config/config.go

Approach:

  • Add max_concurrent int to config.Config (default: 3).
  • In Runner, add a semaphore: sem chan struct{} (buffered channel of size max_concurrent).
  • In processRepo(), replace the sequential for loop with goroutine launches guarded by the semaphore and a sync.WaitGroup.
// pseudo-code sketch
var wg sync.WaitGroup
for _, issue := range issues {
    r.sem <- struct{}{}   // acquire slot
    wg.Add(1)
    go func(issue platform.Issue) {
        defer wg.Done()
        defer func() { <-r.sem }()  // release slot
        r.handleIssue(plat, repo, issue)
    }(issue)
}
wg.Wait()

PID tracking (requested in issue): Add a sync.Map on Runner keyed by issueKey (e.g. "owner/repo#N") storing the OS PID of the running claude process. agent.ClaudeCode.Run() would need to return the PID or accept a callback; alternatively expose cmd.Process.Pid after cmd.Start().

Advantages: ~30 lines of change, no new state, no new labels.

Risks / open questions:

  • On process restart, any issue stuck in State/InProgress will stay there indefinitely — needs human intervention or a startup cleanup sweep.
  • processRepo is called on every poll tick (potentially while the previous goroutines are still running), so the semaphore must be on Runner, not local to processRepo, to correctly bound total concurrency across overlapping polls. Currently poll() itself is synchronous but that too would need protection.

Alternative B — Persistent State Files + PID Tracking (crash-safe)

Files to change: internal/loop/loop.go, internal/config/config.go; new file: internal/state/state.go

Approach:

  • Introduce a JSON state file per issue at <work_dir>/state/<platform>/<owner>/<repo>/issue-<N>.json:
{ "status": "running", "pid": 12345, "pr_url": "", "started_at": "2026-05-28T10:00:00Z" }
  • Before starting a new agent for an issue, check if a state file exists and whether the PID is still alive (kill(pid, 0) / reading /proc/<pid>).
    • If alive → skip (already running).
    • If dead and status=running → agent crashed; transition issue to State/Question and clean up.
  • On successful PR creation, update status to pr_created and record pr_url.
  • Launch agents in goroutines as in Alternative A, but with the state file as the recovery anchor.

Advantages:

  • Survives process restarts; orphaned State/InProgress issues are self-healing.
  • PID tracking is explicit and auditable.

Risks / open questions:

  • PID reuse: if agentloop restarts and the OS has recycled the old PID, a different process might appear alive. Mitigate by also checking process start time or writing a lockfile the child process holds open.
  • State file and label may drift out of sync (state=running but label was manually changed). Treat the label as authoritative; state file is advisory.

Alternative C — CI-Waiting Label (enables true planner/implementation split)

Files to change: internal/loop/loop.go, internal/config/config.go, internal/platform/platform.go (new method), internal/platform/forgejo.go, internal/platform/github.go

Approach:

  • Add a new label State/WaitingCI to the config and state machine.
  • After pushing the branch and creating the PR, instead of removing State/InProgress, transition the issue to State/WaitingCI.
  • Add a new poll step checkWaitingCI() that:
    1. Fetches all open issues with State/WaitingCI.
    2. For each, queries CI status via the platform API (Forgejo: GET /repos/{owner}/{repo}/statuses/{sha}; GitHub: Checks API).
    3. If all checks pass → close the issue (or transition to a State/Done label).
    4. If any check fails → transition to State/Question and comment with the failure.
  • The planner (initial agent run) can then proceed concurrently with WaitingCI issues, since those issues are no longer blocking a slot.

Platform interface addition needed:

// GetCIStatus returns the combined CI status for the given commit SHA.
GetCIStatus(sha string) (CIStatus, error)
// GetPRHeadSHA returns the HEAD SHA of a PR by its URL or number.
GetPRHeadSHA(prURL string) (string, error)

Advantages:

  • Cleanest separation of concerns; the agent loop truly becomes non-blocking for CI wait time.
  • Enables future distinction between a "planning" agent (fast, produces a spec) and an "implementation" agent (slow, produces code).

Risks / open questions:

  • Requires additional platform API calls; rate-limit budget increases.
  • CI status APIs differ significantly between Forgejo and GitHub — needs careful abstraction.
  • PRs with no CI configured would hang in WaitingCI forever — need a timeout or a config option to skip CI gating.

Alternative D — Worker Pool (most scalable)

Files to change: internal/loop/loop.go, cmd/agentloop/main.go, internal/config/config.go; new file: internal/worker/pool.go

Approach:

  • Create a Pool with N worker goroutines (buffered job channel).
  • Runner.poll() enqueues issues onto the job channel instead of calling handleIssue directly.
  • A running-set (sync.Map) prevents enqueuing the same issue twice across poll cycles.
  • Workers call handleIssue and update the running-set on completion.

Advantages:

  • Naturally bounds total goroutine count; no semaphore needed.
  • Clean separation between dispatch (poll) and execution (worker).
  • Easy to introspect the queue depth.

Risks / open questions:

  • If the job channel is full (all workers busy), poll() must either drop the issue (fine, next poll will retry) or block (risk of poll stacking). Recommend a non-blocking send with a drop-and-log.
  • More code than Alternatives A/B; may be premature for current scale.

Recommended Path

Step Alternative Effort
1. Unblock the loop A (goroutines + semaphore) small
2. Add crash recovery B (state files + PID tracking) medium
3. Add CI-waiting C (WaitingCI label) medium

Alternatives A and B are independent of C and can be shipped first. Alternative C is the feature that makes planners "truly concurrent while CI is pending" as the issue describes — it is likely the primary goal here.


Open Questions

  1. What is a "planner" specifically? Is it a distinct agent role (e.g. first pass produces a design, second pass implements it), or just the initial handleIssue agent run? This affects whether we need a separate label/phase.
  2. Max concurrency limit: Should this be per-repo or global? Default value?
  3. CI gating scope: Should CI status be checked before merging (merge bot) or just before closing the issue? Currently agentloop does not auto-merge PRs.
  4. Restart behavior: Should agentloop automatically retry issues that were InProgress at startup, or require human triage?
## Implementation Plan: Concurrent Agent Loop ### Current Architecture The sequential bottleneck lives in three layers: 1. **`internal/loop/loop.go:processRepo()`** (line 106) — iterates issues with a plain `for` loop; each call to `handleIssue` blocks until the agent finishes. 2. **`internal/loop/loop.go:handleIssue()`** (lines 121–197) — one sequential pipeline: set InProgress → clone → run agent → push → create PR → comment → remove InProgress. 3. **`internal/agent/claudecode.go:Run()`** (line 35) — calls `cmd.Run()` which blocks until the `claude` process exits (can take minutes). The existing label-based state machine (`State/Ready` → `State/InProgress`) already provides cross-poll deduplication: once an issue is `InProgress`, `ListOpenIssues(Ready)` won't return it again. Concurrency therefore only requires making `handleIssue` non-blocking from the perspective of the outer loop. --- ### Alternative A — Goroutines + Semaphore (minimal change) **Files to change:** `internal/loop/loop.go`, `internal/config/config.go` **Approach:** - Add `max_concurrent int` to `config.Config` (default: 3). - In `Runner`, add a semaphore: `sem chan struct{}` (buffered channel of size `max_concurrent`). - In `processRepo()`, replace the sequential `for` loop with goroutine launches guarded by the semaphore and a `sync.WaitGroup`. ```go // pseudo-code sketch var wg sync.WaitGroup for _, issue := range issues { r.sem <- struct{}{} // acquire slot wg.Add(1) go func(issue platform.Issue) { defer wg.Done() defer func() { <-r.sem }() // release slot r.handleIssue(plat, repo, issue) }(issue) } wg.Wait() ``` **PID tracking (requested in issue):** Add a `sync.Map` on `Runner` keyed by `issueKey` (e.g. `"owner/repo#N"`) storing the OS PID of the running `claude` process. `agent.ClaudeCode.Run()` would need to return the PID or accept a callback; alternatively expose `cmd.Process.Pid` after `cmd.Start()`. **Advantages:** ~30 lines of change, no new state, no new labels. **Risks / open questions:** - On process restart, any issue stuck in `State/InProgress` will stay there indefinitely — needs human intervention or a startup cleanup sweep. - `processRepo` is called on every poll tick (potentially while the previous goroutines are still running), so the semaphore must be on `Runner`, not local to `processRepo`, to correctly bound total concurrency across overlapping polls. Currently `poll()` itself is synchronous but that too would need protection. --- ### Alternative B — Persistent State Files + PID Tracking (crash-safe) **Files to change:** `internal/loop/loop.go`, `internal/config/config.go`; **new file:** `internal/state/state.go` **Approach:** - Introduce a JSON state file per issue at `<work_dir>/state/<platform>/<owner>/<repo>/issue-<N>.json`: ```json { "status": "running", "pid": 12345, "pr_url": "", "started_at": "2026-05-28T10:00:00Z" } ``` - Before starting a new agent for an issue, check if a state file exists and whether the PID is still alive (`kill(pid, 0)` / reading `/proc/<pid>`). - If alive → skip (already running). - If dead and status=running → agent crashed; transition issue to `State/Question` and clean up. - On successful PR creation, update status to `pr_created` and record `pr_url`. - Launch agents in goroutines as in Alternative A, but with the state file as the recovery anchor. **Advantages:** - Survives process restarts; orphaned `State/InProgress` issues are self-healing. - PID tracking is explicit and auditable. **Risks / open questions:** - PID reuse: if `agentloop` restarts and the OS has recycled the old PID, a different process might appear alive. Mitigate by also checking process start time or writing a lockfile the child process holds open. - State file and label may drift out of sync (state=running but label was manually changed). Treat the label as authoritative; state file is advisory. --- ### Alternative C — CI-Waiting Label (enables true planner/implementation split) **Files to change:** `internal/loop/loop.go`, `internal/config/config.go`, `internal/platform/platform.go` (new method), `internal/platform/forgejo.go`, `internal/platform/github.go` **Approach:** - Add a new label `State/WaitingCI` to the config and state machine. - After pushing the branch and creating the PR, instead of removing `State/InProgress`, transition the issue to `State/WaitingCI`. - Add a new poll step `checkWaitingCI()` that: 1. Fetches all open issues with `State/WaitingCI`. 2. For each, queries CI status via the platform API (Forgejo: `GET /repos/{owner}/{repo}/statuses/{sha}`; GitHub: Checks API). 3. If all checks pass → close the issue (or transition to a `State/Done` label). 4. If any check fails → transition to `State/Question` and comment with the failure. - The **planner** (initial agent run) can then proceed concurrently with `WaitingCI` issues, since those issues are no longer blocking a slot. **Platform interface addition needed:** ```go // GetCIStatus returns the combined CI status for the given commit SHA. GetCIStatus(sha string) (CIStatus, error) // GetPRHeadSHA returns the HEAD SHA of a PR by its URL or number. GetPRHeadSHA(prURL string) (string, error) ``` **Advantages:** - Cleanest separation of concerns; the agent loop truly becomes non-blocking for CI wait time. - Enables future distinction between a "planning" agent (fast, produces a spec) and an "implementation" agent (slow, produces code). **Risks / open questions:** - Requires additional platform API calls; rate-limit budget increases. - CI status APIs differ significantly between Forgejo and GitHub — needs careful abstraction. - PRs with no CI configured would hang in `WaitingCI` forever — need a timeout or a config option to skip CI gating. --- ### Alternative D — Worker Pool (most scalable) **Files to change:** `internal/loop/loop.go`, `cmd/agentloop/main.go`, `internal/config/config.go`; **new file:** `internal/worker/pool.go` **Approach:** - Create a `Pool` with N worker goroutines (buffered job channel). - `Runner.poll()` enqueues issues onto the job channel instead of calling `handleIssue` directly. - A running-set (`sync.Map`) prevents enqueuing the same issue twice across poll cycles. - Workers call `handleIssue` and update the running-set on completion. **Advantages:** - Naturally bounds total goroutine count; no semaphore needed. - Clean separation between dispatch (poll) and execution (worker). - Easy to introspect the queue depth. **Risks / open questions:** - If the job channel is full (all workers busy), `poll()` must either drop the issue (fine, next poll will retry) or block (risk of poll stacking). Recommend a non-blocking send with a drop-and-log. - More code than Alternatives A/B; may be premature for current scale. --- ### Recommended Path | Step | Alternative | Effort | |------|-------------|--------| | 1. Unblock the loop | A (goroutines + semaphore) | small | | 2. Add crash recovery | B (state files + PID tracking) | medium | | 3. Add CI-waiting | C (WaitingCI label) | medium | Alternatives A and B are independent of C and can be shipped first. Alternative C is the feature that makes planners "truly concurrent while CI is pending" as the issue describes — it is likely the primary goal here. --- ### Open Questions 1. **What is a "planner" specifically?** Is it a distinct agent role (e.g. first pass produces a design, second pass implements it), or just the initial `handleIssue` agent run? This affects whether we need a separate label/phase. 2. **Max concurrency limit:** Should this be per-repo or global? Default value? 3. **CI gating scope:** Should CI status be checked before merging (merge bot) or just before closing the issue? Currently agentloop does not auto-merge PRs. 4. **Restart behavior:** Should `agentloop` automatically retry issues that were `InProgress` at startup, or require human triage?
guettlibot commented 2026-05-28 04:32:02 +00:00 (Migrated from codeberg.org)

Planning complete. To resume this session:

claude --resume e483217b-982a-42bc-ab2a-b539004eb139 --dangerously-skip-permissions
Planning complete. To resume this session: ``` claude --resume e483217b-982a-42bc-ab2a-b539004eb139 --dangerously-skip-permissions ```
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: guettli/sharedinbox#321