Speed up agent loop and deploy #234

Closed
opened 2026-05-24 19:04:59 +00:00 by guettli · 2 comments
guettli commented 2026-05-24 19:04:59 +00:00 (Migrated from codeberg.org)

Look at the current way the agent loop works.

Create a plan for how to speed up the flow. The final goal is to run deploy.yml.

Where could caching help?

Where could running concurrently help?

Where could calling cron jobs more often help?

Think about other ways, too.

guettlibot commented 2026-05-24 19:13:59 +00:00 (Migrated from codeberg.org)

Implementation Plan: Speed up agent loop and deploy

After reading the issue and exploring the codebase (crontab, scripts/agent_loop.py, .forgejo/workflows/deploy.yml, scripts/deploy_cron.py, Taskfile.yml, ci/main.go), here is a detailed breakdown of where time is lost and how to recover it.


Current flow and where time is wasted

The end-to-end cycle for a single issue currently looks like this (excluding actual agent work):

  1. Issue gets label → cron fires → planning agent starts: up to 5 min
  2. Planning agent finishes → next cron tick detects it → State/Planned set: up to 5 min
  3. Human sets State/Ready → next cron tick → impl agent starts: up to 5 min
  4. Impl agent finishes → next cron tick detects it → CI running: up to 5 min
  5. CI passes → next cron tick detects it → PR merged: up to 5 min
  6. PR merged to main → hourly schedule fires deploy.yml: up to 59 min

That is up to ~84 minutes of pure polling delay, before any actual build/deploy time.
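The 84-minute figure is five cron-bound transitions plus the hourly deploy wait. As a rough sketch of that arithmetic (the function name and parameters are illustrative and not part of agent_loop.py):

```python
def worst_case_polling_minutes(cron_interval_min: int,
                               deploy_wait_min: int,
                               transitions: int = 5) -> int:
    """Upper bound on idle waiting: one cron interval per state
    transition, plus the wait for the next deploy run."""
    return transitions * cron_interval_min + deploy_wait_min


print(worst_case_polling_minutes(5, 59))  # current setup: 5*5 + 59 = 84
print(worst_case_polling_minutes(1, 59))  # cron every minute: 64
print(worst_case_polling_minutes(1, 0))   # plus an immediate deploy trigger: 5
```

The last two calls correspond to options 1 and 2 below and show that the hourly deploy wait dominates the total.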


1. Increase cron frequency for agent_loop.py (Quick win)

File: user crontab
Change: */5 * * * * → */1 * * * *

Each state transition in the loop costs up to one cron interval of idle waiting. Steps 1–5 above each burn up to 5 minutes. With 1-minute intervals, those five steps shrink to at most 5 minutes total (instead of 25).

Risk: 5× more invocations. Each run makes 2–4 Codeberg API calls via tea and fgj. This is negligible for a single-user instance and well within Codeberg's rate limits.
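To put the extra load in numbers, here is a small sketch that turns the 2–4 calls per run into calls per hour (the helper is hypothetical; actual call counts depend on what each tick finds):

```python
def api_calls_per_hour(cron_interval_min: int, calls_per_run: int) -> int:
    """API calls per hour for a cron job that runs every cron_interval_min minutes."""
    runs_per_hour = 60 // cron_interval_min
    return runs_per_hour * calls_per_run


for interval in (5, 1):
    low = api_calls_per_hour(interval, 2)
    high = api_calls_per_hour(interval, 4)
    print(f"every {interval} min: {low}-{high} calls/hour")
```

With the numbers above, this moves from roughly 24–48 calls per hour to 120–240.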

Note: The docstring in agent_loop.py still says "every 10 minutes" but the crontab already runs it every 5 minutes — the docstring should be updated to match.


2. Trigger deploy.yml immediately after PR merge (Highest impact)

File: scripts/agent_loop.py
Change: After every successful _merge_pr() call, trigger deploy.yml immediately via the Forgejo API.

deploy.yml is currently scheduled hourly (0 * * * *), meaning up to 59 minutes pass between a PR merging to main and apps being deployed. Triggering it from the agent loop eliminates this gap entirely.

Add a helper after the merge calls (there are two: in section 2b post-agent merge and in the catch-up section):

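import subprocess

def _trigger_deploy() -> None:
    """Best-effort trigger of deploy.yml on main; never fails the loop."""
    # REPO is the module-level constant already used in agent_loop.py.
    result = subprocess.run(
        ["fgj", "--hostname", "codeberg.org", "actions", "workflow", "run",
         "deploy.yml", "--ref", "main", "--repo", REPO],
        capture_output=True, check=False,  # best-effort, don't fail the loop
    )
    if result.returncode == 0:
        print("Triggered deploy.yml.")
    else:
        print("Could not trigger deploy.yml (ignored).")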

Risk: If multiple PRs merge in quick succession, deploy.yml could be triggered several times within minutes. The check-changes job already skips redundant builds when nothing relevant changed, so this is mostly harmless overhead. Add a guard if desired: check whether a deploy.yml run started within the last N minutes before triggering again.
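One possible shape for such a guard is sketched below. It uses a simpler stand-in for querying the run history: it records the time of the last trigger in a local stamp file. The function name, the stamp file, and the interval are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def should_trigger_deploy(stamp: Path, min_interval_s: float,
                          now: Optional[float] = None) -> bool:
    """Return True and record the time, unless a trigger was already
    recorded less than min_interval_s seconds ago."""
    now = time.time() if now is None else now
    try:
        last = float(stamp.read_text())
    except (FileNotFoundError, ValueError):
        last = None  # no usable record yet: allow the trigger
    if last is not None and now - last < min_interval_s:
        return False
    stamp.write_text(str(now))
    return True
```

The loop would call this right before _trigger_deploy() and skip the trigger when it returns False. A skipped trigger is still picked up by the hourly schedule.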


3. Add a push trigger to deploy.yml (Complementary to #2)

File: .forgejo/workflows/deploy.yml
Change: Add a push trigger with path filters alongside the existing schedule:

on:
  schedule:
    - cron: '0 * * * *'
  push:
    branches: [main]
    paths:
      - 'android/**'
      - 'integration_test/**'
      - 'lib/**'
      - 'pubspec.yaml'
      - 'pubspec.lock'
      - 'drift_schemas/**'
      - 'scripts/deploy_playstore.py'
      - 'linux/**'
  workflow_dispatch:

This is complementary to option #2 (the loop trigger fires even for non-source changes; the push trigger fires only when relevant files change). Either approach eliminates the hourly wait; both together give belt-and-suspenders coverage.

Risk: Forgejo evaluates the paths filter the same way as check-changes — if neither Android nor Linux source files changed, deploy.yml won't run. The existing hourly schedule still covers edge cases (e.g., infra/config changes not covered by the path filter).
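To get a feel for which merges the push trigger would pick up, the path list can be approximated locally. This is only a sketch: it uses Python's fnmatch, whose `*` also matches `/`, so it approximates the workflow's glob semantics and is not a reimplementation of Forgejo's matcher. The function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Same patterns as in the proposed push trigger.
DEPLOY_PATHS = [
    "android/**", "integration_test/**", "lib/**",
    "pubspec.yaml", "pubspec.lock", "drift_schemas/**",
    "scripts/deploy_playstore.py", "linux/**",
]


def push_would_trigger(changed_files, patterns=DEPLOY_PATHS) -> bool:
    """True if at least one changed file matches at least one pattern."""
    return any(fnmatchcase(path, pattern)
               for path in changed_files
               for pattern in patterns)


print(push_would_trigger(["lib/src/app.dart", "README.md"]))
print(push_would_trigger(["scripts/agent_loop.py", "README.md"]))
```

The second call illustrates the edge case named above: a change that only touches scripts/agent_loop.py matches no pattern, so only the schedule or the loop trigger from option 2 would cover it.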


4. Fix the hourly change-detection window in deploy.yml

File: .forgejo/workflows/deploy.yml, check-changes job
Current bug: git diff --name-only HEAD~1 HEAD compares only the last commit. On the hourly schedule, if two PRs merged since the last run, the first PR's file changes are invisible. Its Android/Linux changes will be silently skipped, and no build will fire.

Fix (simplest): For scheduled runs, always build. workflow_dispatch already sets both flags to true. Extend that logic:

if [ "${{ github.event_name }}" = "workflow_dispatch" ] || [ "${{ github.event_name }}" = "schedule" ]; then
  echo "android=true" >> "$GITHUB_OUTPUT"
  echo "linux=true"   >> "$GITHUB_OUTPUT"
  exit 0
fi

Dagger's caching means redundant builds are cheap — if nothing changed, Dagger replays from cache. The expensive steps (Gradle compilation, Flutter build) are fully cached on the self-hosted runner's Dagger engine volumes.

Alternative fix: Increase fetch-depth and diff from the last successful deploy's SHA, stored in a file on the runner. More precise, more complex.
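A minimal sketch of the alternative fix, assuming the last successfully deployed SHA is kept in a plain text file on the runner (the file and the function name are hypothetical):

```python
from pathlib import Path
from typing import List


def diff_command(last_deploy_file: Path) -> List[str]:
    """Build the git diff command, starting from the last deployed SHA
    if one is recorded, otherwise from HEAD~1 (the current behaviour)."""
    try:
        base = last_deploy_file.read_text().strip()
    except FileNotFoundError:
        base = ""
    if not base:
        base = "HEAD~1"
    return ["git", "diff", "--name-only", base, "HEAD"]
```

For this to work the checkout must be deep enough to contain the recorded SHA, and the file must only be updated after a deploy has actually succeeded.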


5. Allow concurrent planning + implementation agents (Throughput)

File: scripts/agent_loop.py
Change: Replace the single-agent state file with separate state tracking for plan agents and impl agents, allowing one of each to run simultaneously.

Currently the loop starts either a plan agent or an impl agent per tick, never both. When a planning agent is running, no implementation work happens — even if a separate issue with State/Ready is waiting.

Approach: Use two state files (~/.sharedinbox-plan-state.json and ~/.sharedinbox-impl-state.json), or store a list in the existing state file. The loop logic becomes:

  • If no plan agent running and a ToPlan issue exists → start plan agent
  • If no impl agent running and a Ready issue exists (and CI is clean) → start impl agent
  • Both can run simultaneously

Risk: Two agents running task check concurrently both invoke Dagger. Since Dagger supports concurrent access to its cache volumes, this is generally safe but may slow individual runs on a resource-constrained runner. A ci-fix agent should remain mutually exclusive with impl agents (to avoid main-branch conflicts). This is the most complex change here.
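The per-tick decision described above, including the ci-fix exclusion, could look roughly like this. It is a sketch with made-up names; how the running agents are read from the state files is left out.

```python
from typing import List, Set


def agents_to_start(running: Set[str], has_toplan: bool,
                    has_ready: bool, ci_clean: bool) -> List[str]:
    """Decide which agent kinds to start on this tick.

    running holds the kinds currently active: "plan", "impl", "ci-fix".
    """
    starts = []
    if "plan" not in running and has_toplan:
        starts.append("plan")
    # impl stays mutually exclusive with ci-fix to avoid main-branch conflicts.
    if ("impl" not in running and "ci-fix" not in running
            and has_ready and ci_clean):
        starts.append("impl")
    return starts


print(agents_to_start(set(), True, True, True))     # both can start
print(agents_to_start({"plan"}, True, True, True))  # only impl starts
```

Keeping the decision in a pure function like this would make it testable without touching the state files or the Codeberg API.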

Recommendation: Only worth doing once there is a persistent backlog of issues; for typical use (a few issues at a time), options 1–4 will provide more practical speed-up with far less complexity.


6. Minor: decouple publish-website from Firebase test results

File: .forgejo/workflows/deploy.yml
Current: publish-website depends on [build-linux, deploy-playstore, deploy-apk]. If Firebase tests fail (a separate job), deploy-playstore skips/fails, which in turn blocks publish-website.

The website publish (Hugo build + rsync) is independent of whether Android tests pass. Consider gating publish-website only on the build jobs that actually produced deployable artifacts, not on test results.


Summary and recommended order

| # | Change | Files | Impact | Effort | Risk |
|---|--------|-------|--------|--------|------|
| 1 | Cron */1 | crontab | Saves ~20 min per full cycle | 1 line | Very low |
| 2 | Trigger deploy.yml after merge | agent_loop.py | Eliminates up to 59 min deploy lag | ~15 lines | Low |
| 3 | Push trigger on deploy.yml | deploy.yml | Same as #2 via CI | ~10 lines | Low |
| 4 | Fix hourly diff window | deploy.yml | Correctness: no missed builds | ~5 lines | Low |
| 5 | Parallel plan+impl agents | agent_loop.py | Throughput with multiple issues | Large refactor | Medium |
| 6 | Decouple publish-website | deploy.yml | Website deploys more reliably | ~5 lines | Low |

Start with 1, 2, 4 — these are the highest-value, lowest-risk changes and can all be implemented in one small PR. Add option 3 if belt-and-suspenders deploy triggering is desired. Defer option 5 until there is a real throughput bottleneck.

guettlibot commented 2026-05-24 19:15:02 +00:00 (Migrated from codeberg.org)

Planning complete. To resume this session:

claude --resume 8cc2138f-1fe8-46e5-8ad4-188ab397e80c

Reference: guettli/sharedinbox#234