- Track a heartbeat timestamp in ~/.sharedinbox-agent-heartbeat at the
start of each _run_loop() invocation so we can tell when it last ran.
- Add `agent_loop.py monitor` subcommand that exits 1 with a WARNING
message if the heartbeat is missing, corrupted, or older than 2 hours.
- Add .forgejo/workflows/monitor.yml scheduled workflow that runs the
monitor check every 2 hours on the self-hosted runner; a CI failure
serves as the warning when the loop is stalled.
- Add 7 unit tests covering all monitor / heartbeat scenarios.
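
A minimal sketch of the heartbeat write and the monitor check described above. The file path and the 2-hour threshold come from this message; the function names, the plain Unix-timestamp file format, and the exact warning wording are assumptions for illustration, not necessarily what agent_loop.py does.

```python
import time
from pathlib import Path

# Path and threshold are from the commit text; everything else is hypothetical.
HEARTBEAT_FILE = Path.home() / ".sharedinbox-agent-heartbeat"
MAX_AGE_SECONDS = 2 * 60 * 60


def write_heartbeat(path=HEARTBEAT_FILE, now=None):
    """Store the current Unix time; meant to run at the start of each loop run."""
    now = time.time() if now is None else now
    path.write_text(f"{now:.0f}\n")


def check_heartbeat(path=HEARTBEAT_FILE, now=None, max_age=MAX_AGE_SECONDS):
    """Return (exit_code, message): 0 if fresh, 1 if missing, corrupted, or stale."""
    now = time.time() if now is None else now
    try:
        last = float(path.read_text().strip())
    except FileNotFoundError:
        return 1, f"WARNING: heartbeat file {path} is missing"
    except ValueError:
        return 1, f"WARNING: heartbeat file {path} is corrupted"
    age = now - last
    if age > max_age:
        return 1, f"WARNING: last heartbeat was {age / 3600:.1f} hours ago"
    return 0, "OK"
```

A `monitor` subcommand could print the message and pass the code to `sys.exit()`, which is what makes the scheduled workflow fail when the loop has stalled.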
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_latest_main_ci_run() filtered with event != "pull_request", which still
matched deploy.yml schedule runs whose prettyref == "main",
blocking the loop from picking up new issues.
_latest_ci_run_for_branch() had the same issue: the else branch matched
any non-pull_request event including schedule runs.
Both functions now explicitly filter for event == "push" only.
Tests updated: rename _latest_ci_run → _latest_main_ci_run, mock
_open_issue_prs to prevent real API calls in unit tests, and update
_find_pr_for_branch side_effect to reflect the upstream post-merge
PR-still-open verification check.
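
The difference between the old and new filter can be sketched as follows. The `event` and `prettyref` fields are named in this message; the function name, the `id` field, and the assumption that larger IDs are newer are illustrative only.

```python
def latest_main_push_run(runs):
    """Return the newest run triggered by a push to main, or None.

    Filtering on event == "push" excludes schedule runs, which an
    event != "pull_request" check would have let through.
    """
    candidates = [
        r for r in runs
        if r.get("event") == "push" and r.get("prettyref") == "main"
    ]
    # Assumption: run IDs increase monotonically, so the max ID is the latest.
    return max(candidates, key=lambda r: r["id"], default=None)
```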
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Forgejo workflow_runs API has no head_branch field. For pull_request
events the branch lives in event_payload["pull_request"]["head"]["ref"];
for push events it is in prettyref. The old code used run.get("head_branch")
which always returned None, causing _latest_ci_run_for_branch to never find
the run and the loop to declare "no CI run after 15 min" and set the issue to
State/Question — even when CI had already passed.
Also fixes a pre-existing test mock that was missing the session_name kwarg.
Adds TestLatestCiRunForBranch covering both event types and the regression.
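
A sketch of the branch lookup described above, assuming each run is a dict as returned by the API. The field names are the ones quoted in this message; the helper name is hypothetical, and handling `event_payload` as either a dict or a JSON string is a defensive assumption.

```python
import json


def branch_of_run(run):
    """Return the branch a workflow run belongs to, or None if unknown."""
    event = run.get("event")
    if event == "push":
        return run.get("prettyref")
    if event == "pull_request":
        payload = run.get("event_payload") or {}
        if isinstance(payload, str):
            # Assumption: the payload may arrive as an encoded JSON string.
            try:
                payload = json.loads(payload)
            except json.JSONDecodeError:
                return None
        if not isinstance(payload, dict):
            return None
        return payload.get("pull_request", {}).get("head", {}).get("ref")
    return None
```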
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: create log dir with mode 0700 and enforce it on
existing dirs; open log files with mode 0600; chmod state file
to 0600 after every write. Prevents other local processes from
reading agent output (which may contain credential paths) or
tampering with the state file's pid field.
- ci/main.go (TestAndroidFirebase): replace
echo "$FIREBASE_SA_KEY" > /tmp/key.json
with bash process substitution
--key-file=<(echo "$FIREBASE_SA_KEY")
The key is now passed via a file descriptor — it never touches
disk, so it cannot be stranded by a failed gcloud auth call or
snapshotted into the Dagger layer cache.
- ci.yml / deploy.yml: add "Cleanup TLS credentials" step
(if: always()) at the end of every job that calls
setup_dagger_remote.sh. Removes /tmp/dagger-tls,
/tmp/stunnel-dagger.conf, /tmp/stunnel.pid from the self-hosted
runner after each job, so client certs do not accumulate between
job runs.
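
The permission handling in the first bullet could look roughly like this on a POSIX system. The modes 0700 and 0600 are from this message; the helper names are hypothetical and the real code in agent_loop.py may be organised differently.

```python
import os
from pathlib import Path


def ensure_private_dir(path):
    """Create the directory if needed and force mode 0700 either way."""
    path = Path(path)
    path.mkdir(mode=0o700, parents=True, exist_ok=True)
    # mkdir's mode is ignored for directories that already exist.
    path.chmod(0o700)
    return path


def open_private_log(path):
    """Open a log file for appending, readable and writable by the owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    os.fchmod(fd, 0o600)  # also tightens a file that existed before
    return os.fdopen(fd, "a")


def write_private_state(path, text):
    """Write the state file, then reset its mode to 0600."""
    Path(path).write_text(text)
    os.chmod(path, 0o600)
```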
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Only pick up issues created by guettli, guettlibot, or guettlibot2
to prevent the loop from acting on external/bot issues.
- Post an explanatory comment on the issue whenever the loop sets
State/Question (agent killed, no CI run, no push detected), so the
reason is visible without digging through cron logs. Closes #158.
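
The author allow-list from the first bullet can be sketched as a simple filter. The three account names are from this message; the function name and the `user.login` shape of an issue record are assumptions about the data the loop receives.

```python
ALLOWED_AUTHORS = frozenset({"guettli", "guettlibot", "guettlibot2"})


def eligible_issues(issues):
    """Keep only issues whose author is on the allow-list."""
    return [
        issue for issue in issues
        # Assumption: each issue carries its author as issue["user"]["login"].
        if (issue.get("user") or {}).get("login") in ALLOWED_AUTHORS
    ]
```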
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: agents now create an `issue-N-fix` branch and open a PR;
the loop discovers the PR via `fgj pr list`, tracks its CI run, squash-merges
on green, and falls back to the global-CI path if no PR exists (backward compat).
Adds `_find_pr_for_branch`, `_latest_ci_run_for_branch`, `_merge_pr` helpers.
- .forgejo/workflows/ci.yml: strip to the single fast `check` job only
(removes build-linux, deploy-playstore, publish-website).
- .forgejo/workflows/deploy.yml (new, replaces android-emulator-tests.yml):
scheduled hourly + workflow_dispatch; runs firebase tests, Play Store deploy,
Linux build/deploy, website publish; on completion sets CI/Full-Pass or
CI/Full-Fail label on the repo's DEPLOY_HEALTH_ISSUE tracking issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the agent exits immediately (e.g. rate-limit), the loop was closing the
pending issue against the *previous* CI run, which was still green.
Fix: record the latest CI run ID when an issue agent starts. If the run ID
hasn't changed when the agent exits, the agent pushed nothing → set
State/Question instead of closing.
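
The check boils down to comparing two run IDs. A sketch, with a hypothetical function name and hypothetical return labels:

```python
def decide_after_agent_exit(run_id_at_start, latest_run_id):
    """Decide what to do with the pending issue once the agent has exited.

    If no new CI run appeared since the agent started, nothing was pushed,
    so the previous (green) run says nothing about this issue.
    """
    if latest_run_id == run_id_at_start:
        return "set-question"
    return "wait-for-ci"
```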
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`tea api` exits 0 even on 401 responses, so `_close_issue` and
`_set_labels` appeared to succeed but did nothing. Issues were never
actually closed, causing them to be picked up again every cron tick.
Switch all write operations (close issue, set labels) and issue-list
reads to `fgj`, which has proper authentication. Keep `tea api` only
for CI run fetches where `fgj` times out (504). Add ~/go/bin to the
cron PATH so fgj is found.
Also add an error check in `_tea_get` for API-level error responses,
and strip State/InProgress when closing an issue.
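
Because the exit code cannot be trusted, the error check has to look at the response body. A sketch of that idea; the assumption here is that an API-level error arrives as a JSON object containing a `message` key and little else, and the names `ApiError` and `parse_api_output` are hypothetical.

```python
import json

# Assumption: error bodies only contain keys from this set.
_ERROR_KEYS = {"message", "url", "errors"}


class ApiError(RuntimeError):
    """Raised when the API returned an error object instead of data."""


def parse_api_output(stdout):
    """Parse CLI output as JSON and raise if it looks like an error response."""
    data = json.loads(stdout)
    if isinstance(data, dict) and "message" in data and set(data) <= _ERROR_KEYS:
        raise ApiError(str(data["message"]))
    return data
```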
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously issue agents were instructed to close the issue via prompt text
immediately after pushing. If CI then failed, the issue was already closed.
Now the loop tracks a pending_issue across cron ticks:
- When an agent finishes (issue or ci-fix), the issue number is extracted
from state before it is cleared.
- If CI is still running, a "pending-ci" state preserves the issue number.
- If CI fails, the ci-fix agent is started with the issue number in state
so it survives the fix cycle.
- Once CI passes, _close_issue() is called from Python — never by the agent.
The agent prompt no longer instructs the agent to close the issue.
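
The transitions above can be summarised as a small decision function. This is a sketch only: the "pending-ci" state name is from this message, while the function name, the action labels, and the dict layout of the state are assumptions.

```python
def after_agent_finished(ci_status, pending_issue):
    """Return (action, new_state) for a finished agent with a pending issue.

    ci_status is one of "running", "failure", "success".
    """
    if ci_status == "success":
        # Only the loop closes the issue, and only once CI is green.
        return "close-issue", None
    if ci_status == "failure":
        # Carry the issue number along so it survives the fix cycle.
        return "start-ci-fix", {"kind": "ci-fix", "issue": pending_issue}
    return "wait", {"kind": "pending-ci", "issue": pending_issue}
```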
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `---------------------- Starting YYYY-MM-DD HH:MMZ` header at each run
- Remove `[agent_loop]` prefix from all output lines
- Show full Codeberg URL for CI runs instead of bare run ID
- Show full issue URL and title when referencing issues
- Store issue_title in state file so "still running" messages include the title
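
The run header from the first bullet could be produced like this. The layout is copied from this message; the function name is hypothetical and the timestamp is assumed to be UTC, as the trailing `Z` suggests.

```python
from datetime import datetime, timezone


def run_header(now=None):
    """Build the separator line printed at the start of each run."""
    now = datetime.now(timezone.utc) if now is None else now
    return "---------------------- Starting " + now.strftime("%Y-%m-%d %H:%MZ")
```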
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
.daggerignore no longer needs to exclude $HOME dirs (fvm/, go/, .pub-cache/,
.claude/, snap/, etc.) since the project root is now sharedinbox/, not $HOME.
agent_loop.py: replace hardcoded /home/si with Path.home().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the tmux-based agent launcher with a direct subprocess.Popen
call. Claude sessions can't be attached to anyway, so the tmux layer
added complexity with no benefit. State now tracks a PID instead of a
tmux session name; liveness is checked with os.kill(pid, 0).
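
For reference, a liveness check built on `os.kill(pid, 0)` typically looks like the sketch below. Signal 0 delivers nothing and only performs the existence and permission checks. The function name is hypothetical.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

One caveat with this approach: PIDs can be reused, so a stale state file may point at an unrelated process.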
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The new Claude Code trust dialog appeared inside the tmux PTY despite -p
mode and stdout being piped, blocking the agent indefinitely. With
< /dev/null the dialog could never be answered.
Replace < /dev/null with printf '\n' | so the Enter keypress confirms the
default "Yes, I trust this folder" option. After that single newline, stdin
reaches EOF, which -p mode ignores.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously claude was launched with -p (print mode) which produces no
visible TUI. Attaching to the session with `tmux attach -t issue-NNN`
showed a blank terminal. Removing -p makes Claude run its interactive
TUI inside the tmux pane, so the session is fully watchable.
Add scripts/test_agent_loop.py covering _start_agent command
construction and state file round-trips.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without `< /dev/null`, claude detects the tmux PTY as stdin and blocks
waiting for user input that never arrives (the PTY never sends EOF).
The 3-second stdin-timeout only fires for pipe stdin, not TTY stdin.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace bare subprocess.Popen with `tmux new-session -d` so each agent
runs in a detached tmux session that inherits the tmux server's environment
(including ANTHROPIC_API_KEY / keychain access, which cron's minimal env
lacks — the root cause of intermittent empty log files).
- Track agents by tmux session name instead of PID; age is derived from the
state-file `started_at` timestamp rather than /proc/<pid>/stat.
- `_kill_agent` terminates via `tmux kill-session`; backward compat preserved
for old state files that stored a `pid`.
- Operators can now `tmux attach -t issue-<N>` to watch live output, or
`claude --resume issue-<N>` to continue the conversation afterward.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cron runs with a minimal environment that doesn't include ~/.nix-profile/bin,
causing every invocation to crash with FileNotFoundError on 'tea'.
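
One way to make a script independent of cron's minimal PATH is to extend it before spawning any tools. A sketch under the assumption that the fix is applied in Python rather than in the crontab; the helper name is hypothetical.

```python
import os


def with_extra_path(env, extra_dirs):
    """Return a copy of env whose PATH also contains extra_dirs, without duplicates."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    for directory in extra_dirs:
        directory = str(directory)
        if directory not in parts:
            parts.append(directory)
    new_env = dict(env)
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env
```

It could be called with `os.environ` and `Path.home() / ".nix-profile/bin"`, with the result passed as `env=` to `subprocess.run`.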
Closes #93
Polls Codeberg CI and State/Ready issues every 10 minutes, launching
Claude Code agents for CI fixes and issue work, with PID-based liveness
tracking and automatic timeout after 1 hour.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>