Agent loop, add label Question, avoid endless loop #239

Closed
opened 2026-05-25 05:52:22 +00:00 by guettli · 3 comments
guettli commented 2026-05-25 05:52:22 +00:00 (Migrated from codeberg.org)

There was an endless loop.

See

https://codeberg.org/guettli/sharedinbox/issues/210

Break such loops by setting label Question.

Update agent loop.

Create a plan.

There was an endless loop. See https://codeberg.org/guettli/sharedinbox/issues/210 Break such loops by setting label Question. Update agent loop. Create a plan.
guettlibot commented 2026-05-25 07:07:54 +00:00 (Migrated from codeberg.org)

Implementation Plan

Root Cause Analysis (from issue #210)

Issue #210 experienced an endless loop: PR #215 was stuck in a state where _merge_pr() returned without error but the PR remained open. The catch-up section in _run_loop() kept finding the same open PR on every cron tick, attempting the merge, finding it still open, and posting another "Automatic merge of PR #215 failed" comment — generating 148 identical comments over ~17 hours.

The current code already has a State/Question check to break such loops (line 695–697 of scripts/agent_loop.py), but there are three gaps that allow the protection to be bypassed:

Gap 1 — _get_issue_labels() can raise and crash the loop.
The check at line 695 calls _get_issue_labels(issue_num), which calls _tea_get(), which raises RuntimeError on any API error or non-zero exit. An unhandled exception propagates up through _run_loop(), the cron job exits with an error, and on the next tick the same PR is found again — State/Question is never checked.

Gap 2 — State/Question is NOT set when _merge_pr() raises RuntimeError.
Lines 699–702 catch the exception and continue, but they do not set State/Question. On every subsequent cron tick, the PR is found open, CI is success, LABEL_QUESTION is not present → merge is retried indefinitely.

Gap 3 — State/Question check is too narrow.
The LABEL_QUESTION check is nested inside if pr_run and pr_run.get("status") == "success". If for any reason the CI run is absent or its status changes (e.g. re-run), the check is skipped entirely.


Files to Change

Only one file: scripts/agent_loop.py

Supporting changes: scripts/test_agent_loop.py


Detailed Changes

Change 1 — Harden State/Question check; move it before CI status checks (catch-up loop, ~line 683)

Currently:

for pr in open_prs:
    pr_number = pr["number"]
    ...
    pr_run = _latest_ci_run_for_pr(pr_number)

    if pr_run and pr_run.get("status") == "running":
        ...

    if pr_run and pr_run.get("status") in ("failure", "error"):
        ...

    if pr_run and pr_run.get("status") == "success":
        if issue_num and LABEL_QUESTION in _get_issue_labels(issue_num):
            ...
            continue
        try:
            _merge_pr(pr_number)
        except RuntimeError as e:
            print(f"Catch-up: merge of PR #{pr_number} failed: {e} — skipping.")
            continue

Proposed — move the State/Question guard to the top of the loop body (before the CI-status checks), and add try/except:

for pr in open_prs:
    pr_number = pr["number"]
    ...

    # Guard: skip any PR whose issue is already blocked on a human.
    if issue_num:
        try:
            if LABEL_QUESTION in _get_issue_labels(issue_num):
                print(f"Catch-up: PR #{pr_number} — issue #{issue_num} is State/Question, skipping.")
                continue
        except Exception as e:
            print(f"Catch-up: could not fetch labels for issue #{issue_num}: {e} — skipping PR #{pr_number}.")
            continue

    pr_run = _latest_ci_run_for_pr(pr_number)
    ...

This ensures the guard fires regardless of CI status, and that an API error while checking labels causes a safe skip rather than a propagating exception.

Change 2 — Set State/Question when _merge_pr() raises in catch-up loop (~line 699)

Currently:

try:
    _merge_pr(pr_number)
except RuntimeError as e:
    print(f"Catch-up: merge of PR #{pr_number} failed: {e} — skipping.")
    continue

Proposed:

try:
    _merge_pr(pr_number)
except RuntimeError as e:
    print(f"Catch-up: merge of PR #{pr_number} failed: {e} — setting to State/Question.")
    if issue_num:
        try:
            _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS])
            _comment_issue(
                issue_num,
                f"Automatic merge of PR #{pr_number} failed: {e}. Please merge manually.",
            )
        except Exception as label_err:
            print(f"Catch-up: also failed to set State/Question on issue #{issue_num}: {label_err}")
    continue

Change 3 — Wrap the State/Question-setting code in the existing "still open after merge" path in try/except (~line 706)

Currently, if _set_labels() or _comment_issue() raises in the "PR still open after merge" path, the exception propagates and the label is never reliably set. The fix is to wrap:

if _find_pr_for_branch(branch):
    print(...)
    if issue_num:
        try:
            _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS])
            _comment_issue(issue_num, "...")
        except Exception as e:
            print(f"Catch-up: failed to update issue #{issue_num}: {e}")
    continue

This ensures the outer continue is always reached even if the Codeberg API calls fail, preventing the exception from crashing the loop.

Change 4 — Add comment deduplication helper

To prevent a secondary flood in edge cases where State/Question is set but keeps getting bypassed, add a helper that checks whether the most recent comment on an issue already contains a given message prefix, and skips posting if so:

def _last_comment_body(issue: int) -> str:
    """Return the body of the most recent comment on an issue, or empty string."""    data = _tea_get(f"repos/{REPO}/issues/{issue}/comments?limit=1&token=...")
    ...

Actually, checking existing comments adds API calls and complexity. A simpler alternative is to rely on the State/Question mechanism being reliable after Changes 1–3, and add a comment-rate-limit field to the state file. This change is lower priority; if Changes 1–3 are implemented, the flooding scenario should not recur. Defer comment deduplication unless testing reveals it is still needed.


Test Changes (scripts/test_agent_loop.py)

Add a new test class TestCatchUpLoop (or extend the existing suite) covering:

  1. _get_issue_labels raises → catch-up loop skips the PR gracefully, does not crash, does not attempt merge.
  2. _merge_pr raises RuntimeError → catch-up loop sets State/Question label on the issue and posts a comment.
  3. Issue already has State/Question → catch-up loop skips the PR without attempting merge or posting a comment.
  4. Merge succeeds but PR still open, then second tick → second tick finds State/Question and skips, no duplicate comment posted.

The existing test patterns in test_agent_loop.py use unittest.mock.patch on agent_loop._fgj, agent_loop._tea_get, etc. — follow those conventions.


Risks and Open Questions

  1. Does State/Question conflict with other labels? The label-set call uses remove=[LABEL_IN_PROGRESS] but does not remove LABEL_READY or LABEL_TO_PLAN. If an issue somehow has both State/Ready and State/Question, the next _ready_issues() call would still pick it up. Consider whether the set should also explicitly remove LABEL_READY and LABEL_TO_PLAN.

  2. The _latest_ci_run_for_pr call is inside the loop and makes an API request per PR. If there are many open issue-fix PRs, this generates many API calls. Not a blocking issue now, but worth noting.

  3. Idempotency of _set_labels: If State/Question is already set and we call _set_labels(add=[LABEL_QUESTION], ...) again, does Codeberg/fgj error? It should be a no-op, but worth verifying.

  4. The issue title says "Update agent loop" broadly. The above plan addresses the specific endless-loop bug from issue #210. If there are additional loop improvements intended (e.g. better CI-fix retry limiting, main-branch loop guards), those should be tracked as separate issues or clarified here.

## Implementation Plan ### Root Cause Analysis (from issue #210) Issue #210 experienced an endless loop: PR #215 was stuck in a state where `_merge_pr()` returned without error but the PR remained open. The catch-up section in `_run_loop()` kept finding the same open PR on every cron tick, attempting the merge, finding it still open, and posting another "Automatic merge of PR #215 failed" comment — generating 148 identical comments over ~17 hours. The current code already has a State/Question check to break such loops (line 695–697 of `scripts/agent_loop.py`), but there are three gaps that allow the protection to be bypassed: **Gap 1 — `_get_issue_labels()` can raise and crash the loop.** The check at line 695 calls `_get_issue_labels(issue_num)`, which calls `_tea_get()`, which raises `RuntimeError` on any API error or non-zero exit. An unhandled exception propagates up through `_run_loop()`, the cron job exits with an error, and on the next tick the same PR is found again — State/Question is never checked. **Gap 2 — State/Question is NOT set when `_merge_pr()` raises RuntimeError.** Lines 699–702 catch the exception and `continue`, but they do not set State/Question. On every subsequent cron tick, the PR is found open, CI is success, LABEL_QUESTION is not present → merge is retried indefinitely. **Gap 3 — State/Question check is too narrow.** The LABEL_QUESTION check is nested inside `if pr_run and pr_run.get("status") == "success"`. If for any reason the CI run is absent or its status changes (e.g. re-run), the check is skipped entirely. --- ### Files to Change **Only one file**: `scripts/agent_loop.py` Supporting changes: `scripts/test_agent_loop.py` --- ### Detailed Changes #### Change 1 — Harden State/Question check; move it before CI status checks (catch-up loop, ~line 683) Currently: ```python for pr in open_prs: pr_number = pr["number"] ... pr_run = _latest_ci_run_for_pr(pr_number) if pr_run and pr_run.get("status") == "running": ... if pr_run and pr_run.get("status") in ("failure", "error"): ... if pr_run and pr_run.get("status") == "success": if issue_num and LABEL_QUESTION in _get_issue_labels(issue_num): ... continue try: _merge_pr(pr_number) except RuntimeError as e: print(f"Catch-up: merge of PR #{pr_number} failed: {e} — skipping.") continue ``` Proposed — move the State/Question guard to the top of the loop body (before the CI-status checks), and add try/except: ```python for pr in open_prs: pr_number = pr["number"] ... # Guard: skip any PR whose issue is already blocked on a human. if issue_num: try: if LABEL_QUESTION in _get_issue_labels(issue_num): print(f"Catch-up: PR #{pr_number} — issue #{issue_num} is State/Question, skipping.") continue except Exception as e: print(f"Catch-up: could not fetch labels for issue #{issue_num}: {e} — skipping PR #{pr_number}.") continue pr_run = _latest_ci_run_for_pr(pr_number) ... ``` This ensures the guard fires regardless of CI status, and that an API error while checking labels causes a safe skip rather than a propagating exception. #### Change 2 — Set State/Question when `_merge_pr()` raises in catch-up loop (~line 699) Currently: ```python try: _merge_pr(pr_number) except RuntimeError as e: print(f"Catch-up: merge of PR #{pr_number} failed: {e} — skipping.") continue ``` Proposed: ```python try: _merge_pr(pr_number) except RuntimeError as e: print(f"Catch-up: merge of PR #{pr_number} failed: {e} — setting to State/Question.") if issue_num: try: _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS]) _comment_issue( issue_num, f"Automatic merge of PR #{pr_number} failed: {e}. Please merge manually.", ) except Exception as label_err: print(f"Catch-up: also failed to set State/Question on issue #{issue_num}: {label_err}") continue ``` #### Change 3 — Wrap the State/Question-setting code in the existing "still open after merge" path in try/except (~line 706) Currently, if `_set_labels()` or `_comment_issue()` raises in the "PR still open after merge" path, the exception propagates and the label is never reliably set. The fix is to wrap: ```python if _find_pr_for_branch(branch): print(...) if issue_num: try: _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS]) _comment_issue(issue_num, "...") except Exception as e: print(f"Catch-up: failed to update issue #{issue_num}: {e}") continue ``` This ensures the outer `continue` is always reached even if the Codeberg API calls fail, preventing the exception from crashing the loop. #### Change 4 — Add comment deduplication helper To prevent a secondary flood in edge cases where State/Question is set but keeps getting bypassed, add a helper that checks whether the most recent comment on an issue already contains a given message prefix, and skips posting if so: ```python def _last_comment_body(issue: int) -> str: """Return the body of the most recent comment on an issue, or empty string.""" data = _tea_get(f"repos/{REPO}/issues/{issue}/comments?limit=1&token=...") ... ``` Actually, checking existing comments adds API calls and complexity. A simpler alternative is to rely on the State/Question mechanism being reliable after Changes 1–3, and add a comment-rate-limit field to the state file. **This change is lower priority**; if Changes 1–3 are implemented, the flooding scenario should not recur. Defer comment deduplication unless testing reveals it is still needed. --- ### Test Changes (`scripts/test_agent_loop.py`) Add a new test class `TestCatchUpLoop` (or extend the existing suite) covering: 1. **`_get_issue_labels` raises** → catch-up loop skips the PR gracefully, does not crash, does not attempt merge. 2. **`_merge_pr` raises RuntimeError** → catch-up loop sets State/Question label on the issue and posts a comment. 3. **Issue already has State/Question** → catch-up loop skips the PR without attempting merge or posting a comment. 4. **Merge succeeds but PR still open, then second tick** → second tick finds State/Question and skips, no duplicate comment posted. The existing test patterns in `test_agent_loop.py` use `unittest.mock.patch` on `agent_loop._fgj`, `agent_loop._tea_get`, etc. — follow those conventions. --- ### Risks and Open Questions 1. **Does State/Question conflict with other labels?** The label-set call uses `remove=[LABEL_IN_PROGRESS]` but does not remove `LABEL_READY` or `LABEL_TO_PLAN`. If an issue somehow has both State/Ready and State/Question, the next `_ready_issues()` call would still pick it up. Consider whether the set should also explicitly remove LABEL_READY and LABEL_TO_PLAN. 2. **The `_latest_ci_run_for_pr` call is inside the loop and makes an API request per PR.** If there are many open issue-fix PRs, this generates many API calls. Not a blocking issue now, but worth noting. 3. **Idempotency of `_set_labels`**: If State/Question is already set and we call `_set_labels(add=[LABEL_QUESTION], ...)` again, does Codeberg/fgj error? It should be a no-op, but worth verifying. 4. **The issue title says "Update agent loop" broadly.** The above plan addresses the specific endless-loop bug from issue #210. If there are additional loop improvements intended (e.g. better CI-fix retry limiting, main-branch loop guards), those should be tracked as separate issues or clarified here.
guettlibot commented 2026-05-25 07:08:02 +00:00 (Migrated from codeberg.org)

Planning complete. To resume this session:

claude --resume 95feae02-a82a-4e97-b13c-ef0ed52697b3
Planning complete. To resume this session: ``` claude --resume 95feae02-a82a-4e97-b13c-ef0ed52697b3 ```
guettlibot commented 2026-05-25 07:11:06 +00:00 (Migrated from codeberg.org)

Implementation Plan (updated)

What was already done

Commit 696dcdc added _get_issue_labels() and placed a State/Question guard inside the catch-up loop's status == "success" branch. This stops the most common loop: repeated merge retries when CI is passing.

Remaining gaps

The previous plan comment identified three gaps. Commit 696dcdc addressed gap 3 partially, but gaps 1 and 2 are still open:

Gap 1 — _get_issue_labels() can raise and bypass the guard.
_get_issue_labels() calls _tea_get(), which raises RuntimeError on any API error. If the call fails, the exception propagates out of the catch-up loop uncaught, the guard is never evaluated, and the merge attempt happens anyway. After a failed merge the loop exits with an exception rather than setting State/Question.

Gap 2 — _merge_pr() raising RuntimeError in catch-up does not set State/Question.
Lines ~699–702 catch the exception and continue without setting the label. On the next tick the PR is found again, CI is still passing, State/Question is absent → merge is retried indefinitely.

Gap 3 (partial) — guard is inside status == "success" only.
If pr_run is None or has an unexpected status, the guard is skipped entirely. Moving the guard to the top of the for pr in open_prs: loop body (before the CI-status checks) would make the protection unconditional.

Changes required

File: scripts/agent_loop.py

Change A — Move the State/Question guard to the top of the catch-up for loop and wrap it in try/except:

for pr in open_prs:
    pr_number = pr["number"]
    ...
    m = re.match(r"^issue-(\d+)-fix$", branch or "")
    issue_num = int(m.group(1)) if m else None

    # Guard — skip any PR whose issue is already blocked on a human.
    if issue_num:
        try:
            if LABEL_QUESTION in _get_issue_labels(issue_num):
                print(f"Catch-up: PR #{pr_number} — issue #{issue_num} is State/Question, skipping.")
                continue
        except Exception as e:
            print(f"Catch-up: could not fetch labels for issue #{issue_num}: {e} — skipping PR #{pr_number} to be safe.")
            continue

    pr_run = _latest_ci_run_for_pr(pr_number)
    ...

Remove the duplicate guard that is currently inside the status == "success" block.

Change B — Set State/Question when _merge_pr() raises in the catch-up loop:

try:
    _merge_pr(pr_number)
except RuntimeError as e:
    print(f"Catch-up: merge of PR #{pr_number} failed: {e} — setting to State/Question.")
    if issue_num:
        try:
            _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS])
            _comment_issue(
                issue_num,
                f"Automatic merge of PR #{pr_number} failed: {e}. Please merge manually.",
            )
        except Exception as label_err:
            print(f"Catch-up: also failed to set State/Question on issue #{issue_num}: {label_err}")
    continue

Change C — Wrap the existing "PR still open after merge" State/Question-setting block in try/except so that a Codeberg API failure during the label update does not prevent continue from being reached:

if _find_pr_for_branch(branch):
    print(...)
    if issue_num:
        try:
            _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS])
            _comment_issue(issue_num, "...")
        except Exception as e:
            print(f"Catch-up: failed to update issue #{issue_num}: {e}")
    continue   # always reached

File: scripts/test_agent_loop.py

Add tests in a new class (e.g. TestCatchUpLoopGuards):

  1. _get_issue_labels raises → catch-up skips the PR, no merge attempted, no crash.
  2. _merge_pr raises RuntimeError → catch-up sets State/Question on the issue and posts a comment.
  3. Issue already has State/Question and CI is not "success" (e.g. pr_run=None) → guard at top of loop still fires, merge skipped.

Follow the existing unittest.mock.patch patterns in the test file.

Risks and open questions

  1. Idempotency of _set_labels: Calling add=[LABEL_QUESTION] when the label is already present should be a no-op on Codeberg. The fgj tool exits 0 in this case. No special handling needed.

  2. Should LABEL_READY and LABEL_TO_PLAN also be removed when setting State/Question? The current code only removes LABEL_IN_PROGRESS. If an issue somehow has both State/Ready and State/Question, it would still appear in _ready_issues(). Low-risk in practice but worth removing them defensively.

  3. Section 2b (regular pending-issue flow) has no State/Question guard before merge. If the catch-up writes a pending-ci state for an issue that already has State/Question (e.g. after a new CI run starts on the PR branch), section 2b will attempt the merge without checking. This is a lower-priority gap because catch-up's pending-ci write happens only when CI is still running — after Change A is in place, the catch-up will skip the PR entirely when State/Question is set, so the pending-ci write won't occur. Mentioning it for completeness.

## Implementation Plan (updated) ### What was already done Commit `696dcdc` added `_get_issue_labels()` and placed a `State/Question` guard inside the catch-up loop's `status == "success"` branch. This stops the most common loop: repeated merge retries when CI is passing. ### Remaining gaps The previous plan comment identified three gaps. Commit `696dcdc` addressed gap 3 partially, but gaps 1 and 2 are still open: **Gap 1 — `_get_issue_labels()` can raise and bypass the guard.** `_get_issue_labels()` calls `_tea_get()`, which raises `RuntimeError` on any API error. If the call fails, the exception propagates out of the catch-up loop uncaught, the guard is never evaluated, and the merge attempt happens anyway. After a failed merge the loop exits with an exception rather than setting `State/Question`. **Gap 2 — `_merge_pr()` raising `RuntimeError` in catch-up does not set `State/Question`.** Lines ~699–702 catch the exception and `continue` without setting the label. On the next tick the PR is found again, CI is still passing, `State/Question` is absent → merge is retried indefinitely. **Gap 3 (partial) — guard is inside `status == "success"` only.** If `pr_run` is `None` or has an unexpected status, the guard is skipped entirely. Moving the guard to the top of the `for pr in open_prs:` loop body (before the CI-status checks) would make the protection unconditional. ### Changes required #### File: `scripts/agent_loop.py` **Change A** — Move the `State/Question` guard to the top of the catch-up `for` loop and wrap it in `try/except`: ```python for pr in open_prs: pr_number = pr["number"] ... m = re.match(r"^issue-(\d+)-fix$", branch or "") issue_num = int(m.group(1)) if m else None # Guard — skip any PR whose issue is already blocked on a human. if issue_num: try: if LABEL_QUESTION in _get_issue_labels(issue_num): print(f"Catch-up: PR #{pr_number} — issue #{issue_num} is State/Question, skipping.") continue except Exception as e: print(f"Catch-up: could not fetch labels for issue #{issue_num}: {e} — skipping PR #{pr_number} to be safe.") continue pr_run = _latest_ci_run_for_pr(pr_number) ... ``` Remove the duplicate guard that is currently inside the `status == "success"` block. **Change B** — Set `State/Question` when `_merge_pr()` raises in the catch-up loop: ```python try: _merge_pr(pr_number) except RuntimeError as e: print(f"Catch-up: merge of PR #{pr_number} failed: {e} — setting to State/Question.") if issue_num: try: _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS]) _comment_issue( issue_num, f"Automatic merge of PR #{pr_number} failed: {e}. Please merge manually.", ) except Exception as label_err: print(f"Catch-up: also failed to set State/Question on issue #{issue_num}: {label_err}") continue ``` **Change C** — Wrap the existing "PR still open after merge" State/Question-setting block in `try/except` so that a Codeberg API failure during the label update does not prevent `continue` from being reached: ```python if _find_pr_for_branch(branch): print(...) if issue_num: try: _set_labels(issue_num, add=[LABEL_QUESTION], remove=[LABEL_IN_PROGRESS]) _comment_issue(issue_num, "...") except Exception as e: print(f"Catch-up: failed to update issue #{issue_num}: {e}") continue # always reached ``` #### File: `scripts/test_agent_loop.py` Add tests in a new class (e.g. `TestCatchUpLoopGuards`): 1. **`_get_issue_labels` raises** → catch-up skips the PR, no merge attempted, no crash. 2. **`_merge_pr` raises `RuntimeError`** → catch-up sets `State/Question` on the issue and posts a comment. 3. **Issue already has `State/Question` and CI is not "success"** (e.g. `pr_run=None`) → guard at top of loop still fires, merge skipped. Follow the existing `unittest.mock.patch` patterns in the test file. ### Risks and open questions 1. **Idempotency of `_set_labels`**: Calling `add=[LABEL_QUESTION]` when the label is already present should be a no-op on Codeberg. The fgj tool exits 0 in this case. No special handling needed. 2. **Should `LABEL_READY` and `LABEL_TO_PLAN` also be removed** when setting `State/Question`? The current code only removes `LABEL_IN_PROGRESS`. If an issue somehow has both `State/Ready` and `State/Question`, it would still appear in `_ready_issues()`. Low-risk in practice but worth removing them defensively. 3. **Section 2b (regular pending-issue flow) has no `State/Question` guard before merge.** If the catch-up writes a `pending-ci` state for an issue that already has `State/Question` (e.g. after a new CI run starts on the PR branch), section 2b will attempt the merge without checking. This is a lower-priority gap because catch-up's `pending-ci` write happens only when CI is still running — after Change A is in place, the catch-up will skip the PR entirely when `State/Question` is set, so the `pending-ci` write won't occur. Mentioning it for completeness.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: guettli/sharedinbox#239