security: verify PID belongs to claude before SIGKILL (PID recycling risk) #160

Closed
opened 2026-05-23 08:55:18 +00:00 by guettlibot · 18 comments
guettlibot commented 2026-05-23 08:55:18 +00:00 (Migrated from codeberg.org)

Why SIGKILL is needed

The agent loop kills long-running agents (> 1 h) to prevent a runaway Claude session from holding the loop hostage indefinitely. Without the kill, no new issue agent could ever start on that machine again.

The kill path in `_run_loop()` (agent_loop.py ~line 368):

```python
_kill_agent(state) # os.kill(pid, 9)
```

`_kill_agent` sends SIGKILL to the PID stored in `~/.sharedinbox-agent-state.json`.

The risk

Linux recycles PIDs. If the original `claude` process died between the previous cron tick and the current one, the stored PID may now belong to a completely different process — a system daemon, another user's process, or any other program.

`_agent_alive` (os.kill(pid, 0)) and `_kill_agent` (os.kill(pid, 9)) are two separate syscalls with no atomicity guarantee, so there is a TOCTOU window even when the process exists at check time.

Additionally, the state file is currently written world-readable (`0664`), meaning any local process could craft a `{"pid": 1, ...}` entry to make the loop send SIGKILL to init. (The state file permissions were fixed in b6a2f91, making this specific escalation moot — but the recycling race remains.)

Fix

Before calling `os.kill(pid, 9)`, verify the process is actually `claude`:

```python
def _is_claude_process(pid: int) -> bool:
try:
comm = Path(f"/proc/{pid}/comm").read_text().strip()
return comm in ("claude", "node") # claude may run as node
except OSError:
return False
```

Then in `_kill_agent`:
```python
def _kill_agent(state: dict) -> None:
pid = state.get("pid")
if pid and _is_claude_process(pid):
try:
os.kill(pid, 9)
except ProcessLookupError:
pass
elif pid:
print(f"WARNING: pid {pid} is not a claude process — skipping kill to avoid hitting recycled PID")
```

Also consider storing `/proc/{pid}/stat` field 22 (starttime) at agent launch and comparing it before killing, for a stronger identity check.

## Why SIGKILL is needed The agent loop kills long-running agents (> 1 h) to prevent a runaway Claude session from holding the loop hostage indefinitely. Without the kill, no new issue agent could ever start on that machine again. The kill path in \`_run_loop()\` (agent_loop.py ~line 368): \`\`\`python _kill_agent(state) # os.kill(pid, 9) \`\`\` \`_kill_agent\` sends SIGKILL to the PID stored in \`~/.sharedinbox-agent-state.json\`. ## The risk Linux recycles PIDs. If the original \`claude\` process died between the previous cron tick and the current one, the stored PID may now belong to a completely different process — a system daemon, another user's process, or any other program. \`_agent_alive\` (os.kill(pid, 0)) and \`_kill_agent\` (os.kill(pid, 9)) are two separate syscalls with no atomicity guarantee, so there is a TOCTOU window even when the process exists at check time. Additionally, the state file is currently written world-readable (\`0664\`), meaning any local process could craft a \`{"pid": 1, ...}\` entry to make the loop send SIGKILL to init. (The state file permissions were fixed in b6a2f91, making this specific escalation moot — but the recycling race remains.) ## Fix Before calling \`os.kill(pid, 9)\`, verify the process is actually \`claude\`: \`\`\`python def _is_claude_process(pid: int) -> bool: try: comm = Path(f"/proc/{pid}/comm").read_text().strip() return comm in ("claude", "node") # claude may run as node except OSError: return False \`\`\` Then in \`_kill_agent\`: \`\`\`python def _kill_agent(state: dict) -> None: pid = state.get("pid") if pid and _is_claude_process(pid): try: os.kill(pid, 9) except ProcessLookupError: pass elif pid: print(f"WARNING: pid {pid} is not a claude process — skipping kill to avoid hitting recycled PID") \`\`\` Also consider storing \`/proc/{pid}/stat\` field 22 (starttime) at agent launch and comparing it before killing, for a stronger identity check.
guettlibot commented 2026-05-23 09:20:10 +00:00 (Migrated from codeberg.org)

Agent opened PR #163 but no CI run appeared on branch issue-160-fix after 17 min. The agent may not have pushed any commits. Please investigate and resume manually.

Agent opened PR #163 but no CI run appeared on branch `issue-160-fix` after 17 min. The agent may not have pushed any commits. Please investigate and resume manually.
guettlibot commented 2026-05-23 14:30:18 +00:00 (Migrated from codeberg.org)

Agent opened PR #163 but no CI run appeared on branch issue-160-fix after 328 min. The agent may not have pushed any commits. Please investigate and resume manually.

Agent opened PR #163 but no CI run appeared on branch `issue-160-fix` after 328 min. The agent may not have pushed any commits. Please investigate and resume manually.
guettlibot commented 2026-05-23 15:35:06 +00:00 (Migrated from codeberg.org)

Agent opened PR #163 but no CI run appeared on branch issue-160-fix after 392 min. The agent may not have pushed any commits. Please investigate and resume manually.

Agent opened PR #163 but no CI run appeared on branch `issue-160-fix` after 392 min. The agent may not have pushed any commits. Please investigate and resume manually.
guettlibot commented 2026-05-24 08:50:07 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:10:09 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:10:44 +00:00 (Migrated from codeberg.org)

Reopened: this issue was incorrectly closed by the agent loop's catch-up scan, which called _close_issue(160) after a merge attempt that silently exited 0 — but PR #163 was never actually merged (it is still open). The loop has been patched to verify the merge succeeded before closing the issue.

Reopened: this issue was incorrectly closed by the agent loop's catch-up scan, which called `_close_issue(160)` after a merge attempt that silently exited 0 — but PR #163 was never actually merged (it is still open). The loop has been patched to verify the merge succeeded before closing the issue.
guettlibot commented 2026-05-24 09:15:08 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:20:08 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:25:05 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:30:11 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:35:06 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:40:09 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 09:45:09 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 10:00:16 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 10:10:08 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 10:15:08 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 10:20:06 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
guettlibot commented 2026-05-24 10:25:07 +00:00 (Migrated from codeberg.org)

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.

Automatic merge of PR #163 failed (PR is still open after the merge command). Please merge manually.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: guettli/sharedinbox#160