Add timeouts to all CI/CD tasks and runner scripts #415

Closed
opened 2026-06-04 07:58:55 +00:00 by guettli · 2 comments
guettli commented 2026-06-04 07:58:55 +00:00 (Migrated from codeberg.org)

Add timeouts to all CI/CD tasks and runner scripts

Goal

Implement execution time limits on all automation tasks, wrapper scripts, and CI/CD workflow jobs to prevent hung or looping processes from running indefinitely.

Action Items

  • Workflow Timeouts: Add explicit job-level timeouts (timeout-minutes) to all Forgejo Action workflows (such as ci.yml, deploy.yml, firebase-tests.yml, renovate.yml).
  • Process Timeouts: Wrap long-running or network-dependent commands (like dagger call or remote SSH executions) with the Linux timeout command in runner scripts (e.g., timeout 30m dagger call ...).
  • Monitoring/Cleanup: Add basic logging/warnings for tasks that run longer than expected.
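As a rough illustration of the process-timeout item, the sketch below bounds a command with GNU coreutils `timeout` and reports when the limit was hit. The helper name `run_bounded` is hypothetical, and `sleep` stands in for a long-running call such as `dagger call ...`; it is not taken from the repository's scripts.

```shell
#!/usr/bin/env bash
# Sketch: bound a command's runtime with GNU coreutils `timeout`.
# `run_bounded` is a hypothetical helper, not part of the repository.

run_bounded() {
  local limit="$1"; shift
  local status=0
  timeout "$limit" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    # 124 is the status `timeout` returns when the limit was reached.
    echo "timed out after ${limit}: $*" >&2
  fi
  # Any other status is the wrapped command's own exit status.
  return "$status"
}

# `sleep` is a stand-in for a slow or hung command.
run_bounded 5s true && echo "finished in time"
run_bounded 1s sleep 3 || echo "exit status: $?"
```

Passing the wrapped command's own status through unchanged keeps ordinary failures distinguishable from timeouts.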
guettlibot commented 2026-06-04 08:05:03 +00:00 (Migrated from codeberg.org)

Now I have a complete picture. Here is the implementation plan:


Implementation Plan: Add Timeouts to All CI/CD Tasks and Runner Scripts

Current State

After auditing all workflows and scripts:

  • Already have timeouts: deploy.yml (all 5 jobs), firebase-tests.yml (both jobs), renovate.yml, website.yml
  • Missing timeouts: ci.yml → check job; windows-nightly.yml → windows-nightly job (currently disabled with if: false)
  • Script network calls without timeouts: ssh-keyscan and ssh -f -N -L tunnel creation in scripts/setup_dagger_remote.sh (the subsequent dagger core --help already uses timeout 45)
  • Taskfile dagger call invocations without timeouts: test-backend, integration-ui, sync-reliability, ci-graph, deploy-linux, build-android-bundle, upload-android-bundle, publish-android, deploy-apk, publish-website; check-dagger already uses timeout --kill-after=10 600
  • scripts/run_firebase_test.sh: The dagger call inside _run() has no per-attempt timeout

Step 1 — Add missing timeout-minutes to workflow jobs

File: .forgejo/workflows/ci.yml

Add timeout-minutes: 60 to the check job. The inner check-dagger task already enforces a 600 s (10 min) Dagger timeout with up to 3 retries, so 60 min is a safe ceiling that also covers checkout and Dagger setup overhead.

File: .forgejo/workflows/windows-nightly.yml

Add timeout-minutes: 90 to the windows-nightly job. The job is currently gated with if: false (no runner registered), but adding the timeout now means it is correctly bounded when a Windows runner is eventually registered. 90 min accounts for slower Windows Flutter builds.
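One possible way to double-check Step 1 afterwards is a coarse audit that lists workflow files containing no `timeout-minutes` key at all. This is only a sketch: the function name `list_unbounded_workflows` is hypothetical, and the check is file-level, so a file with several jobs where only one has a timeout would still pass.

```shell
#!/usr/bin/env bash
# Sketch: list workflow files that contain no `timeout-minutes:` line.
# File-level check only; it does not verify every job individually.

list_unbounded_workflows() {
  local dir="${1:-.forgejo/workflows}"
  # Nothing to audit if the directory does not exist.
  [ -d "$dir" ] || return 0
  # grep -L prints the names of files with NO matching line.
  # `|| true` ignores grep's exit status, which differs between versions.
  grep -L -E '^[[:space:]]*timeout-minutes:' "$dir"/*.yml || true
}

list_unbounded_workflows .forgejo/workflows
```

An empty result means every workflow file mentions the key at least once; a per-job check would need a YAML-aware tool.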


Step 2 — Add timeouts to network operations in scripts/setup_dagger_remote.sh

Two calls can hang indefinitely if the remote host is unreachable:

  1. ssh-keyscan — wrap with timeout 30: timeout 30 ssh-keyscan -H "$DAGGER_ENGINE_HOST" >> ~/.ssh/known_hosts 2>/dev/null
  2. SSH tunnel creation — wrap with timeout 30: timeout 30 ssh -i ~/.ssh/dagger_key -o StrictHostKeyChecking=no -f -N -L 8080:localhost:1774 "dagger@$DAGGER_ENGINE_HOST". The -f flag causes ssh to background itself once the tunnel is established, so the foreground process exits quickly on success; timeout 30 catches the case where the connection never completes.

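On failure of either call, the existing set -euo pipefail aborts the script. A timeout surfaces as exit status 124; timeout itself prints no message by default, so the status code is the main indication of what happened.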


Step 3 — Wrap dagger call in Taskfile tasks

Apply timeout --kill-after=10 <N> before dagger call in each task. Use --kill-after=10 so a SIGKILL follows 10 s after SIGTERM if Dagger does not respond (mirroring the existing check-dagger pattern).

Timeouts by category:

| Task | Timeout | Rationale |
|---|---|---|
| test-backend, integration-ui, sync-reliability, ci-graph | 600 s (10 min) | Test/query pipelines; the CI job already caps at 60 min |
| deploy-linux, publish-android, deploy-apk, build-android-bundle, upload-android-bundle, publish-website | 1800 s (30 min) | Build + deploy pipelines; CI job caps at 60 min |

Exclusions:

  • stalwart — intentionally long-running dev server; do not add a timeout
  • check-dagger — already has its own timeout + retry logic
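The reason for pairing `timeout` with `--kill-after` in Step 3 can be seen with a small experiment. In this sketch a stand-in process ignores SIGTERM, so only the follow-up SIGKILL stops it. The durations are shortened for the demo, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: a command that ignores SIGTERM is only stopped by the SIGKILL
# that --kill-after sends. Durations are shortened for the demo.

run_stubborn_with_limit() {
  local status=0
  # SIGTERM after 1 s; SIGKILL 1 s later if the command is still alive.
  # The inner bash ignores TERM, standing in for a process hung in teardown.
  timeout --kill-after=1 1 bash -c 'trap "" TERM; sleep 30' || status=$?
  return "$status"
}

# With GNU timeout the status is typically 137 (128 + SIGKILL) here,
# rather than the 124 reported for a plain SIGTERM timeout.
run_stubborn_with_limit || echo "exit status: $?"
```

Without `--kill-after`, the same command would keep running for the full 30 s, because the SIGTERM is never honoured.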

Step 4 — Add per-attempt timeout to scripts/run_firebase_test.sh

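Inside the _run() function, wrap the dagger call with timeout --kill-after=10 2400 (40 min per attempt). Firebase Test Lab jobs can take 20–30 min, so 40 min leaves headroom for a single attempt under the 60 min job-level ceiling in firebase-tests.yml. Note that two attempts that both run to the limit (2 × 40 min = 80 min) would exceed that ceiling, so the job-level timeout remains the effective upper bound for the retry loop as a whole.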
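A per-attempt timeout inside a retry loop might look roughly like the sketch below. The real `_run()` function is not shown in this thread, so `retry_with_timeout` and its arguments are hypothetical, and `sleep` stands in for the `dagger call`.

```shell
#!/usr/bin/env bash
# Sketch: retry loop with a per-attempt limit. Not the repository's _run().

retry_with_timeout() {
  local attempts="$1" limit="$2"; shift 2
  local n=1 status=0
  while [ "$n" -le "$attempts" ]; do
    status=0
    timeout --kill-after=10 "$limit" "$@" || status=$?
    if [ "$status" -eq 0 ]; then
      return 0
    elif [ "$status" -eq 124 ]; then
      # Make timeouts visible instead of retrying silently.
      echo "::warning::attempt ${n}/${attempts} timed out after ${limit}s"
    else
      echo "attempt ${n}/${attempts} failed with exit status ${status}" >&2
    fi
    n=$((n + 1))
  done
  # Report the status of the last attempt.
  return "$status"
}

# Demo with short durations; `sleep` is a stand-in for the real command.
retry_with_timeout 2 1 sleep 3 || echo "gave up with status $?"
```

The worst-case wall time of such a loop is roughly the attempt count multiplied by the per-attempt limit, which is worth comparing against the job-level `timeout-minutes`.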


Step 5 — Monitoring / warnings for long-running tasks

The issue requests "basic logging/warnings for tasks that run longer than expected." Minimal, idiomatic additions:

  • setup_dagger_remote.sh: After the tunnel is established, print the elapsed time of each network call using SECONDS (built-in bash variable). Emit a ::warning:: line if ssh-keyscan or tunnel setup takes more than 10 s.
  • Taskfile tasks: No new logging needed beyond timeout itself — when the limit is hit, timeout exits with code 124 and the task name in the output makes it self-evident which step hung. The existing check-dagger pattern of emitting a message on exit-124 (e.g. "hung in teardown after success; treating as exit 0") can be reused if that edge case applies to other tasks.
  • Workflows: Forgejo Actions already logs step wall-clock times in the run UI; the new timeout-minutes values make the hard limit explicit without any additional logging code.
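The elapsed-time warning described for `setup_dagger_remote.sh` could be sketched as follows, using bash's built-in `SECONDS` counter. The helper name `warn_if_slow` is hypothetical, and `sleep` stands in for `ssh-keyscan` or the tunnel setup.

```shell
#!/usr/bin/env bash
# Sketch: run a command and emit a ::warning:: line if it exceeded a soft
# limit. Requires bash, since SECONDS is a bash built-in variable.

warn_if_slow() {
  local label="$1" soft_limit="$2"; shift 2
  local start=$SECONDS status=0
  "$@" || status=$?
  local elapsed=$(( SECONDS - start ))
  if [ "$elapsed" -gt "$soft_limit" ]; then
    echo "::warning::${label} took ${elapsed}s (expected <= ${soft_limit}s)"
  fi
  # Preserve the wrapped command's exit status.
  return "$status"
}

# Demo: a 3 s command against a 1 s soft limit triggers the warning.
warn_if_slow "demo step" 1 sleep 3
```

The soft limit only produces a log line. It can be combined with `timeout` as the hard limit by passing `timeout 30 ...` as the wrapped command.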

Files to change (summary)

| File | Change |
|---|---|
| .forgejo/workflows/ci.yml | Add timeout-minutes: 60 to check job |
| .forgejo/workflows/windows-nightly.yml | Add timeout-minutes: 90 to windows-nightly job |
| scripts/setup_dagger_remote.sh | timeout 30 on ssh-keyscan; timeout 30 on ssh -f -N -L; elapsed-time warnings |
| Taskfile.yml | timeout --kill-after=10 <N> on all bare dagger call invocations (10 tasks) |
| scripts/run_firebase_test.sh | timeout --kill-after=10 2400 inside _run() around dagger call |
guettlibot commented 2026-06-04 20:45:02 +00:00 (Migrated from codeberg.org)

The PR is open at https://codeberg.org/guettli/sharedinbox/pulls/432.


Here's what was implemented for issue #415:

Workflow jobs — added missing timeout-minutes:

  • ci.yml → check job: 60 min
  • windows-nightly.yml → windows-nightly job: 90 min (ready for when a Windows runner is registered)

scripts/setup_dagger_remote.sh — two network calls that could hang indefinitely are now wrapped with timeout 30:

  • ssh-keyscan (host key retrieval)
  • ssh -f -N -L (background tunnel creation)

Both emit a ::warning:: annotation if they take more than 10 s.

Taskfile.yml — all bare dagger call invocations are now wrapped with timeout --kill-after=10 <N> (600 s for test/query tasks, 1800 s for build/deploy tasks, 60 s for ci-graph). stalwart (intentional long-running dev server) and check-dagger (already had its own timeout+retry loop) were left untouched.

scripts/run_firebase_test.sh — added timeout --kill-after=10 2400 per attempt around dagger call; the retry loop now detects exit 124 and emits a ::warning:: annotation instead of silently retrying.

Reference: guettli/sharedinbox#415