Add timeouts to all CI/CD tasks and runner scripts #415
Closed
opened 2026-06-04 07:58:55 +00:00 by guettli
2 comments
Labels: loop/code-done
Reference: guettli/sharedinbox#415
Goal
Implement execution time limits on all automation tasks, wrapper scripts, and CI/CD workflow jobs to prevent hung or looping processes from running indefinitely.
Action Items
- Add job-level timeouts (`timeout-minutes`) to all Forgejo Action workflows (such as `ci.yml`, `deploy.yml`, `firebase-tests.yml`, `renovate.yml`).
- Wrap long-running commands (such as `dagger call` or remote SSH executions) with the Linux `timeout` command in runner scripts (e.g., `timeout 30m dagger call ...`).

Now I have a complete picture. Here is the implementation plan:
Implementation Plan: Add Timeouts to All CI/CD Tasks and Runner Scripts
Current State
After auditing all workflows and scripts:
- Workflows that already set `timeout-minutes`: `deploy.yml` (all 5 jobs), `firebase-tests.yml` (both jobs), `renovate.yml`, `website.yml`
- Workflows missing `timeout-minutes`: `ci.yml` → `check` job; `windows-nightly.yml` → `windows-nightly` job (currently disabled with `if: false`)
- Network calls without timeouts: `ssh-keyscan` and `ssh -f -N -L` tunnel creation in `scripts/setup_dagger_remote.sh` (the subsequent `dagger core --help` already uses `timeout 45`)
- `Taskfile.yml`: `dagger call` invocations without timeouts: `test-backend`, `integration-ui`, `sync-reliability`, `ci-graph`, `deploy-linux`, `build-android-bundle`, `upload-android-bundle`, `publish-android`, `deploy-apk`, `publish-website`; `check-dagger` already uses `timeout --kill-after=10 600` ✅
- `scripts/run_firebase_test.sh`: the `dagger call` inside `_run()` has no per-attempt timeout

Step 1: Add missing `timeout-minutes` to workflow jobs
File: `.forgejo/workflows/ci.yml`
Add `timeout-minutes: 60` to the `check` job. The inner `check-dagger` task already enforces a 600 s (10 min) Dagger timeout with up to 3 retries, so 60 min is a safe ceiling that also covers checkout and Dagger setup overhead.
File: `.forgejo/workflows/windows-nightly.yml`
Add `timeout-minutes: 90` to the `windows-nightly` job. The job is currently gated with `if: false` (no runner registered), but adding the timeout now means it is correctly bounded when a Windows runner is eventually registered. 90 min accounts for slower Windows Flutter builds.

Step 2: Add timeouts to network operations in `scripts/setup_dagger_remote.sh`
Two calls can hang indefinitely if the remote host is unreachable:
- `ssh-keyscan`: wrap with `timeout 30`: `timeout 30 ssh-keyscan -H "$DAGGER_ENGINE_HOST" >> ~/.ssh/known_hosts 2>/dev/null`
- `ssh -f -N -L` tunnel creation: wrap with `timeout 30`: `timeout 30 ssh -i ~/.ssh/dagger_key -o StrictHostKeyChecking=no -f -N -L 8080:localhost:1774 "dagger@$DAGGER_ENGINE_HOST"`. The `-f` flag causes `ssh` to background itself once the tunnel is established, so the foreground process exits quickly on success; `timeout 30` catches the case where the connection never completes.
On failure of either call, the existing `set -euo pipefail` will abort the script with a non-zero exit status (124 when `timeout` fires).

Step 3: Wrap `dagger call` in Taskfile tasks
Apply `timeout --kill-after=10 <N>` before `dagger call` in each task. Use `--kill-after=10` so a SIGKILL follows 10 s after SIGTERM if Dagger does not respond (mirroring the existing `check-dagger` pattern).
Timeouts by category:
| Tasks | Timeout |
| --- | --- |
| `test-backend`, `integration-ui`, `sync-reliability`, `ci-graph` | 600 s (10 min) |
| `deploy-linux`, `publish-android`, `deploy-apk`, `build-android-bundle`, `upload-android-bundle`, `publish-website` | 1800 s (30 min) |

Exclusions:
- `stalwart`: intentionally long-running dev server; do not add a timeout
- `check-dagger`: already has its own timeout + retry logic ✅

Step 4: Add per-attempt timeout to `scripts/run_firebase_test.sh`
Inside the `_run()` function, wrap the `dagger call` with `timeout --kill-after=10 2400` (40 min per attempt). Firebase Test Lab jobs can take 20–30 min, so a single 40 min attempt stays under the 60 min job-level ceiling in `firebase-tests.yml`. Note that with up to 3 retries, attempts that each run to the 40 min limit would add up to more than 60 min; in that case the job-level `timeout-minutes` is what ends the run.

Step 5: Monitoring / warnings for long-running tasks
The issue requests "basic logging/warnings for tasks that run longer than expected." Minimal, idiomatic additions:
- `setup_dagger_remote.sh`: After the tunnel is established, print the elapsed time of each network call using `SECONDS` (built-in bash variable). Emit a `::warning::` line if `ssh-keyscan` or tunnel setup takes more than 10 s.
- `timeout` itself: when the limit is hit, `timeout` exits with code 124 and the task name in the output makes it self-evident which step hung. The existing `check-dagger` pattern of emitting a message on exit-124 (e.g. "hung in teardown after success; treating as exit 0") can be reused if that edge case applies to other tasks.
- `timeout-minutes` values make the hard limit explicit without any additional logging code.

Files to change (summary)
| File | Change |
| --- | --- |
| `.forgejo/workflows/ci.yml` | Add `timeout-minutes: 60` to `check` job |
| `.forgejo/workflows/windows-nightly.yml` | Add `timeout-minutes: 90` to `windows-nightly` job |
| `scripts/setup_dagger_remote.sh` | `timeout 30` on `ssh-keyscan`; `timeout 30` on `ssh -f -N -L`; elapsed-time warnings |
| `Taskfile.yml` | `timeout --kill-after=10 <N>` on all bare `dagger call` invocations (10 tasks) |
| `scripts/run_firebase_test.sh` | `timeout --kill-after=10 2400` inside `_run()` around `dagger call` |

The PR is open at https://codeberg.org/guettli/sharedinbox/pulls/432.
Here's what was implemented for issue #415:
Workflow jobs: added missing `timeout-minutes`:
- `ci.yml` → `check` job: 60 min
- `windows-nightly.yml` → `windows-nightly` job: 90 min (ready for when a Windows runner is registered)
`scripts/setup_dagger_remote.sh`: two network calls that could hang indefinitely are now wrapped with `timeout 30`:
- `ssh-keyscan` (host key retrieval)
- `ssh -f -N -L` (background tunnel creation)
Both emit a `::warning::` annotation if they take more than 10 s.
`Taskfile.yml`: all bare `dagger call` invocations are now wrapped with `timeout --kill-after=10 <N>` (600 s for test/query tasks, 1800 s for build/deploy tasks, 60 s for `ci-graph`). `stalwart` (intentional long-running dev server) and `check-dagger` (already had its own timeout+retry loop) were left untouched.
`scripts/run_firebase_test.sh`: added `timeout --kill-after=10 2400` per attempt around `dagger call`; the retry loop now detects exit 124 and emits a `::warning::` annotation instead of silently retrying.
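For readers who want to see the pattern in one place, the per-attempt timeout, the exit-124 detection, and the `SECONDS`-based slow-call warning described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the PR: `run_with_retries`, `MAX_ATTEMPTS`, and `WARN_AFTER` are hypothetical names, and the defaults (3 attempts, 10 s warning threshold) are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_retries, MAX_ATTEMPTS and WARN_AFTER are hypothetical
# names chosen for illustration; they are not taken from the repository.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"   # assumed number of attempts
WARN_AFTER="${WARN_AFTER:-10}"      # warn if a successful call took longer (seconds)

# run_with_retries <limit-seconds> <command> [args...]
# Each attempt runs under `timeout --kill-after=10 <limit>`: SIGTERM is sent at
# the limit, and SIGKILL follows 10 s later if the process is still alive.
run_with_retries() {
  local limit="$1"; shift
  local attempt rc start elapsed
  for ((attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)); do
    start=$SECONDS                       # bash built-in: seconds since shell start
    rc=0
    timeout --kill-after=10 "$limit" "$@" || rc=$?   # `|| rc=$?` keeps `set -e` happy
    elapsed=$((SECONDS - start))
    if ((rc == 0)); then
      if ((elapsed > WARN_AFTER)); then
        echo "::warning::'$1' took ${elapsed}s (expected <= ${WARN_AFTER}s)"
      fi
      return 0
    fi
    if ((rc == 124)); then
      # 124 is the status GNU timeout reports when the limit was hit.
      # If the SIGKILL fallback was needed, the status is 137 instead.
      echo "::warning::'$1' timed out after ${limit}s (attempt ${attempt}/${MAX_ATTEMPTS})"
    else
      echo "'$1' failed with exit code ${rc} (attempt ${attempt}/${MAX_ATTEMPTS})" >&2
    fi
  done
  return "$rc"                           # status of the last attempt
}
```

A call such as `run_with_retries 2400 dagger call ...` would then bound every attempt to 40 min and surface a timeout as a warning annotation rather than retrying silently. The function returns the status of the last attempt, so a caller running under `set -e` still aborts when all attempts fail.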