fix: fail fast in CI — parallel hygiene/layer checks, no spurious retries #350

Merged
guettlibot merged 2 commits from refs/pull/350/head into main 2026-06-03 11:07:38 +00:00
guettlibot commented 2026-06-02 04:46:22 +00:00 (Migrated from codeberg.org)

Summary

Closes #349

Two bugs prevented check-dagger from failing fast when checks failed:

  • Hygiene + Layers checked sequentially — they are cheap structural checks with no dependency on each other. Running them in parallel (errgroup.Group) means failures are reported sooner.
  • Spurious retries from errgroup.WithContext — the backend and integration tests previously shared a derived context via errgroup.WithContext. When one test failed, the context was cancelled, causing the sibling test to emit "context canceled" in Dagger's --progress=plain output. The retry_dagger function in Taskfile.yml matched that string as a transient network error and re-ran the entire pipeline up to 3 times — a real test failure could take 30+ minutes to be reported instead of ~10.

Fix in ci/main.go:

  • Hygiene + layers now run in parallel with errgroup.Group
  • Backend + integration tests now use errgroup.Group (no shared cancel context), so a failure in one does not emit "context canceled" for the other

Fix in Taskfile.yml:

  • Removed context canceled from the retry_dagger grep pattern; the remaining patterns (connection reset, context deadline exceeded, connection refused, invalid return status code) still cover genuine network/engine transients

Test plan

  • Confirm the Forgejo CI run completes and, when a check fails, it fails fast (no 3× retry loop in logs)
  • Verify task check-dagger still retries on actual connection errors

🤖 Generated with Claude Code

## Summary Closes #349 Two bugs prevented `check-dagger` from failing fast when checks failed: - **Hygiene + Layers checked sequentially** — they are cheap structural checks with no dependency on each other. Running them in parallel (`errgroup.Group`) means failures are reported sooner. - **Spurious retries from `errgroup.WithContext`** — the backend and integration tests previously shared a derived context via `errgroup.WithContext`. When one test failed, the context was cancelled, causing the sibling test to emit `"context canceled"` in Dagger's `--progress=plain` output. The `retry_dagger` function in `Taskfile.yml` matched that string as a transient network error and re-ran the entire pipeline up to 3 times — a real test failure could take 30+ minutes to be reported instead of ~10. **Fix in `ci/main.go`:** - Hygiene + layers now run in parallel with `errgroup.Group` - Backend + integration tests now use `errgroup.Group` (no shared cancel context), so a failure in one does not emit `"context canceled"` for the other **Fix in `Taskfile.yml`:** - Removed `context canceled` from the `retry_dagger` grep pattern; the remaining patterns (`connection reset`, `context deadline exceeded`, `connection refused`, `invalid return status code`) still cover genuine network/engine transients ## Test plan - [ ] Confirm the Forgejo CI run completes and, when a check fails, it fails fast (no 3× retry loop in logs) - [ ] Verify `task check-dagger` still retries on actual connection errors 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign in to join this conversation.