## Summary
Closes#349
Two bugs prevented `check-dagger` from failing fast when checks failed:
- **Hygiene + Layers checked sequentially** — they are cheap structural checks with no dependency on each other. Running them in parallel (`errgroup.Group`) means failures are reported sooner.
- **Spurious retries from `errgroup.WithContext`** — the backend and integration tests previously shared a derived context via `errgroup.WithContext`. When one test failed, the context was cancelled, causing the sibling test to emit `"context canceled"` in Dagger's `--progress=plain` output. The `retry_dagger` function in `Taskfile.yml` matched that string as a transient network error and re-ran the entire pipeline up to 3 times — a real test failure could take 30+ minutes to be reported instead of ~10.
**Fix in `ci/main.go`:**
- Hygiene + layers now run in parallel with `errgroup.Group`
- Backend + integration tests now use `errgroup.Group` (no shared cancel context), so a failure in one does not emit `"context canceled"` for the other
**Fix in `Taskfile.yml`:**
- Removed `context canceled` from the `retry_dagger` grep pattern; the remaining patterns (`connection reset`, `context deadline exceeded`, `connection refused`, `invalid return status code`) still cover genuine network/engine transients
## Test plan
- [ ] Confirm the Forgejo CI run completes and, when a check fails, it fails fast (no 3× retry loop in logs)
- [ ] Verify `task check-dagger` still retries on actual connection errors
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Thomas SharedInbox <sharedinbox@thomas-guettler.de>
Co-authored-by: guettli <guettli@noreply.codeberg.org>
Reviewed-on: https://codeberg.org/guettli/sharedinbox/pulls/350
Three unguarded blocking calls caused CI to hang until the 60-min timeout:
- dagger query prune steps had no timeout; || true only catches errors, not hangs
- docker info (added in d905cd6) had no timeout if Docker socket is unresponsive
- until portfile loop in check-dagger spun forever if otel-receiver.py crashed
Fixes: timeout 120 on all dagger query prune calls, timeout 30 on docker info,
and a kill -0 process-alive guard on the portfile until loop with fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a Renovate() Dagger function using the forgejo platform and a
.forgejo/workflows/renovate.yml workflow triggered at 06:00 UTC daily.
Uses RENOVATE_FORGEJO_TOKEN secret; no dedicated Renovate service account needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
env:SSH_PRIVATE_KEY passes the key through shell $() which strips the
trailing newline, causing dagger to write a truncated key that OpenSSH
rejects with "error in libcrypto". Using file: reads it directly from
disk, preserving exact content.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add scripts/run_firebase_test.sh that strips ANSI codes and removes
UP-TO-DATE task lines, libsqlite warnings, Gradle deprecation notices
and other high-volume noise before it hits the CI log.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements issue #132. Builds debug app APK + androidTest APK via Dagger,
then runs them on Firebase Test Lab using the FIREBASE_TEST_LAB_SERVICE_ACCOUNT_KEY
secret and FIREBASE_PROJECT_ID variable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GET /shutdown endpoint on otel-receiver.py is the one clean shutdown
path. cleanup() only needs to remove temp files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename ci/otelrecv.py to ci/otel-receiver.py for readability.
Replace SIGTERM+wait shutdown (which could hang indefinitely) with an
HTTP-based approach: add GET /shutdown to otel-receiver.py that calls
self.server.shutdown() directly. After dagger call returns, curl that
endpoint so the receiver prints its timing report and exits cleanly.
Cleanup is reduced to a SIGKILL fallback in case the process is already
gone.
Also fix the do_GET handler to reference self.server instead of the
local variable server, which was inaccessible from the handler class.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove per-request debug logs from otelrecv.py (POST, decoding,
decoded, 200 sent, signal) that were added to diagnose the CI hang,
which has since been resolved.
Remove verbose [HH:MM:SS] timestamp messages from check-dagger
(start, pipeline done, otelrecv started/ready, final RC, cleanup
start/done) for the same reason.
Fix cleanup to send SIGTERM + wait instead of SIGKILL so the OTEL
timing report is actually printed at the end of each CI run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a Ci.Graph() Dagger function that emits a Mermaid flowchart showing
both the Dagger Check pipeline (toolchain → pubGetLayer → parallel steps)
and the Codeberg CI job dependencies (check → build-linux / deploy-playstore
→ publish-website).
Usage: dagger call -m ci --source=. graph
task ci-graph
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wait "$RECV_PID" was blocking despite kill -9 (possibly because $RECV_PID
was garbled by ANSI escape codes from dagger output, making kill target the
wrong PID). Fix:
- Remove wait entirely — zombie is reaped when the shell exits
- Add pkill -9 -f otelrecv.py as fallback in case kill-by-PID misses
- Log PID at capture time to verify correctness in CI logs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three changes:
- cleanup() now uses kill -9 instead of kill (SIGTERM) to prevent wait hanging
if otelrecv's signal handler stalls
- adds [HH:MM:SS] log lines at key points so CI logs show exactly where time is spent
- restores OTEL env vars (via env VAR=val) since they were confirmed not to cause the hang
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dagger ignores SIGTERM, keeping the pipe's write end open; tee can never
get EOF and the script hangs. --kill-after=10 follows up with SIGKILL which
closes the pipe and unblocks the script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On network errors (connection reset, context canceled, connection refused)
retry the dagger call rather than failing immediately. Real test failures
propagate without retry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dagger call hangs after function completion due to HTTP/2 teardown bug in
remote engine mode. Capture output via tee; if timeout fires but output
contains "All tests passed", exit 0 instead of 124.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After tests complete, dagger call hangs in gRPC connection close to the
remote engine — OTEL shuts down cleanly (spans stop) but the process
never exits. Wrapping with timeout 900s and treating exit 124 as success
unblocks CI and lets the OTEL timing report print.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
http/json is not supported by the Go OTEL SDK used in Dagger v0.20.8.
Switch to http/protobuf (the SDK default) and rewrite the Python receiver
to decode binary protobuf using stdlib struct — no pip required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger v0.20.8 only supports 'grpc' and 'http/protobuf' OTLP protocols;
'http/json' triggers a WARN and exports nothing. The new approach pipes
dagger's --progress=plain output through a Python script that echoes it
in real-time and prints a timing table at EOF. No HTTP server, no port
files, no protocol issues — works locally and in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
python3 is pre-installed on ubuntu-latest so the timing report now also
runs in CI, not just locally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TIMINGFILE=$(mktemp) was an unnecessary /tmp path. The receiver already
prints its report to stdout on shutdown; wait $RECV_PID captures it in
place. Only PORTFILE remains in /tmp (unique via mktemp, deleted in cleanup).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds ci/otelrecv/main.go — a minimal OTLP HTTP/JSON trace receiver that
listens on a random port (port 0) so parallel runs never collide.
The check-dagger Taskfile task now starts the receiver in the background,
passes the port via a mktemp file, runs dagger with OTEL env vars set,
then prints a per-span timing report on shutdown. Falls back to plain
dagger call when Go is not available (e.g. CI containers without Go).
First run will show raw attribute keys so we can learn Dagger's exact
telemetry format and refine the cached/live detection logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BuildAndroidRelease() drops all params and builds with --build-number 1
(no keystore injected, Gradle uses debug signing). The command is now
stable across all commits — full Dagger cache hit whenever source is
unchanged.
Three new Dagger functions handle the post-cache steps:
- StampAndroidVersionCode(aab, versionCode): pure-stdlib Python patches
the AAB's compiled manifest proto (android:versionCode resource ID
0x0101021b) and strips META-INF/ to clear the old signature.
- SignAndroidBundle(aab, keystoreBase64, keystorePassword): decodes the
base64 keystore secret and re-signs with jarsigner.
- PublishAndroid(ctx, playStoreConfig, keystoreBase64, keystorePassword):
chains all three + UploadToPlayStore, computing time.Now().Unix() as
the versionCode internally.
Taskfile: build-android-bundle simplified (no keystore params); publish-
android now calls publish-android in a single Dagger call instead of the
two-step build-then-upload.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
$(date +%s) changed every run, making the flutter build WithExec args
unique each time and busting the Dagger layer cache (500s build every run).
$(git log -1 --format=%ct HEAD) is stable for the same commit, so a
retry of a failed upload gets a full cache hit on the build step.
Still monotonically increasing across commits, satisfying Play Store.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old fvm-based task had the same name as the new Dagger-based one,
causing go-task to error immediately (1-second CI failure).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pass --build-number $(date +%s) to flutter build for both APK and AAB
so each CI run gets a unique version code (fixes "already been used" error)
- Extract UploadToPlayStore(aab, playStoreConfig) as its own Dagger function
so the build and upload are independently callable
- Add build-android-bundle task (exports AAB via dagger export) and
upload-android-bundle task (calls UploadToPlayStore with the local file)
- CI deploy-playstore job now has two steps: Build Android Bundle and
Upload to Play Store, so a failed upload can be retried without rebuilding
- deploy-apk also gets --build-number to avoid version code collisions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both BuildAndroidApk and BuildAndroidRelease were using the debug
signing config because the keystore and password were never forwarded
into the Dagger container. Add setupKeystore() helper that decodes
ANDROID_KEYSTORE_BASE64 into android/app/upload-keystore.jks and
sets ANDROID_KEYSTORE_PASSWORD, then wire both secrets through
DeployApk, PublishAndroid, and the Taskfile/CI env blocks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add -q (quiet) flag to all dagger call invocations to suppress INFO-level
engine messages while keeping warnings and errors visible. Set DAGGER_NO_NAG=1
globally to suppress the Dagger Cloud tracing nag line. --progress=plain
is retained on all calls as required.
Refactor the CI pipeline to use WithServiceBinding for the Stalwart mail
server, replacing legacy shell scripts and manual port management.
Introduces pre-seeded data for the Stalwart service to avoid network
hits and improves headless UI testing with Xvfb.
The builds page at /builds/ was empty because generate-build-history
only ran inside deploy-playstore; if that job failed early (e.g. Play
Store secrets not configured) the website was never updated, and the
build-linux job never triggered a website update at all.
Changes:
- generate_build_history.py: extend to cover Linux tarballs in addition
to Android APKs, capped at MAX_BUILDS_PER_PLATFORM (30) each
- Taskfile: add website-publish task (generate-build-history +
website-deploy), exclude *.tar.gz from rsync, update descriptions
- .forgejo/workflows/ci.yml: add publish-website job that waits for
both build-linux and deploy-playstore (using always() so it runs
even when deploy-playstore fails), then removes the duplicate
generate/deploy steps from deploy-playstore
- .github/workflows/ci.yml: add deploy job that deploys Linux build,
generates build history, builds Hugo site, and rsyncs to server
- .gitignore: ignore website/content/builds/_index.md (generated),
Python __pycache__, and widget test failure screenshots
- stalwart-dev/integration_ui_test.sh: use ${USER:-$(id -un)} for
robustness in environments where USER is unset
- scripts/test_generate_build_history.py: unit tests for parse_builds
and render_entries covering both platforms
Generated content (builds/_index.md and per-day pages) is not tracked
in git; it is produced at CI time and rsynced to the server.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add check-mocks task that re-runs build_runner and fails if any
*.mocks.dart file differs from what is committed. Wired into
check-fast (pre-commit) and added as an early CI step so stale
mocks are caught before the full test suite runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds build-windows-release and deploy-windows-to-server Taskfile tasks,
a build-windows CI job (requires a windows-runner self-hosted runner),
and extends updateInfoProvider to also cover Platform.isWindows.
latest.json is now extended with a 'windows' key; both deploy tasks
preserve the other platform's URL when updating the file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Build task embeds GIT_HASH via --dart-define; new deploy-linux-to-server task
packages a tar.gz and updates latest.json on the server. The account list screen
shows a MaterialBanner when a newer Linux build is available.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/generate_build_history.py: SSH into server, list APKs under
public_html/builds/YYYY/MM/DD/, fetch commit titles from Codeberg API,
and write Hugo content pages to website/content/builds/
- Taskfile: add deploy-apk-to-server and generate-build-history tasks;
add --exclude='*.apk' to website-deploy rsync so APKs survive redeploy
- CI: after Play Store deploy, set up SSH key, scp APK, generate history,
then deploy website
- .gitignore: exclude website/content/builds/ (generated at deploy time)
- website/hugo.toml: add Builds nav item
Closes#73
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The (YY-20)mmddHHMM formula generates ~605M for 2026, which is lower
than existing epoch-second deployments (~1.747B). Google Play rejects
version code regressions at commit time (403 Forbidden).
Blocked — see issue #63 for context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces epoch seconds with a compact date-based integer so the Play
Store version code is interpretable by humans while staying below the
2 100 000 000 upper bound until ~2040.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>