The plain `prune` operation only removes unreferenced execution-cache
entries; named cache volumes (gradle-cache, go-build-cache, go-mod-cache,
pip-cache) are never reclaimed and grow without bound, causing the
dagger-data Docker volume to reach 200 GB over time.
Pass maxUsedSpace/targetSpace arguments to `prune` so Dagger also
reclaims named volumes when the total cache exceeds 75 GB, targeting
50 GB. Use an aggressive targetSpace of 20 GB for disk-space emergency
recovery in check-dagger. Add a dagger-prune task for manual use.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a 3-attempt retry loop around the resumable AAB upload that catches
httplib2.error.RedirectMissingLocation (a transient network error) and
retries with exponential backoff (10s, 20s). A fresh MediaFileUpload is
created on each attempt because resumable upload objects cannot be reused
after failure. Also adds TestUploadRetry covering the retry path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wrap the resumable bundle upload in a loop of up to _MAX_UPLOAD_ATTEMPTS (3)
attempts. On httplib2.error.RedirectMissingLocation, recreate MediaFileUpload
(resumable uploads cannot reuse the same object) and wait 10 s / 20 s before
retrying. After all attempts are exhausted, raise RuntimeError chained to the
last exception. Add tests covering the retry path, backoff delays, fresh
MediaFileUpload on each attempt, and exhaustion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch deploy_playstore.py from requests/AuthorizedSession to the
googleapiclient.discovery client with google-auth-httplib2, so that
AuthorizedHttp(timeout=300) enforces a hard socket timeout on all
requests and num_retries=3 on every .execute() call enables automatic
retries for transient failures.
Update flake.nix and ci/main.go to install the new dependencies
(google-api-python-client, google-auth-httplib2, httplib2) instead of
the old google-auth + requests pair.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace local `task publish-website` invocation with `fgj actions workflow run website.yml`
so the deploy runs in CI rather than on the local machine. Remove failure-tracking state
files and issue-creation logic — Forgejo Actions handles its own reporting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues from #179:
- crash_screen.dart now reads GIT_HASH compile-time constant and includes
'Git Commit: <hash>' in both the on-screen UI and the copied report, so
crash reports always show the exact build that crashed.
- _resolveDatabasePath() retry delays extended from [100, 300, 600] ms
(total ~1 s, 4 attempts) to [200, 500, 1000, 2000, 4000] ms (total
~7.7 s, 6 attempts) to handle slow/non-standard Android devices where
the path_provider Pigeon channel takes several seconds to become ready.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Increases the retry delays in `_resolveDatabasePath()` from `[100, 300, 600]` ms (~1 s) to `[200, 500, 1000, 2000]` ms (~3.7 s).
- Adds a regression test (`test/unit/database_path_test.dart`) that verifies `initDatabasePath()` does not throw when the `path_provider` channel is unavailable.
## Root cause
On some slow Android devices (e.g. the Motorola reported in #166), the `path_provider` Pigeon channel is not ready even several seconds after `runApp()` returns. The previous back-off budget of ~1 s was not enough, causing `_resolveDatabasePath()` to exhaust all retries and throw a `PlatformException`, crashing the app with the message shown in the issue.
## Test plan
- [ ] `flutter test test/unit/database_path_test.dart` passes (new regression test)
- [ ] `flutter test test/unit/` — all 325 unit tests pass
- [ ] `flutter analyze` — no issues
Fixes#166
Co-authored-by: Thomas SharedInbox <sharedinbox@thomas-guettler.de>
Reviewed-on: https://codeberg.org/guettli/sharedinbox/pulls/169
The Forgejo workflow_runs API has no head_branch field. For pull_request
events the branch lives in event_payload["pull_request"]["head"]["ref"];
for push events it is in prettyref. The old code used run.get("head_branch")
which always returned None, causing _latest_ci_run_for_branch to never find
the run and the loop to declare "no CI run after 15 min" and set the issue to
State/Question — even when CI had already passed.
Also fixes a pre-existing test mock that was missing the session_name kwarg.
Adds TestLatestCiRunForBranch covering both event types and the regression.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
env:SSH_PRIVATE_KEY passes the key through shell $() which strips the
trailing newline, causing dagger to write a truncated key that OpenSSH
rejects with "error in libcrypto". Using file: reads it directly from
disk, preserving exact content.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Dagger container running generate_build_history.py may not always
reach the deployment server (network constraints on the Dagger engine).
Rather than aborting the entire publish-website pipeline, log the SSH
verbose output (already added in the previous debug commit) and return
an empty file list so Hugo still builds and rsync still deploys the
site — just without updated build-history pages.
This unblocks the cron deploy that has been failing since c259d2da.
Temporary: print verbose SSH output on failure to identify why the
connection fails from inside the dagger container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger mounts the secret file with 0600 but the parent directory may
get created with world-readable permissions, causing SSH to refuse
the key with exit 255.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All other ssh/scp calls in the dagger module use explicit -i /root/.ssh/id_ed25519.
This one was missing it, causing exit 255 inside the dagger container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger parses .env directly and fails on multiline quoted values.
Move SSH_PRIVATE_KEY out of .env and export it from ~/.ssh/id_ed25519
in the wrapper instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tracks consecutive failure count in .fail_count. On the 5th failure
for the same SHA, creates a Prio/High + State/Ready Codeberg issue.
Before creating, checks local .last_issue_sha and queries Codeberg
open issues to avoid duplicates.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the last deploy failed and origin/main has not advanced, opens a
Prio/High + State/Ready issue via tea with the failing SHA, commit link,
and captured deploy output. Skips duplicate issues (tracked by
.last_issue_sha). Cron interval changed to */5.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: create log dir with mode 0700 and enforce it on
existing dirs; open log files with mode 0600; chmod state file
to 0600 after every write. Prevents other local processes from
reading agent output (which may contain credential paths) or
tampering with the state file's pid field.
- ci/main.go (TestAndroidFirebase): replace
echo "$FIREBASE_SA_KEY" > /tmp/key.json
with bash process substitution
--key-file=<(echo "$FIREBASE_SA_KEY")
The key is now passed via a file descriptor — it never touches
disk, so it cannot be stranded by a failed gcloud auth call or
snapshotted into the Dagger layer cache.
- ci.yml / deploy.yml: add "Cleanup TLS credentials" step
(if: always()) at the end of every job that calls
setup_dagger_remote.sh. Removes /tmp/dagger-tls,
/tmp/stunnel-dagger.conf, /tmp/stunnel.pid from the self-hosted
runner after each job, so client certs do not accumulate between
job runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get is pure Dart — it never invokes Gradle. The mutable
gradle-cache volume mount caused the same execution-cache instability
we just fixed for the pub cache: Dagger sees a changed volume and
cache-misses pubGetLayer() on every run.
The Gradle cache stays in Base(), which is only used for steps that
actually build Android code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The mutable flutter-pub-cache volume made the execution cache key unstable —
pub get cache-missed every run because the volume's mutable layer changed the
snapshot hash. Removing the volume lets Dagger snapshot packages inside the
execution-cache layer, which is stable and reclaimable via dagger prune.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On some Android versions the path_provider Pigeon channel
('dev.flutter.pigeon.path_provider_android.PathProviderApi.getApplicationSupportPath')
is not ready when initDatabasePath() runs before runApp(). The existing code
already catches PlatformException there, leaving _dbPath null — but the
LazyDatabase callback called getApplicationSupportDirectory() a second time
without any protection, causing an unhandled crash on those devices.
Fix: extract _resolveDatabasePath() which retries three times with back-off
(100 ms → 300 ms → 600 ms) before re-throwing with a descriptive error
message. By the time the database is first accessed (after runApp()), the
channel is almost always available; if it still isn't, the CrashScreen is
shown with a clear explanation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Only pick up issues created by guettli, guettlibot, or guettlibot2
to prevent the loop from acting on external/bot issues.
- Post an explanatory comment on the issue whenever the loop sets
State/Question (agent killed, no CI run, no push detected), so the
reason is visible without digging through cron logs. Closes#158.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Firebase CLI emits "A non-retryable error occurred." even for passing runs.
The grep -qwi 'error' triggered on this message despite gcloud exiting 0
and the result table showing Passed. The gcloud exit code, device-count,
and Passed checks are sufficient to detect real failures.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: agents now create an `issue-N-fix` branch and open a PR;
the loop discovers the PR via `fgj pr list`, tracks its CI run, squash-merges
on green, and falls back to the global-CI path if no PR exists (backward compat).
Adds `_find_pr_for_branch`, `_latest_ci_run_for_branch`, `_merge_pr` helpers.
- .forgejo/workflows/ci.yml: strip to the single fast `check` job only
(removes build-linux, deploy-playstore, publish-website).
- .forgejo/workflows/deploy.yml (new, replaces android-emulator-tests.yml):
scheduled hourly + workflow_dispatch; runs firebase tests, Play Store deploy,
Linux build/deploy, website publish; on completion sets CI/Full-Pass or
CI/Full-Fail label on the repo's DEPLOY_HEALTH_ISSUE tracking issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the agent exits immediately (e.g. rate-limit), the loop was closing the
pending issue against the *previous* CI run, which was still green.
Fix: record the latest CI run ID when an issue agent starts. If the run ID
hasn't changed when the agent exits, the agent pushed nothing → set
State/Question instead of closing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter_markdown 0.7.7+1 has been discontinued in favour of
flutter_markdown_plus. Switch the dependency and update both import
sites.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Capture gcloud auth stderr separately and fail on unexpected output;
ignore the two known informational lines ("Activated service account
credentials for: [...]" and "Updated property [core/project].") while
keeping a strict "fail if unknown stderr" check for anything else.
- Replace the narrow pattern grep (non-retryable error|infrastructure_failure|
test execution failed) with a broad whole-word case-insensitive grep for
'error', so any infrastructure or Firebase error in the output causes CI
failure.
- Verify that the number of device result rows in the result table matches
the expected device count (1), so a silent test-run failure cannot slip
through.
- Add scripts/test_firebase_check.sh with 18 unit tests for the three new
bash patterns (auth stderr filter, error-word detection, device count).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All production secrets (SSH key, Android keystore, Play Store config,
Firebase service account) are already typed as dagger.Secret and injected
via WithMountedSecret / WithSecretVariable. Add a Secrets section to
DAGGER.md to make this explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gradle-cache volume was mounted without an owner, so the root-owned
volume caused "Permission denied" when the ci user tried to create
gradle-8.14-all.zip.lck during bundleRelease. Add Owner: "ci" to all
three WithMountedCache calls so the ci user can write to the caches.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add -qq to apt-get update/install in Dagger toolchain to suppress
verbose package-list output (hundreds of lines on cold cache)
- Wrap sdkmanager in silent-on-success pattern — only shows output
on failure, like the build_runner and flutter pub get steps
- Set debug = warning in stunnel config to suppress LOG5 (info/notice)
startup lines while keeping LOG4 (warning) and above
- Add org.gradle.welcome=never to android/gradle.properties to
suppress the "Welcome to Gradle N.NN!" banner
- Filter SKIPPED Gradle tasks, Gradle Daemon startup messages, and
gcloud support-page promo lines in run_firebase_test.sh
Errors and warnings are preserved in all cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cirruslabs/flutter:3.41.6 image already has UID 1000 assigned to
another user, so `useradd -u 1000` exits with code 4 ("UID not unique")
and the ci user is never created. Dagger then fails to resolve `owner:
"ci"` on subsequent WithDirectory calls. Removing the explicit UID lets
useradd pick the next available one.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>