The Dagger engine stopped responding (connection refused) after the
previous run exhausted disk space and crashed it. Two changes:
1. setup_dagger_remote.sh: retry the nc probe up to 5 times with 30 s
delays so a transient crash/restart window doesn't immediately fail
the job.
2. ci.yml: add a post-check prune step (if: always()) so the engine
cache is cleaned up after every run, reducing the chance of disk
exhaustion on the next run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix (retry × 3 with 60 s sleep) was not enough: all three
attempts still failed because the engine cache stayed full throughout.
Add an explicit `dagger query '{ engine { localCache { prune } } }'`
call (a) as a proactive step in ci.yml right after the stunnel setup,
and (b) inside the retry handler before each back-off sleep (now 90 s
instead of 60 s). The prune evicts stale execution-cache snapshots
(e.g. old pubspec.lock layers) so fresh disk is available when flutter
pub get runs. The `|| true` guard makes the prune non-fatal if the
query syntax changes between Dagger versions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger engine occasionally runs out of disk during `flutter pub get`
when multiple CI jobs run in parallel. Space typically frees up within
~60 seconds as other containers finish. Add "No space left on device"
as a retryable condition with a 60 s back-off so PR runs survive the
transient shortage (run 4199480 was the trigger).
Both isCoreLibraryDesugaringEnabled = true in compileOptions and the
coreLibraryDesugaring("com.android.tools:desugar_jdk_libs:2.1.4")
dependency are already present in android/app/build.gradle.kts from
the earlier fix in #37. This commit closes issue #183 which was opened
to track the same requirement.
Replace local `task publish-website` invocation with `fgj actions workflow run website.yml`
so the deploy runs in CI rather than on the local machine. Remove failure-tracking state
files and issue-creation logic — Forgejo Actions handles its own reporting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues from #179:
- crash_screen.dart now reads GIT_HASH compile-time constant and includes
'Git Commit: <hash>' in both the on-screen UI and the copied report, so
crash reports always show the exact build that crashed.
- _resolveDatabasePath() retry delays extended from [100, 300, 600] ms
(total ~1 s, 4 attempts) to [200, 500, 1000, 2000, 4000] ms (total
~7.7 s, 6 attempts) to handle slow/non-standard Android devices where
the path_provider Pigeon channel takes several seconds to become ready.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Increases the retry delays in `_resolveDatabasePath()` from `[100, 300, 600]` ms (~1 s) to `[200, 500, 1000, 2000]` ms (~3.7 s).
- Adds a regression test (`test/unit/database_path_test.dart`) that verifies `initDatabasePath()` does not throw when the `path_provider` channel is unavailable.
## Root cause
On some slow Android devices (e.g. the Motorola reported in #166), the `path_provider` Pigeon channel is not ready even several seconds after `runApp()` returns. The previous back-off budget of ~1 s was not enough, causing `_resolveDatabasePath()` to exhaust all retries and throw a `PlatformException`, crashing the app with the message shown in the issue.
## Test plan
- [ ] `flutter test test/unit/database_path_test.dart` passes (new regression test)
- [ ] `flutter test test/unit/` — all 325 unit tests pass
- [ ] `flutter analyze` — no issues
Fixes#166
Co-authored-by: Thomas SharedInbox <sharedinbox@thomas-guettler.de>
Reviewed-on: https://codeberg.org/guettli/sharedinbox/pulls/169
The Forgejo workflow_runs API has no head_branch field. For pull_request
events the branch lives in event_payload["pull_request"]["head"]["ref"];
for push events it is in prettyref. The old code used run.get("head_branch")
which always returned None, causing _latest_ci_run_for_branch to never find
the run and the loop to declare "no CI run after 15 min" and set the issue to
State/Question — even when CI had already passed.
Also fixes a pre-existing test mock that was missing the session_name kwarg.
Adds TestLatestCiRunForBranch covering both event types and the regression.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
env:SSH_PRIVATE_KEY passes the key through shell $() which strips the
trailing newline, causing dagger to write a truncated key that OpenSSH
rejects with "error in libcrypto". Using file: reads it directly from
disk, preserving exact content.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Dagger container running generate_build_history.py may not always
reach the deployment server (network constraints on the Dagger engine).
Rather than aborting the entire publish-website pipeline, log the SSH
verbose output (already added in the previous debug commit) and return
an empty file list so Hugo still builds and rsync still deploys the
site — just without updated build-history pages.
This unblocks the cron deploy that has been failing since c259d2da.
Temporary: print verbose SSH output on failure to identify why the
connection fails from inside the dagger container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger mounts the secret file with 0600 but the parent directory may
get created with world-readable permissions, causing SSH to refuse
the key with exit 255.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All other ssh/scp calls in the dagger module use explicit -i /root/.ssh/id_ed25519.
This one was missing it, causing exit 255 inside the dagger container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger parses .env directly and fails on multiline quoted values.
Move SSH_PRIVATE_KEY out of .env and export it from ~/.ssh/id_ed25519
in the wrapper instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tracks consecutive failure count in .fail_count. On the 5th failure
for the same SHA, creates a Prio/High + State/Ready Codeberg issue.
Before creating, checks local .last_issue_sha and queries Codeberg
open issues to avoid duplicates.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the last deploy failed and origin/main has not advanced, opens a
Prio/High + State/Ready issue via tea with the failing SHA, commit link,
and captured deploy output. Skips duplicate issues (tracked by
.last_issue_sha). Cron interval changed to */5.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: create log dir with mode 0700 and enforce it on
existing dirs; open log files with mode 0600; chmod state file
to 0600 after every write. Prevents other local processes from
reading agent output (which may contain credential paths) or
tampering with the state file's pid field.
- ci/main.go (TestAndroidFirebase): replace
echo "$FIREBASE_SA_KEY" > /tmp/key.json
with bash process substitution
--key-file=<(echo "$FIREBASE_SA_KEY")
The key is now passed via a file descriptor — it never touches
disk, so it cannot be stranded by a failed gcloud auth call or
snapshotted into the Dagger layer cache.
- ci.yml / deploy.yml: add "Cleanup TLS credentials" step
(if: always()) at the end of every job that calls
setup_dagger_remote.sh. Removes /tmp/dagger-tls,
/tmp/stunnel-dagger.conf, /tmp/stunnel.pid from the self-hosted
runner after each job, so client certs do not accumulate between
job runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get is pure Dart — it never invokes Gradle. The mutable
gradle-cache volume mount caused the same execution-cache instability
we just fixed for the pub cache: Dagger sees a changed volume and
cache-misses pubGetLayer() on every run.
The Gradle cache stays in Base(), which is only used for steps that
actually build Android code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The mutable flutter-pub-cache volume made the execution cache key unstable —
pub get cache-missed every run because the volume's mutable layer changed the
snapshot hash. Removing the volume lets Dagger snapshot packages inside the
execution-cache layer, which is stable and reclaimable via dagger prune.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On some Android versions the path_provider Pigeon channel
('dev.flutter.pigeon.path_provider_android.PathProviderApi.getApplicationSupportPath')
is not ready when initDatabasePath() runs before runApp(). The existing code
already catches PlatformException there, leaving _dbPath null — but the
LazyDatabase callback called getApplicationSupportDirectory() a second time
without any protection, causing an unhandled crash on those devices.
Fix: extract _resolveDatabasePath() which retries three times with back-off
(100 ms → 300 ms → 600 ms) before re-throwing with a descriptive error
message. By the time the database is first accessed (after runApp()), the
channel is almost always available; if it still isn't, the CrashScreen is
shown with a clear explanation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Only pick up issues created by guettli, guettlibot, or guettlibot2
to prevent the loop from acting on external/bot issues.
- Post an explanatory comment on the issue whenever the loop sets
State/Question (agent killed, no CI run, no push detected), so the
reason is visible without digging through cron logs. Closes#158.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Firebase CLI emits "A non-retryable error occurred." even for passing runs.
The grep -qwi 'error' triggered on this message despite gcloud exiting 0
and the result table showing Passed. The gcloud exit code, device-count,
and Passed checks are sufficient to detect real failures.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: agents now create an `issue-N-fix` branch and open a PR;
the loop discovers the PR via `fgj pr list`, tracks its CI run, squash-merges
on green, and falls back to the global-CI path if no PR exists (backward compat).
Adds `_find_pr_for_branch`, `_latest_ci_run_for_branch`, `_merge_pr` helpers.
- .forgejo/workflows/ci.yml: strip to the single fast `check` job only
(removes build-linux, deploy-playstore, publish-website).
- .forgejo/workflows/deploy.yml (new, replaces android-emulator-tests.yml):
scheduled hourly + workflow_dispatch; runs firebase tests, Play Store deploy,
Linux build/deploy, website publish; on completion sets CI/Full-Pass or
CI/Full-Fail label on the repo's DEPLOY_HEALTH_ISSUE tracking issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the agent exits immediately (e.g. rate-limit), the loop was closing the
pending issue against the *previous* CI run, which was still green.
Fix: record the latest CI run ID when an issue agent starts. If the run ID
hasn't changed when the agent exits, the agent pushed nothing → set
State/Question instead of closing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter_markdown 0.7.7+1 has been discontinued in favour of
flutter_markdown_plus. Switch the dependency and update both import
sites.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Capture gcloud auth stderr separately and fail on unexpected output;
ignore the two known informational lines ("Activated service account
credentials for: [...]" and "Updated property [core/project].") while
keeping a strict "fail if unknown stderr" check for anything else.
- Replace the narrow pattern grep (non-retryable error|infrastructure_failure|
test execution failed) with a broad whole-word case-insensitive grep for
'error', so any infrastructure or Firebase error in the output causes CI
failure.
- Verify that the number of device result rows in the result table matches
the expected device count (1), so a silent test-run failure cannot slip
through.
- Add scripts/test_firebase_check.sh with 18 unit tests for the three new
bash patterns (auth stderr filter, error-word detection, device count).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All production secrets (SSH key, Android keystore, Play Store config,
Firebase service account) are already typed as dagger.Secret and injected
via WithMountedSecret / WithSecretVariable. Add a Secrets section to
DAGGER.md to make this explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gradle-cache volume was mounted without an owner, so the root-owned
volume caused "Permission denied" when the ci user tried to create
gradle-8.14-all.zip.lck during bundleRelease. Add Owner: "ci" to all
three WithMountedCache calls so the ci user can write to the caches.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add -qq to apt-get update/install in Dagger toolchain to suppress
verbose package-list output (hundreds of lines on cold cache)
- Wrap sdkmanager in silent-on-success pattern — only shows output
on failure, like the build_runner and flutter pub get steps
- Set debug = warning in stunnel config to suppress LOG5 (info/notice)
startup lines while keeping LOG4 (warning) and above
- Add org.gradle.welcome=never to android/gradle.properties to
suppress the "Welcome to Gradle N.NN!" banner
- Filter SKIPPED Gradle tasks, Gradle Daemon startup messages, and
gcloud support-page promo lines in run_firebase_test.sh
Errors and warnings are preserved in all cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cirruslabs/flutter:3.41.6 image already has UID 1000 assigned to
another user, so `useradd -u 1000` exits with code 4 ("UID not unique")
and the ci user is never created. Dagger then fails to resolve `owner:
"ci"` on subsequent WithDirectory calls. Removing the explicit UID lets
useradd pick the next available one.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Create a non-root user 'ci' (UID 1000) in the Dagger toolchain container,
transfer ownership of the Flutter SDK and Android SDK to that user, and
switch to it with WithUser("ci"). Update all cache mount paths from /root/
to /home/ci/ and set Owner: "ci" on every WithDirectory call so Flutter
can write build output. Flutter emits a strong warning when run as root;
this change eliminates that warning by running the tool as a regular user.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On some Android devices (e.g. Android S1RXS32.50-13-25) the WorkManager
platform channel fails to connect at startup, throwing
PlatformException(channel-error, ...). registerBackgroundSync() now catches
PlatformException and MissingPluginException (plus any other unexpected
failure) and silently disables background sync rather than crashing the app.
Test added: test/unit/background_sync_test.dart verifies the function
completes without throwing in the unit-test environment (where the native
plugin is absent).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>