Replace local `task publish-website` invocation with `fgj actions workflow run website.yml`
so the deploy runs in CI rather than on the local machine. Remove failure-tracking state
files and issue-creation logic — Forgejo Actions handles its own reporting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues from #179:
- crash_screen.dart now reads GIT_HASH compile-time constant and includes
'Git Commit: <hash>' in both the on-screen UI and the copied report, so
crash reports always show the exact build that crashed.
- _resolveDatabasePath() retry delays extended from [100, 300, 600] ms
(total ~1 s, 4 attempts) to [200, 500, 1000, 2000, 4000] ms (total
~7.7 s, 6 attempts) to handle slow/non-standard Android devices where
the path_provider Pigeon channel takes several seconds to become ready.
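The retry schedule can be sketched in a language-neutral way as follows. This is a minimal Python illustration, not the Dart implementation: `call_with_retries` and its `sleep` parameter are hypothetical names, and the real code catches PlatformException rather than every exception.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Delay schedule quoted in the commit message, in milliseconds.
RETRY_DELAYS_MS = (200, 500, 1000, 2000, 4000)


def call_with_retries(
    fn: Callable[[], T],
    delays_ms: Sequence[int] = RETRY_DELAYS_MS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then once more after each delay; re-raise the last error."""
    last_error = None
    for attempt in range(len(delays_ms) + 1):
        try:
            return fn()
        except Exception as exc:  # assumption: the Dart code is narrower than this
            last_error = exc
            if attempt < len(delays_ms):
                sleep(delays_ms[attempt] / 1000)
    raise last_error
```

Five delays give six attempts in total, and the delays sum to 7.7 s, which matches the figures above.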
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Forgejo workflow_runs API has no head_branch field. For pull_request
events the branch lives in event_payload["pull_request"]["head"]["ref"];
for push events it is in prettyref. The old code used run.get("head_branch")
which always returned None, causing _latest_ci_run_for_branch to never find
the run and the loop to declare "no CI run after 15 min" and set the issue to
State/Question — even when CI had already passed.
Also fixes a pre-existing test mock that was missing the session_name kwarg.
Adds TestLatestCiRunForBranch covering both event types and the regression.
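A minimal sketch of the branch lookup described above, assuming each run is already a parsed dict and that event_payload is a dict as well. The helper name `branch_of_run` is hypothetical and is not the function in agent_loop.py.

```python
from typing import Any, Optional


def branch_of_run(run: dict[str, Any]) -> Optional[str]:
    """Return the branch a workflow run belongs to, or None if unknown."""
    payload = run.get("event_payload") or {}
    pull_request = payload.get("pull_request")
    if pull_request:
        # pull_request events: the branch is the PR head ref.
        return pull_request["head"]["ref"]
    # push events: the branch is carried in prettyref.
    return run.get("prettyref")
```

A run that only carries a head_branch key yields None here, which mirrors the regression: that field is simply never consulted.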
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
env:SSH_PRIVATE_KEY passes the key through shell $() which strips the
trailing newline, causing dagger to write a truncated key that OpenSSH
rejects with "error in libcrypto". Using file: reads it directly from
disk, preserving exact content.
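The newline stripping can be reproduced outside Dagger. The sketch below is only a demonstration: it round-trips a placeholder string through a POSIX shell command substitution and assumes `sh` is on PATH. `through_command_substitution` is a hypothetical helper.

```python
import subprocess

# Placeholder text shaped like a key file; not a real key.
KEY_TEXT = "-----BEGIN KEY-----\nabc\n-----END KEY-----\n"


def through_command_substitution(text: str) -> str:
    """Round-trip text through $(...) in sh and return what survives."""
    script = 'value=$(cat); printf "%s" "$value"'
    result = subprocess.run(
        ["sh", "-c", script],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The returned text equals the input with its trailing newline removed, while interior newlines are kept.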
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Dagger container running generate_build_history.py may not always
reach the deployment server (network constraints on the Dagger engine).
Rather than aborting the entire publish-website pipeline, log the SSH
verbose output (already added in the previous debug commit) and return
an empty file list so Hugo still builds and rsync still deploys the
site — just without updated build-history pages.
This unblocks the cron deploy that has been failing since c259d2da.
Temporary: print verbose SSH output on failure to identify why the
connection fails from inside the dagger container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger mounts the secret file with 0600 but the parent directory may
get created with world-readable permissions, causing SSH to refuse
the key with exit 255.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All other ssh/scp calls in the dagger module use explicit -i /root/.ssh/id_ed25519.
This one was missing it, causing exit 255 inside the dagger container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger parses .env directly and fails on multiline quoted values.
Move SSH_PRIVATE_KEY out of .env and export it from ~/.ssh/id_ed25519
in the wrapper instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tracks consecutive failure count in .fail_count. On the 5th failure
for the same SHA, creates a Prio/High + State/Ready Codeberg issue.
Before creating, checks local .last_issue_sha and queries Codeberg
open issues to avoid duplicates.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the last deploy failed and origin/main has not advanced, opens a
Prio/High + State/Ready issue via tea with the failing SHA, commit link,
and captured deploy output. Skips duplicate issues (tracked by
.last_issue_sha). Cron interval changed to */5.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: create log dir with mode 0700 and enforce it on
existing dirs; open log files with mode 0600; chmod state file
to 0600 after every write. Prevents other local processes from
reading agent output (which may contain credential paths) or
tampering with the state file's pid field.
- ci/main.go (TestAndroidFirebase): replace
echo "$FIREBASE_SA_KEY" > /tmp/key.json
with bash process substitution
--key-file=<(echo "$FIREBASE_SA_KEY")
The key is now passed via a file descriptor — it never touches
disk, so it cannot be stranded by a failed gcloud auth call or
snapshotted into the Dagger layer cache.
- ci.yml / deploy.yml: add "Cleanup TLS credentials" step
(if: always()) at the end of every job that calls
setup_dagger_remote.sh. Removes /tmp/dagger-tls,
/tmp/stunnel-dagger.conf, /tmp/stunnel.pid from the self-hosted
runner after each job, so client certs do not accumulate between
job runs.
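The permission handling in the agent_loop.py item could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical and the actual code may be structured differently.

```python
import os
from pathlib import Path


def ensure_private_dir(path: Path) -> None:
    """Create the directory with 0700 and enforce 0700 if it already exists."""
    path.mkdir(mode=0o700, parents=True, exist_ok=True)
    # mkdir's mode is masked by umask and ignored for existing directories.
    os.chmod(path, 0o700)


def open_private_log(path: Path):
    """Open a log file for appending, creating it with mode 0600."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    return os.fdopen(fd, "a", encoding="utf-8")


def write_state(path: Path, text: str) -> None:
    """Write the state file, then restrict it to the owner."""
    path.write_text(text, encoding="utf-8")
    os.chmod(path, 0o600)
```

Passing the mode to os.open means a newly created log file is never readable by others, even briefly.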
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get is pure Dart — it never invokes Gradle. The mutable
gradle-cache volume mount caused the same execution-cache instability
we just fixed for the pub cache: Dagger sees a changed volume and
cache-misses pubGetLayer() on every run.
The Gradle cache stays in Base(), which is only used for steps that
actually build Android code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The mutable flutter-pub-cache volume made the execution cache key unstable —
pub get cache-missed every run because the volume's mutable layer changed the
snapshot hash. Removing the volume lets Dagger snapshot packages inside the
execution-cache layer, which is stable and reclaimable via dagger prune.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On some Android versions the path_provider Pigeon channel
('dev.flutter.pigeon.path_provider_android.PathProviderApi.getApplicationSupportPath')
is not ready when initDatabasePath() runs before runApp(). The existing code
already catches PlatformException there, leaving _dbPath null — but the
LazyDatabase callback called getApplicationSupportDirectory() a second time
without any protection, causing an unhandled crash on those devices.
Fix: extract _resolveDatabasePath() which retries three times with back-off
(100 ms → 300 ms → 600 ms) before re-throwing with a descriptive error
message. By the time the database is first accessed (after runApp()), the
channel is almost always available; if it still isn't, the CrashScreen is
shown with a clear explanation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Only pick up issues created by guettli, guettlibot, or guettlibot2
to prevent the loop from acting on external/bot issues.
- Post an explanatory comment on the issue whenever the loop sets
State/Question (agent killed, no CI run, no push detected), so the
reason is visible without digging through cron logs. Closes #158.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Firebase CLI emits "A non-retryable error occurred." even for passing runs.
The grep -qwi 'error' triggered on this message despite gcloud exiting 0
and the result table showing Passed. The gcloud exit code, device-count,
and Passed checks are sufficient to detect real failures.
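To illustrate why the removed check misfired, here is a rough Python equivalent of grep -qwi 'error'. The regex is an approximation of grep's whole-word matching, and `old_check_flags_error` is a hypothetical name.

```python
import re

# Approximates grep -w -i 'error': "error" as a whole word, any case.
ERROR_WORD = re.compile(r"\berror\b", re.IGNORECASE)


def old_check_flags_error(output: str) -> bool:
    """Return True if the removed check would have failed the run."""
    return ERROR_WORD.search(output) is not None
```

The message quoted above contains "error" as a standalone word, so the check fires on it regardless of the gcloud exit code.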
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- agent_loop.py: agents now create an `issue-N-fix` branch and open a PR;
the loop discovers the PR via `fgj pr list`, tracks its CI run, squash-merges
on green, and falls back to the global-CI path if no PR exists (backward compat).
Adds `_find_pr_for_branch`, `_latest_ci_run_for_branch`, `_merge_pr` helpers.
- .forgejo/workflows/ci.yml: strip to the single fast `check` job only
(removes build-linux, deploy-playstore, publish-website).
- .forgejo/workflows/deploy.yml (new, replaces android-emulator-tests.yml):
scheduled hourly + workflow_dispatch; runs firebase tests, Play Store deploy,
Linux build/deploy, website publish; on completion sets CI/Full-Pass or
CI/Full-Fail label on the repo's DEPLOY_HEALTH_ISSUE tracking issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the agent exits immediately (e.g. rate-limit), the loop was closing the
pending issue against the *previous* CI run, which was still green.
Fix: record the latest CI run ID when an issue agent starts. If the run ID
hasn't changed when the agent exits, the agent pushed nothing → set
State/Question instead of closing.
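The comparison can be sketched as a small pure function. The name and the "wait-for-ci" return value are hypothetical placeholders for whatever the loop does next.

```python
from typing import Optional


def outcome_after_agent_exit(
    run_id_at_start: Optional[int],
    latest_run_id: Optional[int],
) -> str:
    """Decide what to do with the pending issue once the agent has exited."""
    if latest_run_id == run_id_at_start:
        # No new CI run appeared, so the agent pushed nothing.
        return "State/Question"
    # A newer run exists; its result decides whether the issue is closed.
    return "wait-for-ci"
```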
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter_markdown 0.7.7+1 has been discontinued in favour of
flutter_markdown_plus. Switch the dependency and update both import
sites.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Capture gcloud auth stderr separately and fail on unexpected output;
ignore the two known informational lines ("Activated service account
credentials for: [...]" and "Updated property [core/project].") while
keeping a strict "fail if unknown stderr" check for anything else.
- Replace the narrow pattern grep (non-retryable error|infrastructure_failure|
test execution failed) with a broad whole-word case-insensitive grep for
'error', so any infrastructure or Firebase error in the output causes CI
failure.
- Verify that the number of device result rows in the result table matches
the expected device count (1), so a silent test-run failure cannot slip
through.
- Add scripts/test_firebase_check.sh with 18 unit tests for the three new
bash patterns (auth stderr filter, error-word detection, device count).
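The auth stderr filter is implemented in bash. The Python sketch below only illustrates the idea of allowing the two known informational lines and flagging everything else. The exact regular expressions are assumptions derived from the two messages quoted above.

```python
import re

# The two informational lines named in the commit message.
KNOWN_INFO_LINES = (
    re.compile(r"^Activated service account credentials for: \[.*\]$"),
    re.compile(r"^Updated property \[core/project\]\.$"),
)


def unexpected_stderr_lines(stderr: str) -> list[str]:
    """Return the stderr lines that are not known informational output."""
    unexpected = []
    for line in stderr.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(pattern.match(stripped) for pattern in KNOWN_INFO_LINES):
            continue
        unexpected.append(stripped)
    return unexpected
```

A non-empty result would correspond to the "fail if unknown stderr" case.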
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All production secrets (SSH key, Android keystore, Play Store config,
Firebase service account) are already typed as dagger.Secret and injected
via WithMountedSecret / WithSecretVariable. Add a Secrets section to
DAGGER.md to make this explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gradle-cache volume was mounted without an owner, so the root-owned
volume caused "Permission denied" when the ci user tried to create
gradle-8.14-all.zip.lck during bundleRelease. Add Owner: "ci" to all
three WithMountedCache calls so the ci user can write to the caches.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add -qq to apt-get update/install in Dagger toolchain to suppress
verbose package-list output (hundreds of lines on cold cache)
- Wrap sdkmanager in silent-on-success pattern — only shows output
on failure, like the build_runner and flutter pub get steps
- Set debug = warning in stunnel config to suppress LOG5 (info/notice)
startup lines while keeping LOG4 (warning) and above
- Add org.gradle.welcome=never to android/gradle.properties to
suppress the "Welcome to Gradle N.NN!" banner
- Filter SKIPPED Gradle tasks, Gradle Daemon startup messages, and
gcloud support-page promo lines in run_firebase_test.sh
Errors and warnings are preserved in all cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cirruslabs/flutter:3.41.6 image already has UID 1000 assigned to
another user, so `useradd -u 1000` exits with code 4 ("UID not unique")
and the ci user is never created. Dagger then fails to resolve `owner:
"ci"` on subsequent WithDirectory calls. Removing the explicit UID lets
useradd pick the next available one.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Create a non-root user 'ci' (UID 1000) in the Dagger toolchain container,
transfer ownership of the Flutter SDK and Android SDK to that user, and
switch to it with WithUser("ci"). Update all cache mount paths from /root/
to /home/ci/ and set Owner: "ci" on every WithDirectory call so Flutter
can write build output. Flutter emits a strong warning when run as root;
this change eliminates that warning by running the tool as a regular user.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On some Android devices (e.g. Android S1RXS32.50-13-25) the WorkManager
platform channel fails to connect at startup, throwing
PlatformException(channel-error, ...). registerBackgroundSync() now catches
PlatformException and MissingPluginException (plus any other unexpected
failure) and silently disables background sync rather than crashing the app.
Test added: test/unit/background_sync_test.dart verifies the function
completes without throwing in the unit-test environment (where the native
plugin is absent).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`tea api` exits 0 even on 401 responses, so `_close_issue` and
`_set_labels` appeared to succeed but did nothing. Issues were never
actually closed, causing them to be picked up again every cron tick.
Switch all write operations (close issue, set labels) and issue-list
reads to `fgj`, which has proper authentication. Keep `tea api` only
for CI run fetches where `fgj` times out (504). Add ~/go/bin to the
cron PATH so fgj is found.
Also add an error check in `_tea_get` for API-level error responses,
and strip State/InProgress when closing an issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes:
1. notification_service.dart: initNotifications() now catches
MissingPluginException (and any other init failure) so the app no
longer crashes when flutter_local_notifications is unavailable on
some Android devices. _initialized tracks success; showNewMailNotification
skips the plugin call when it never initialised.
2. crash_screen.dart: "Report Issue on Codeberg" no longer puts the full
report in the URL query string. Long stack traces exceeded browser
URL-length limits and caused "create issue failed". The URL now
carries only the pre-filled title; the user copies the full report
via "Copy to Clipboard" and pastes it in the issue body.
Tests added:
- test/unit/notification_service_test.dart: verifies initNotifications()
completes without throwing when the plugin channel is unavailable.
- test/widget/crash_screen_test.dart: verifies the Codeberg URL contains
the title but no &body= parameter.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously build_runner compiled separately for each setup() variant
(checkSrc, backendSrc, integrationSrc, etc.) since their differing
source inputs produced distinct Dagger cache keys. CheckMocks also ran
build_runner twice: once inside setup() and again explicitly — and the
second run always compared two freshly-generated outputs, so stale mocks
in the repo were never detected.
Introduce codegenBase() that runs build_runner on the minimal common
source (lib/, test/, assets/, pubspec.*) excluding committed generated
files. All setup() calls now share this single Dagger cache entry, so
build_runner compiles only once per pipeline run instead of once per
source variant.
Fix CheckMocks to start from pubGetLayer() + committed source (including
any stale *.mocks.dart), commit that state as the git baseline, then run
build_runner once. The subsequent git diff now correctly detects stale
mocks in the repository, matching the behaviour of check_mocks_fresh.sh.
Also update Graph() to reflect the new codegenBase node.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously issue agents were instructed to close the issue via prompt text
immediately after pushing. If CI then failed, the issue was already closed.
Now the loop tracks a pending_issue across cron ticks:
- When an agent finishes (issue or ci-fix), the issue number is extracted
from state before it is cleared.
- If CI is still running, a "pending-ci" state preserves the issue number.
- If CI fails, the ci-fix agent is started with the issue number in state
so it survives the fix cycle.
- Once CI passes, _close_issue() is called from Python — never by the agent.
The agent prompt no longer instructs the agent to close the issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `---------------------- Starting YYYY-MM-DD HH:MMZ` header at each run
- Remove `[agent_loop]` prefix from all output lines
- Show full Codeberg URL for CI runs instead of bare run ID
- Show full issue URL and title when referencing issues
- Store issue_title in state file so "still running" messages include the title
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
State/Ready → State/InProgress is already set by agent_loop.py before
the agent starts. Update AGENTS.md to reflect that agents invoked via
the loop must not set InProgress themselves (only manual workflows need
to). Also fix TestMain tests that called main() directly, which caused
argparse to consume sys.argv; they now call _run_loop() instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, `gcloud firebase test android run` could exit 0 while printing
"A non-retryable error occurred." in its output. The old check
`&& echo "$out" || { exit 1; }` only caught non-zero exit codes, and the
success grep `'Passed|passed|test cases'` was too broad — "test cases" can
appear in Firebase output before the error, giving a false positive.
The fix captures gcloud's exit code explicitly via `rc=$?`, adds an explicit
error-string check for known Firebase failure phrases (non-retryable error,
infrastructure_failure, test execution failed), and tightens the success
pattern to `'Passed|passed'` only.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs caused the crash-at-startup report:
1. CrashScreen used the widget's build context (above its own MaterialApp)
for ScaffoldMessenger.of() in button callbacks. When the screen is the
root widget — the runApp() path after a startup crash — there is no
ScaffoldMessenger above it, so both 'Copy to Clipboard' and 'Report Issue
on Codeberg' crashed with a null check error. Fix: wrap Scaffold.body in
Builder to obtain a context that is a descendant of the Scaffold.
2. path_provider_android 2.2.21 updated to Pigeon 26, which causes a
channel-error on startup for some Android devices. Pin to <2.2.21
(resolves to 2.2.20, which uses the stable pre-Pigeon-26 implementation).
Additionally, make initDatabasePath() catch PlatformException so a
channel error at the very start of main() no longer hard-crashes the app;
_openConnection()'s lazy fallback retries after runApp() completes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace curl-based install of dagger/task with a hard check that
fails immediately if any tool is missing from the runner image,
pointing to .forgejo/Dockerfile as the fix location.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Based on ghcr.io/catthehacker/ubuntu:go-24.04 with stunnel4,
netcat-openbsd, dagger v0.20.8 and task v3.48.0 baked in so
nothing is downloaded during CI runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gcloud exits 0 even when no tests ran. Add a post-check that greps
the output for 'Passed/passed/test cases' and fails explicitly if
none are found, so 'no test case results' turns the CI red.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously setUpAll() fell back to 127.0.0.1 defaults when env vars
were absent, causing Firebase Test Lab to report '0 test case results'
instead of a clear failure. Now it calls fail() immediately with the
list of missing variables.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace apt-get install with a hard check — if the packages are missing
the job fails immediately with a clear error. Avoids flaky failures when
archive.ubuntu.com is unreachable.
Install once on the runner: sudo apt-get install -y stunnel4 netcat-openbsd
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pubspec.lock was incorrectly gitignored — this is a Flutter app, not a
package, so the lockfile should be committed for reproducible builds.
Without it, CI resolved drift to its minimum (2.20.3), which constrains
sqlite3 to 2.x, while the local environment used sqlite3 3.3.1. As a
result, dart analyze disagreed between CI and local runs on whether
Database.close() exists.
Also pins sqlite3: ^3.1.5 explicitly in pubspec.yaml as a
belt-and-suspenders measure, so the constraint is visible without
reading the lockfile.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The default Firebase Test Lab bucket is in a Google-managed project so
project-level IAM grants have no effect on it. Use sharedinbox-ftl-results
which is in sharedinbox-496103 where the service account has storage.admin.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WithEnvVariable(CACHE_BUSTER, time.Now()) ensures gcloud firebase test
always runs fresh rather than returning a cached result from a prior run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add scripts/run_firebase_test.sh that strips ANSI codes and removes
UP-TO-DATE task lines, libsqlite warnings, Gradle deprecation notices
and other high-volume noise before it hits the CI log.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'Pixel6' is not a valid Firebase Test Lab model ID.
'oriole' is the correct internal codename for Pixel 6.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The exact output path varies by AGP version. Use find to locate the
test APK and copy it to a known location.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements issue #132. Builds debug app APK + androidTest APK via Dagger,
then runs them on Firebase Test Lab using the FIREBASE_TEST_LAB_SERVICE_ACCOUNT_KEY
secret and FIREBASE_PROJECT_ID variable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GET /shutdown endpoint on otel-receiver.py is the one clean shutdown
path. cleanup() only needs to remove temp files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rename ci/otelrecv.py to ci/otel-receiver.py for readability.
Replace SIGTERM+wait shutdown (which could hang indefinitely) with an
HTTP-based approach: add GET /shutdown to otel-receiver.py that calls
self.server.shutdown() directly. After dagger call returns, curl that
endpoint so the receiver prints its timing report and exits cleanly.
Cleanup is reduced to a SIGKILL fallback, which is harmless if the
process is already gone.
Also fix the do_GET handler to reference self.server instead of the
local variable server, which was inaccessible from the handler class.
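A minimal sketch of an HTTP shutdown endpoint, assuming a plain single-threaded http.server.HTTPServer. `ReceiverHandler` and `start_receiver` are hypothetical names, and the timing report is omitted. Unlike the direct call described above, this sketch hands shutdown() to a helper thread, because shutdown() blocks until serve_forever() returns and would deadlock if called on the thread that is serving the request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ReceiverHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/shutdown":
            self.send_error(404)
            return
        self.send_response(200)
        self.end_headers()
        # self.server is the HTTPServer instance that owns this handler.
        threading.Thread(target=self.server.shutdown, daemon=True).start()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_receiver() -> tuple[HTTPServer, threading.Thread]:
    """Start the server on a free port and return it with its serving thread."""
    server = HTTPServer(("127.0.0.1", 0), ReceiverHandler)
    thread = threading.Thread(target=server.serve_forever)
    thread.start()
    return server, thread
```

After a GET to /shutdown the serving thread returns from serve_forever(), so the caller can join it instead of sending signals.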
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Filter flutter pub get package-listing lines (^[+~><] ) in pubGetLayer
- Filter build_runner compilation-progress lines (^\[) in setup() and CheckMocks()
- Add -q to git commit in CheckMocks to suppress "460 files changed" stats
- Wrap flutter test in Coverage, TestBackend, TestIntegration, TestSyncReliability
to show only the summary line on success and full output on failure
- Apply same build_runner filter to scripts/check_mocks_fresh.sh for local runs
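The two line filters can be illustrated with the patterns quoted above. The pipeline itself filters in shell, so this Python version is only a sketch, and `drop_noise` is a hypothetical helper.

```python
import re
from typing import Iterable

# Patterns taken from the commit message.
PUB_GET_NOISE = re.compile(r"^[+~><] ")  # flutter pub get package listing
BUILD_RUNNER_NOISE = re.compile(r"^\[")  # build_runner progress lines


def drop_noise(lines: Iterable[str], pattern: re.Pattern) -> list[str]:
    """Keep only the lines that do not match the noise pattern."""
    return [line for line in lines if not pattern.search(line)]
```

Lines that merely contain one of these characters later in the line are kept, because both patterns are anchored at the start of the line.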
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If any step hangs (stuck service, deadlocked test, network stall), the
pipeline will now cancel itself after 30 min rather than blocking the
runner indefinitely.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove per-request debug logs from otelrecv.py (POST, decoding,
decoded, 200 sent, signal) that were added to diagnose the CI hang,
which has since been resolved.
Remove verbose [HH:MM:SS] timestamp messages from check-dagger
(start, pipeline done, otelrecv started/ready, final RC, cleanup
start/done) for the same reason.
Fix cleanup to send SIGTERM + wait instead of SIGKILL so the OTEL
timing report is actually printed at the end of each CI run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a Ci.Graph() Dagger function that emits a Mermaid flowchart showing
both the Dagger Check pipeline (toolchain → pubGetLayer → parallel steps)
and the Codeberg CI job dependencies (check → build-linux / deploy-playstore
→ publish-website).
Usage: dagger call -m ci --source=. graph
task ci-graph
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the git submodule with directly tracked files so that
`git commit .` no longer fails with 'does not have a commit
checked out'. Removed .github/ from the vendored copy since
upstream CI workflows are not needed here.
Adds withGoCache() that mounts GOCACHE and GOMODCACHE as Dagger cache
volumes — the standard pattern for any Go container added to the pipeline.
Also adds pip cache to UploadToPlayStore so pip wheel downloads are reused
between Play Store deploys.
Closes #123
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get was re-running on every CI run because Base() attached a
mutable WithMountedCache volume to /root/.pub-cache, making the execution
cache key unstable. Extract toolchain() without cache mounts; pubGetLayer()
now uses toolchain() so Dagger execution-caches pub get between runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wait "$RECV_PID" was blocking despite kill -9 (possibly because $RECV_PID
was garbled by ANSI escape codes from dagger output, making kill target the
wrong PID). Fix:
- Remove wait entirely — zombie is reaped when the shell exits
- Add pkill -9 -f otelrecv.py as fallback in case kill-by-PID misses
- Log PID at capture time to verify correctness in CI logs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three changes:
- cleanup() now uses kill -9 instead of kill (SIGTERM) to prevent wait hanging
if otelrecv's signal handler stalls
- adds [HH:MM:SS] log lines at key points so CI logs show exactly where time is spent
- restores OTEL env vars (via env VAR=val) since they were confirmed not to cause the hang
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sending Connection: close in the header without closing the server-side
socket left both dagger's Go HTTP client and Python's HTTPServer waiting
for the other to send FIN first. This blocked dagger's OTLP exporter
shutdown, which in turn blocked dagger from exiting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dagger ignores SIGTERM, keeping the pipe's write end open; tee can never
get EOF and the script hangs. --kill-after=10 follows up with SIGKILL which
closes the pipe and unblocks the script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Connection drops consistently at ~50s suggest NAT/firewall idle timeout.
Keepalive probes every 10s on the remote side prevent the RST.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On network errors (connection reset, context canceled, connection refused)
retry the dagger call rather than failing immediately. Real test failures
propagate without retry.
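The retry policy can be sketched as below. The three markers come from the message above; the attempt limit, the function names, and the (exit code, output) callable are assumptions made for illustration.

```python
from typing import Callable, Tuple

# Substrings treated as network errors, as listed in the commit message.
NETWORK_ERROR_MARKERS = ("connection reset", "context canceled", "connection refused")


def is_network_error(output: str) -> bool:
    lowered = output.lower()
    return any(marker in lowered for marker in NETWORK_ERROR_MARKERS)


def run_with_network_retry(
    run: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,  # hypothetical limit
) -> Tuple[int, str]:
    """Re-run only when a failure looks like a network error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        rc, output = run()
        if rc == 0 or not is_network_error(output) or attempt == max_attempts:
            return rc, output
    raise AssertionError("unreachable")
```

A failing test run without any of the markers returns after the first attempt, so real failures are not masked by retries.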
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dagger call hangs after function completion due to HTTP/2 teardown bug in
remote engine mode. Capture output via tee; if timeout fires but output
contains "All tests passed", exit 0 instead of 124.
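The exit-code mapping amounts to a small decision, sketched here in Python for clarity. The wrapper itself is a shell script, and `effective_exit_code` is a hypothetical name.

```python
TIMEOUT_EXIT_CODE = 124  # exit status of coreutils `timeout` when the limit fires
SUCCESS_MARKER = "All tests passed"


def effective_exit_code(rc: int, output: str) -> int:
    """Treat a timeout as success only if the tests had already passed."""
    if rc == TIMEOUT_EXIT_CODE and SUCCESS_MARKER in output:
        return 0
    return rc
```

A timeout without the success marker still surfaces as 124, so a genuine hang is not reported as a pass.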
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
.daggerignore no longer needs to exclude $HOME dirs (fvm/, go/, .pub-cache/,
.claude/, snap/, etc.) since the project root is now sharedinbox/, not $HOME.
agent_loop.py: replace hardcoded /home/si with Path.home().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The source sync (Directory.Sync in selectFunc) was uploading ~7.4 GB /
78k files to the remote engine, blocking dagger call for 16+ minutes.
Root cause: .daggerignore had '.fvm/' but the actual directory is 'fvm/'
(no leading dot), so the 1.9 GB Flutter SDK cache was always uploaded.
Also missing: go/ pkg cache (309 MB), .claude/ session files, agent logs.
goroutine dump confirmed the hang in directoryValue.Get → Directory.Sync
→ HTTP/2 roundTrip waiting on the engine — not gRPC teardown as suspected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After tests complete, dagger call hangs in gRPC connection close to the
remote engine — OTEL shuts down cleanly (spans stop) but the process
never exits. Wrapping with timeout 900s and treating exit 124 as success
unblocks CI and lets the OTEL timing report print.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log each POST request, decode step, 200 response, signal receipt, and
server shutdown to understand where the hang occurs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without Content-Length the Go HTTP/1.1 client can't tell the response
body is empty, causing dagger call to hang waiting for more data.
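A minimal sketch of a handler that sends an explicit Content-Length on an empty response, assuming a receiver built on http.server. `TraceHandler` is a hypothetical name, and the request body is read and discarded rather than decoded.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class TraceHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive, so framing must be explicit

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # drain the request body
        self.send_response(200)
        # Tells the client the body is empty, so it does not wait for more data.
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet
```

With the header present, a client can reuse the same connection for the next export because it knows where the response ends.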
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
http/json is not supported by the Go OTEL SDK used in Dagger v0.20.8.
Switch to http/protobuf (the SDK default) and rewrite the Python receiver
to decode binary protobuf using stdlib struct — no pip required.
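Decoding protobuf without a library comes down to walking the wire format. The sketch below shows a generic field iterator that uses only the standard library. It is not the receiver's code, the function names are hypothetical, and it makes no claims about which OTLP field numbers carry which data.

```python
import struct
from typing import Iterator, Tuple, Union

Value = Union[int, bytes]


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(buf: bytes) -> Iterator[Tuple[int, int, Value]]:
    """Yield (field number, wire type, raw value) for one message level."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed64, little-endian
            (value,) = struct.unpack_from("<Q", buf, pos)
            pos += 8
        elif wire_type == 2:  # length-delimited: bytes, string, or nested message
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 5:  # fixed32, little-endian
            (value,) = struct.unpack_from("<I", buf, pos)
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value
```

Nested messages arrive as length-delimited bytes, so calling iter_fields again on such a value descends one level.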
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger v0.20.8 only supports 'grpc' and 'http/protobuf' OTLP protocols;
'http/json' triggers a WARN and exports nothing. The new approach pipes
dagger's --progress=plain output through a Python script that echoes it
in real-time and prints a timing table at EOF. No HTTP server, no port
files, no protocol issues — works locally and in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
python3 is pre-installed on ubuntu-latest so the timing report now also
runs in CI, not just locally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TIMINGFILE=$(mktemp) was an unnecessary /tmp path. The receiver already
prints its report to stdout on shutdown; wait $RECV_PID captures it in
place. Only PORTFILE remains in /tmp (unique via mktemp, deleted in cleanup).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds ci/otelrecv/main.go — a minimal OTLP HTTP/JSON trace receiver that
listens on a random port (port 0) so parallel runs never collide.
The check-dagger Taskfile task now starts the receiver in the background,
passes the port via a mktemp file, runs dagger with OTEL env vars set,
then prints a per-span timing report on shutdown. Falls back to plain
dagger call when Go is not available (e.g. CI containers without Go).
First run will show raw attribute keys so we can learn Dagger's exact
telemetry format and refine the cached/live detection logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>