Replace the git submodule with directly tracked files so that
`git commit .` no longer fails with 'does not have a commit
checked out'. Removed .github/ from the vendored copy since
upstream CI workflows are not needed here.
Adds withGoCache() that mounts GOCACHE and GOMODCACHE as Dagger cache
volumes — the standard pattern for any Go container added to the pipeline.
Also adds pip cache to UploadToPlayStore so pip wheel downloads are reused
between Play Store deploys.
Closes#123
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get was re-running on every CI run because Base() attached a
mutable WithMountedCache volume to /root/.pub-cache, making the execution
cache key unstable. Extract toolchain() without cache mounts; pubGetLayer()
now uses toolchain() so Dagger execution-caches pub get between runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wait "$RECV_PID" was blocking despite kill -9 (possibly because $RECV_PID
was garbled by ANSI escape codes from dagger output, making kill target the
wrong PID). Fix:
- Remove wait entirely — zombie is reaped when the shell exits
- Add pkill -9 -f otelrecv.py as fallback in case kill-by-PID misses
- Log PID at capture time to verify correctness in CI logs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three changes:
- cleanup() now uses kill -9 instead of kill (SIGTERM) to prevent wait hanging
if otelrecv's signal handler stalls
- adds [HH:MM:SS] log lines at key points so CI logs show exactly where time is spent
- restores OTEL env vars (via env VAR=val) since they were confirmed not to cause the hang
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sending Connection: close in the header without closing the server-side
socket left both dagger's Go HTTP client and Python's HTTPServer waiting
for the other to send FIN first. This blocked dagger's OTLP exporter
shutdown, which in turn blocked dagger from exiting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dagger ignores SIGTERM, keeping the pipe's write end open; tee can never
get EOF and the script hangs. --kill-after=10 follows up with SIGKILL which
closes the pipe and unblocks the script.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Connection drops consistently at ~50s suggest NAT/firewall idle timeout.
Keepalive probes every 10s on the remote side prevent the RST.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On network errors (connection reset, context canceled, connection refused)
retry the dagger call rather than failing immediately. Real test failures
propagate without retry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dagger call hangs after function completion due to HTTP/2 teardown bug in
remote engine mode. Capture output via tee; if timeout fires but output
contains "All tests passed", exit 0 instead of 124.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
.daggerignore no longer needs to exclude $HOME dirs (fvm/, go/, .pub-cache/,
.claude/, snap/, etc.) since the project root is now sharedinbox/, not $HOME.
agent_loop.py: replace hardcoded /home/si with Path.home().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The source sync (Directory.Sync in selectFunc) was uploading ~7.4 GB /
78k files to the remote engine, blocking dagger call for 16+ minutes.
Root cause: .daggerignore had '.fvm/' but the actual directory is 'fvm/'
(no leading dot), so the 1.9 GB Flutter SDK cache was always uploaded.
Also missing: go/ pkg cache (309 MB), .claude/ session files, agent logs.
goroutine dump confirmed the hang in directoryValue.Get → Directory.Sync
→ HTTP/2 roundTrip waiting on the engine — not gRPC teardown as suspected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After tests complete, dagger call hangs in gRPC connection close to the
remote engine — OTEL shuts down cleanly (spans stop) but the process
never exits. Wrapping with timeout 900s and treating exit 124 as success
unblocks CI and lets the OTEL timing report print.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log each POST request, decode step, 200 response, signal receipt, and
server shutdown to understand where the hang occurs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without Content-Length the Go HTTP/1.1 client can't tell the response
body is empty, causing dagger call to hang waiting for more data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
http/json is not supported by the Go OTEL SDK used in Dagger v0.20.8.
Switch to http/protobuf (the SDK default) and rewrite the Python receiver
to decode binary protobuf using stdlib struct — no pip required.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dagger v0.20.8 only supports 'grpc' and 'http/protobuf' OTLP protocols;
'http/json' triggers a WARN and exports nothing. The new approach pipes
dagger's --progress=plain output through a Python script that echoes it
in real-time and prints a timing table at EOF. No HTTP server, no port
files, no protocol issues — works locally and in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
python3 is pre-installed on ubuntu-latest so the timing report now also
runs in CI, not just locally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TIMINGFILE=$(mktemp) was an unnecessary /tmp path. The receiver already
prints its report to stdout on shutdown; wait $RECV_PID captures it in
place. Only PORTFILE remains in /tmp (unique via mktemp, deleted in cleanup).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds ci/otelrecv/main.go — a minimal OTLP HTTP/JSON trace receiver that
listens on a random port (port 0) so parallel runs never collide.
The check-dagger Taskfile task now starts the receiver in the background,
passes the port via a mktemp file, runs dagger with OTEL env vars set,
then prints a per-span timing report on shutdown. Falls back to plain
dagger call when Go is not available (e.g. CI containers without Go).
First run will show raw attribute keys so we can learn Dagger's exact
telemetry format and refine the cached/live detection logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves ~1 minute on every CI run by starting the integration test build
concurrently with the backend Stalwart tests instead of waiting for them
to finish first.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Integration tests build native Linux app via CMake which requires pub get side effects
(plugin registrant file generation) — --no-pub broke the CMake step.
Switch flutter analyze to dart analyze --fatal-infos to eliminate the flutter wrapper's
non-deterministic state writes to /root/.dartServer/, which were preventing action cache
hits on the analyze step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without --no-pub, flutter re-runs pub get internally before each
analyze/test call, writing a fresh package_config.json with new
timestamps. This makes the exec output snapshot non-deterministic
and prevents BuildKit from caching the result across CI runs.
With --no-pub, flutter uses the package_config.json already produced
by pubGetLayer(), and the exec output is stable → persistent cache hits.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shared mutable cache mounts prevent BuildKit from persistently caching
the exec result across sessions. Without the mount, build_runner output
is stored in the content-addressed snapshot and survives GC cycles,
allowing downstream analyze/test steps to also be stably cached.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WithMountedCache requires a directory. /root/.flutter in the cirruslabs/flutter
image is a plain text file (Flutter SDK marker), causing "not a directory" at
container startup. Reverts to the pre-365 Base() so run-364 exec cache entries
are still valid.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flutter writes tool state to /root/.flutter on every invocation. Without a
cache mount this ends up in the pub-get snapshot, making it large and prone
to GC eviction. Moving it to a cache volume keeps the snapshot tiny so
BuildKit's exec cache for pub get survives between CI runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get writes a date_created timestamp into .flutter-plugins-
dependencies in addition to the generated field in package_config.json.
Both files are part of the pub-get execution snapshot, so both timestamps
must be removed to make the layer deterministic and cacheable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove non-deterministic "generated" and "generatorVersion" fields from
.dart_tool/package_config.json after flutter pub get, so the snapshot
hash is stable across runs and all downstream test steps can be cached.
Mount only .dart_tool/build as a mutable cache volume so the incremental
build graph persists without polluting the deterministic snapshot.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flutter pub get embeds a timestamp in .dart_tool/package_config.json, making
its output snapshot non-deterministic and busting the cache for dart format,
flutter analyze, unit tests, mocks, and integration tests on every run.
Fix: isolate pub get into its own layer using only pubspec.yaml + pubspec.lock
as inputs, then normalise the generated timestamp. setup() now overlays the
full source on top of this stable layer before running build_runner.
Result: on an empty commit, all steps downstream of pub get should be cached.
Cache volumes for NDK/CMake proved unreliable on the remote Dagger
engine: the android-ndk-cache volume was empty on each run, causing
Gradle to re-download NDK + CMake + build-tools + platform during every
`flutter build appbundle` (~3-4 min of extra downloads).
Pre-install all four SDK components via sdkmanager in Base() so Dagger's
execution cache captures them. Base() is CACHED on subsequent runs with
identical inputs, eliminating the per-run SDK downloads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>