While debugging a flaky CI failure (ENOSPC: no space left on device), I instrumented our workflow to sample disk usage after every test. The result was surprising, and I'd like to ask whether this is known runner-agent behavior or something to file with the image team instead.
Symptoms
df / reports up to 12–15 GB used during the test run that cannot be accounted for by any standard tool:
- The sum of du -shx over every top-level directory on / doesn't grow.
- lsof +L1 (deleted-but-open files) only ever shows kernel /memfd:* entries (in-RAM tmpfs, not on disk).
- cat /proc/<pid>/maps | grep "(deleted)" across all PIDs only shows the same kernel memfd entries.
- write_bytes from cat /proc/<pid>/io for all processes (npm, node, our browser subprocesses) is single-digit MB cumulative.
- Disk usage starts to recover ~40 seconds after the test driver process exits and then recovers fully over ~10 seconds (e.g. 2.5G → 5.6G → 8.8G → 15G across 5 samples spaced 2s apart).
- All browser child processes are already reaped (pgrep finds 0 matches) by the time the disk starts recovering, so this is not "lingering processes holding mmap'd files."
- The same workload run on a local Ubuntu 24 machine does not reproduce the issue: df stays flat across 50+ minutes of the same tests.
- Reproduces on both:
  - native ubuntu-24.04 hosted runner
  - debian:12 container running on the ubuntu-24.04 runner (which shares the host's /)
- Behavior is non-deterministic: re-running the same code on the same workflow sometimes shows the leak and sometimes doesn't, suggesting it's tied to the state of the underlying host VM.
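One way to put a number on the gap described in the symptoms is to subtract a du-style sum from the filesystem's own used-byte count. The following is a minimal Python sketch, not the instrumentation actually used in the workflow. The function names are hypothetical, and it assumes Linux, where st_blocks is counted in 512-byte units.

```python
import os
import shutil


def fs_used_bytes(path="/"):
    """Used bytes as the filesystem reports them (comparable to df's 'Used' column)."""
    return shutil.disk_usage(path).used


def tree_allocated_bytes(root):
    """Sum allocated bytes of files under root without crossing mounts (like du -x)."""
    root_dev = os.lstat(root).st_dev
    seen = set()  # (device, inode) pairs, so hard-linked files are counted once
    total = 0
    for dirpath, dirnames, filenames in os.walk(root, onerror=lambda err: None):
        # Prune subdirectories that live on another filesystem (/proc, /sys, tmpfs, ...).
        kept = []
        for name in dirnames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue
            if st.st_dev == root_dev:
                kept.append(name)
        dirnames[:] = kept
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is not readable
            key = (st.st_dev, st.st_ino)
            if st.st_dev != root_dev or key in seen:
                continue
            seen.add(key)
            total += st.st_blocks * 512  # assumption: Linux 512-byte block units
    return total


def unaccounted_bytes(root="/"):
    """Used space that no visible file under root explains."""
    return fs_used_bytes(root) - tree_allocated_bytes(root)
```

Calling unaccounted_bytes("/") before and during a test run would show whether the difference grows. The walk skips directories it cannot read, so it should run as root to avoid overstating the gap, and it ignores directory metadata, so a small constant offset from du is expected.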
Workflow context
Best guess
The disk is being held by something on the host that isn't visible from inside the runner's PID namespace — possibly the runner agent's diagnostic/log buffer being flushed/rotated periodically. The ~40s recovery delay is consistent with a periodic flush cycle on the agent's side. I have no way to verify this from inside the runner.
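To check whether the recovery delay is stable from run to run (which a periodic flush would predict), the post-exit polling can be reduced to two small helpers. This is a hedged Python sketch under the assumption that polling starts immediately after the test driver exits; the function names and the 1 GiB threshold are hypothetical and not taken from the workflow.

```python
import shutil
import time


def poll_used(path="/", interval_s=2.0, duration_s=120.0):
    """Sample used bytes on path every interval_s seconds for about duration_s seconds."""
    start = time.monotonic()
    samples = []
    while True:
        elapsed = time.monotonic() - start
        samples.append((elapsed, shutil.disk_usage(path).used))
        if elapsed >= duration_s:
            return samples
        time.sleep(interval_s)


def first_drop(samples, min_drop_bytes=1 << 30):
    """Elapsed seconds of the first sample at least min_drop_bytes below the first one.

    Returns None when usage never drops by that much.
    """
    if not samples:
        return None
    baseline = samples[0][1]
    for elapsed, used in samples:
        if baseline - used >= min_drop_bytes:
            return elapsed
    return None
```

Comparing first_drop(poll_used("/")) across several jobs would show whether the delay clusters around ~40 seconds or varies widely. A tight cluster would be consistent with a timer on the host side, although it would not prove which component owns that timer.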
What I'd like to find out
- Is this a known runner-agent behavior?
- Is there documented expected disk overhead from the agent during high-output test runs?
- If applicable, is there a way to reduce/disable this buffering so jobs with heavy stdout don't trip ENOSPC on the ~14 GB free root partition of the hosted runner?
Per-test df / du / lsof / /proc/<pid>/maps / /proc/<pid>/io samples and post-test polling data are available in disk-monitor-* artifacts on the linked PR's workflow runs. Happy to provide more or run additional diagnostics if helpful.
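For anyone who wants to repeat the /proc/<pid>/io measurement, the cumulative write_bytes figure can be collected roughly as follows. This is an illustrative Python sketch rather than the exact script behind the artifacts; the function names and the proc_root parameter are hypothetical.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            fields[key.strip()] = int(value)
    return fields


def total_write_bytes(proc_root="/proc"):
    """Sum write_bytes over every numeric PID directory under proc_root."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                total += parse_proc_io(f.read()).get("write_bytes", 0)
        except OSError:
            continue  # process exited, or its io file is not readable by this user
    return total
```

Only processes visible in the current PID namespace are counted, and reading another user's io file generally needs elevated privileges, so a low total here does not rule out writes made by processes outside the job's view.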
Workflow context: ubuntu-24.04 (most recent). disk-monitor-* artifacts: https://github.com/microsoft/playwright-browsers/pull/2271