df / reports up to 15 GB of disk used that no file or process accounts for on ubuntu-24.04 runner

While debugging a flaky CI failure (`ENOSPC: no space left on device`), I instrumented our workflow to sample disk usage once per test. The results were surprising, and I'd like to know whether this is known runner-agent behavior or something that should be filed with the image team instead.

## Symptoms

- `df /` reports up to **12–15 GB used** during the test run that **cannot be accounted for** by any standard tool:
  - `du -shx` of every top-level directory on `/` (sum of all directories) doesn't grow.
  - `lsof +L1` (deleted-but-open files) only ever shows kernel `/memfd:*` entries (in-RAM tmpfs, not on disk).
  - `cat /proc/<pid>/maps | grep "(deleted)"` across all PIDs only shows the same kernel memfd entries.
  - `cat /proc/<pid>/io` `write_bytes` for all processes (npm, node, our browser subprocesses) is single-digit MB cumulative.
- The disk usage **recovers fully ~40 seconds after the test driver process exits**, gradually over ~10 seconds (e.g. `2.5G → 5.6G → 8.8G → 15G` across 5 samples spaced 2s apart).
- All browser child processes are already reaped (`pgrep` returns 0) at the time the disk *starts* recovering — so this is **not** "lingering processes holding mmap'd files."
- The same workload run on a local Ubuntu 24 machine **does not reproduce** the problem: `df` stays flat across 50+ minutes of the same tests.
- Reproduces on both:
  - native `ubuntu-24.04` hosted runner
  - `debian:12` container running on the `ubuntu-24.04` runner (which shares the host's `/`)
- Behavior is non-deterministic: re-running the same code on the same workflow sometimes shows the leak and sometimes doesn't, suggesting it's tied to the state of the underlying host VM.
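
For anyone who wants to reproduce the accounting check, the comparison behind the first symptom can be sketched roughly as below. This is an illustrative Python sketch, not the instrumentation from the linked workflow (which calls `df`, `du`, and `lsof` directly). The function names `df_used_bytes`, `du_bytes`, and `total_write_bytes` are hypothetical, and the `proc_root` parameter exists only so the parser can be pointed at a test directory.

```python
import os

def df_used_bytes(path="/"):
    """Used bytes the way df derives them: (total - free) blocks * fragment size."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def du_bytes(root="/"):
    """Approximate `du -sx`: allocated bytes under root, on one filesystem only."""
    root_dev = os.lstat(root).st_dev
    seen = set()   # (device, inode) pairs, so hard links are counted once
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune subdirectories that live on another filesystem (/proc, tmpfs, ...).
        kept = []
        for name in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, name)).st_dev == root_dev:
                    kept.append(name)
            except OSError:
                pass  # entry vanished or is unreadable
        dirnames[:] = kept
        # Count the directory itself plus every non-directory entry in it.
        for name in [os.curdir] + filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue
            key = (st.st_dev, st.st_ino)
            if st.st_dev != root_dev or key in seen:
                continue
            seen.add(key)
            total += st.st_blocks * 512  # st_blocks is in 512-byte units on Linux
    return total

def total_write_bytes(proc_root="/proc"):
    """Sum the write_bytes counter from /proc/<pid>/io over all readable processes."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    if key == "write_bytes":
                        total += int(value)
        except OSError:
            continue  # process exited, or its io file is not readable by this user
    return total

if __name__ == "__main__":
    gap = df_used_bytes("/") - du_bytes("/")
    print(f"unaccounted bytes on /: {gap}")
    print(f"write_bytes across live processes: {total_write_bytes()}")
```

A gap that grows between samples while `du_bytes` and the `write_bytes` total stay flat corresponds to the behavior described above. Note that `total_write_bytes` only sees processes that are still alive and readable at sampling time, so it gives a lower bound rather than a complete picture.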

## Workflow context

- Workload: Playwright's WebKit (WPE) browser test suite (~30 min of browser test execution, many short-lived child processes).
- Image: `ubuntu-24.04` (most recent).
- Workflow + inline instrumentation: https://github.com/microsoft/playwright-browsers/blob/debian-test-disk-debug/.github/workflows/test-pr-webkit.yml
- PR with failing runs and `disk-monitor-*` artifacts: https://github.com/microsoft/playwright-browsers/pull/2271

## Best guess

The disk is being held by something on the host that isn't visible from inside the runner's PID namespace — possibly the runner agent's diagnostic/log buffer being flushed/rotated periodically. The ~40s recovery delay is consistent with a periodic flush cycle on the agent's side. I have no way to verify this from inside the runner.

## What I'd like to find out

- Is this a known runner-agent behavior?
- Is there documented expected disk overhead from the agent during high-output test runs?
- If applicable, is there a way to reduce/disable this buffering so jobs with heavy stdout don't trip `ENOSPC` on the ~14 GB free root partition of the hosted runner?

Per-test `df` / `du` / `lsof` / `/proc/<pid>/maps` / `/proc/<pid>/io` samples and post-test polling data are available in `disk-monitor-*` artifacts on the linked PR's workflow runs. Happy to provide more or run additional diagnostics if helpful.
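
The post-test polling can be approximated with a short loop like the one below. It is a minimal sketch under stated assumptions, not the code that produced the artifacts: `poll_used` and `recovery_window` are hypothetical names, and the 1 GiB `min_drop` threshold is an arbitrary choice for separating a real recovery from noise.

```python
import os
import time

def poll_used(path="/", samples=5, interval=2.0):
    """Sample used bytes on the filesystem holding `path`, `interval` seconds apart.

    Returns a list of (seconds_since_first_sample, used_bytes) tuples.
    """
    start = time.monotonic()
    timeline = []
    for i in range(samples):
        st = os.statvfs(path)
        used = (st.f_blocks - st.f_bfree) * st.f_frsize
        timeline.append((time.monotonic() - start, used))
        if i < samples - 1:
            time.sleep(interval)
    return timeline

def recovery_window(timeline, min_drop=1 << 30):
    """Locate the span over which used bytes fall.

    Returns (t_start, t_end, freed_bytes), where t_start is the last sample
    before the first drop of at least `min_drop` bytes and t_end is the sample
    with the lowest usage after it. Returns None if no such drop occurs.
    """
    for i in range(1, len(timeline)):
        prev_t, prev_used = timeline[i - 1]
        if prev_used - timeline[i][1] >= min_drop:
            end_t, end_used = min(timeline[i:], key=lambda s: s[1])
            return prev_t, end_t, prev_used - end_used
    return None

if __name__ == "__main__":
    # Start this right after the test driver exits; 30 samples cover about a minute.
    timeline = poll_used("/", samples=30, interval=2.0)
    print(recovery_window(timeline))
```

Starting the loop when the test driver exits makes `t_start` a direct measurement of the delay before recovery begins, and `t_end - t_start` a measurement of how long the recovery takes.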
