Self-hosted Worker hangs holding runner slot after CompleteJobAsync hits 'workflow instance not found'

## Summary

On a 3-host self-hosted macOS runner pool (Apple Silicon, v2.334.0), `Runner.Worker` processes regularly wedge at the end of a job. The build completes, but `JobRunner.CompleteJobAsync` fails with `TaskOrchestrationJobNotFoundException: workflow instance not found` and the configured retries are exhausted. The Worker then **fails to exit** and keeps the parent `Runner.Listener` slot in a `busy=true` state indefinitely. The Listener won't spawn a new Worker until the wedged one exits, so the affected runner's queue stalls until external intervention.

This appears to be in the V2 RunService completion path (`useV2Flow: true`, `RunServiceHttpClient.CompleteJobAsync`).

## Frequency / Impact

Auditing `_diag/Worker_*.log` files on one host (`ac-pro-01`) since April 22, 2026 shows **50 of 152 Worker invocations affected (32.9%)**. The other two hosts show comparable rates (51/?, 43/?). One occurrence today wedged all three Workers simultaneously, blocking CI for 3+ hours.

These weren't transient: each affected Worker stayed alive, holding the runner slot, until either a reboot or a watchdog we wrote killed it. Slot starvation is the symptom users see; the underlying API exception is what causes it.
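
For anyone who wants to run a similar audit on their own hosts, here is a minimal sketch of the counting step. It is not the exact script used for the numbers above: the function names are hypothetical, the `_diag` location depends on where the runner is installed, and it does not filter by date.

```python
from pathlib import Path

# Substring that appears in the Worker log when completion fails this way.
MARKER = "workflow instance not found"


def audit_worker_logs(diag_dir):
    """Return (affected, total) over the Worker_*.log files in diag_dir."""
    logs = sorted(Path(diag_dir).glob("Worker_*.log"))
    affected = sum(
        1
        for log in logs
        # errors="replace" keeps the scan going if a log has stray bytes.
        if MARKER in log.read_text(encoding="utf-8", errors="replace")
    )
    return affected, len(logs)


def affected_rate(affected, total):
    """Percentage of affected invocations; 0.0 when there are no logs."""
    return 100.0 * affected / total if total else 0.0
```

Calling `audit_worker_logs("<runner install dir>/_diag")` returns the pair that feeds `affected_rate`. Each `Worker_*.log` file is counted as one Worker invocation, which is an assumption about how the runner rotates its diagnostic logs.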

## Stack trace

```
[2026-05-13 19:50:39Z ERR  JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationJobNotFoundException: Job not found: 99e3e9b6-22cb-535b-b919-c97958bc7899. workflow instance not found
   at GitHub.Actions.RunService.WebApi.RunServiceHttpClient.CompleteJobAsync(Uri requestUri, Guid planId, Guid jobId, TaskResult conclusion, Dictionary`2 outputs, IList`1 stepResults, IList`1 jobAnnotations, String environmentUrl, IList`1 telemetry, String billingOwnerId, String infrastructureFailureCategory, CancellationToken cancellationToken)
   at GitHub.Runner.Common.RunServer.<>c__DisplayClass7_0.<<CompleteJobAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at GitHub.Runner.Common.RunnerService.<>c__DisplayClass12_0.<<RetryRequest>g__wrappedFunc|0>d.MoveNext()
--- End of stack trace from previous location ---
   at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxAttempts, Func`2 shouldRetry)
   at GitHub.Runner.Common.RunnerService.RetryRequest(Func`1 func, CancellationToken cancellationToken, Int32 maxAttempts, Func`2 shouldRetry)
   at GitHub.Runner.Worker.JobRunner.CompleteJobAsync(IRunServer runServer, IExecutionContext jobContext, AgentJobRequestMessage message, Nullable`1 taskResult)
```

Followed by:

```
[2026-05-13 19:50:39Z ERR  Worker] System.AggregateException: One or more errors occurred. (Job not found: ...)
... same chain wrapped ...
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
[2026-05-13 19:50:39Z ERR  Worker] #####################################################
```

After this point the Worker process **stays alive but does no further work** — `ps` shows it at ~0.1% CPU, no `xcodebuild` / `swift` descendants. It will not exit on its own; the `Runner.Listener` does not spawn a new Worker because the old one is still present.

## Expected behavior

When `CompleteJobAsync` exhausts its configured `maxAttempts` against an unrecoverable error (job not found on the server is *not* recoverable — the orchestrator has forgotten this job), the Worker should **exit with a non-zero status** so the Listener can spawn a fresh Worker and continue serving the queue.

## Actual behavior

The Worker logs the AggregateException, prints the `#####` separator line, and then ceases progress without terminating. Its parent Listener treats the slot as `busy`, the GitHub Actions API reports the runner as busy, and the queue stalls indefinitely.

## Environment

- Runner version: `2.334.0` (also reproduced on `2.333.1` per older `bin/` snapshots)
- OS: macOS Sequoia 15.x, Apple Silicon (M4 Pro / M4)
- Self-hosted, configured via `config.sh` against a public repo
- `useV2Flow: true` (per `.runner` config — RunService path)
- Service-mode (`actions.runner.*.plist` LaunchAgent)

## Workaround we deployed

External watchdog that `SIGKILL`s any `Runner.Worker` matching either gate:
1. Elapsed > 20 min AND no `xcodebuild`/`swift-driver`/`swift-frontend`/`swiftc`/`xctest`/`xcrun` descendant
2. Elapsed > 10 min AND a `tar`/`bsdtar`/`git`/`curl`/`gh` descendant has been running > 5 min with < 5 s of cumulative CPU (covers a separate but related `actions/cache` hang)

Within 4 hours of deployment the watchdog fired 6 times across the 3 hosts — every kill corresponded to this exact `workflow instance not found` retry-loop pattern. Source: https://github.com/lbgraham/claudine/tree/main/scripts/fleet/runner-watchdog
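
The two gates can be expressed as a pure decision function. This is an illustrative sketch, not the linked script: `Proc` and `should_kill` are hypothetical names, and gathering the process tree (for example from `ps`) is left out.

```python
from dataclasses import dataclass

# Descendants that indicate a build is genuinely in progress (gate 1).
BUILD_TOOLS = {"xcodebuild", "swift-driver", "swift-frontend",
               "swiftc", "xctest", "xcrun"}
# Descendants that tend to hang on I/O in the cache path (gate 2).
IO_TOOLS = {"tar", "bsdtar", "git", "curl", "gh"}


@dataclass
class Proc:
    name: str         # executable name of the descendant
    elapsed_s: float  # wall-clock seconds since it started
    cpu_s: float      # cumulative CPU seconds it has used


def should_kill(worker_elapsed_s, descendants):
    """Return True if a Runner.Worker matches either watchdog gate."""
    # Gate 1: older than 20 min with no build tool underneath it.
    has_build = any(d.name in BUILD_TOOLS for d in descendants)
    if worker_elapsed_s > 20 * 60 and not has_build:
        return True
    # Gate 2: older than 10 min with an I/O tool that has run for
    # more than 5 min while using less than 5 s of CPU.
    if worker_elapsed_s > 10 * 60:
        return any(
            d.name in IO_TOOLS and d.elapsed_s > 5 * 60 and d.cpu_s < 5
            for d in descendants
        )
    return False
```

Keeping the decision separate from process inspection makes the thresholds easy to unit test before pointing a `SIGKILL` at anything.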

A watchdog should not be necessary — the runner itself should bail out of the unrecoverable retry loop.

## Suggested fixes (in order of preference)

1. **Treat `TaskOrchestrationJobNotFoundException` as terminal in `RetryRequest`'s `shouldRetry` predicate.** The server has lost the workflow; retrying cannot succeed. Exit the Worker immediately with a logged failure.
2. **Independent of (1), have `Worker.Program.MainAsync` ensure `Environment.Exit(nonZero)` is called when `JobRunner.RunAsync` throws.** Today the Worker apparently catches or swallows the exception somewhere downstream and the process keeps running. Whatever the cleanup path is, it shouldn't leave the process resident.
3. **As defense in depth, the Listener should detect a Worker that hasn't sent a heartbeat or log line in > N minutes, then kill and respawn it.**
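
To make fixes (1) and (2) concrete, here is a language-neutral sketch in Python rather than the runner's C#. Every name in it is a hypothetical stand-in: `JobNotFoundError` plays the role of `TaskOrchestrationJobNotFoundException`, `retry_request` mirrors the shape of `RetryRequest` with its `shouldRetry` predicate, and `run_worker` stands in for the top-level entry point. It does not diagnose why the real process stays resident; it only shows the intended control flow.

```python
import time


class JobNotFoundError(Exception):
    """Stand-in for TaskOrchestrationJobNotFoundException."""


def is_retryable(exc):
    # Fix (1): a job the server no longer knows about is terminal.
    return not isinstance(exc, JobNotFoundError)


def retry_request(func, max_attempts=5, should_retry=is_retryable, delay_s=0.0):
    """Call func, retrying only errors that the predicate accepts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on a terminal error.
            if attempt == max_attempts or not should_retry(exc):
                raise
            time.sleep(delay_s)


def run_worker(complete_job):
    """Fix (2): always turn a failed completion into an exit code."""
    try:
        retry_request(complete_job)
        return 0
    except Exception:
        return 1
```

An entry point would then end with something like `sys.exit(run_worker(complete_job))`, so that a terminal failure ends the process instead of leaving it idle.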

Happy to provide additional `_diag` logs (Listener + Worker) on request — the affected hosts have ~3 weeks of history.

## Related

- #4357 (open, April 2026) — different symptom in Worker lifecycle, but adjacent area
- #3862 (closed, May 2025) — earlier "lost communication: NotFound" report
