Summary
On a 3-host self-hosted macOS runner pool (Apple Silicon, v2.334.0), Runner.Worker processes regularly become wedged at the end of a job: the build completes, but JobRunner.CompleteJobAsync fails with TaskOrchestrationJobNotFoundException: workflow instance not found, the configured retries exhaust, and the Worker then fails to exit — it keeps the parent Runner.Listener slot in a busy=true state indefinitely. The Listener won't spawn a new Worker until the wedged one exits, so the affected runner's queue stalls until external intervention.
This appears to be in the V2 RunService completion path (useV2Flow: true, RunServiceHttpClient.CompleteJobAsync).
Frequency / Impact
An audit of the _diag/Worker_*.log files on one host (ac-pro-01) since April 22, 2026 found 50 of 152 Worker invocations affected (32.9%). The other two hosts show comparable rates (51/?, 43/?). One occurrence today wedged all three Workers simultaneously, blocking CI for 3+ hours.
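For anyone who wants to reproduce the count on their own hosts, here is a minimal sketch of this kind of audit. It assumes a Worker invocation counts as "affected" when its log contains the exception name from the stack trace below; the function names are illustrative and this is not the exact script we ran.

```python
from pathlib import Path

# Marker string taken from the Worker log excerpt in this report.
MARKER = "TaskOrchestrationJobNotFoundException"


def is_affected(log_text: str) -> bool:
    """True if a Worker log contains the job-not-found completion failure."""
    return MARKER in log_text


def audit(diag_dir: str) -> tuple[int, int]:
    """Return (affected, total) over the Worker_*.log files in diag_dir."""
    affected = 0
    total = 0
    for path in sorted(Path(diag_dir).glob("Worker_*.log")):
        total += 1
        # errors="replace" keeps the scan going if a log has odd bytes.
        if is_affected(path.read_text(errors="replace")):
            affected += 1
    return affected, total


def summarize(affected: int, total: int) -> str:
    """Format the counts as a one-line summary with a percentage."""
    rate = 100 * affected / total if total else 0.0
    return f"{affected} of {total} Worker logs affected ({rate:.1f}%)"
```

Calling `summarize(*audit("_diag"))` from the runner's install directory prints a one-line summary per host.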
These weren't transient: each affected Worker stayed alive, holding the runner slot, until either a reboot or a watchdog we wrote killed it. Slot starvation is the symptom users see; the underlying API exception is what causes it.
Stack trace
[2026-05-13 19:50:39Z ERR JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationJobNotFoundException: Job not found: 99e3e9b6-22cb-535b-b919-c97958bc7899. workflow instance not found
at GitHub.Actions.RunService.WebApi.RunServiceHttpClient.CompleteJobAsync(Uri requestUri, Guid planId, Guid jobId, TaskResult conclusion, Dictionary`2 outputs, IList`1 stepResults, IList`1 jobAnnotations, String environmentUrl, IList`1 telemetry, String billingOwnerId, String infrastructureFailureCategory, CancellationToken cancellationToken)
at GitHub.Runner.Common.RunServer.<>c__DisplayClass7_0.<<CompleteJobAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
at GitHub.Runner.Common.RunnerService.<>c__DisplayClass12_0.<<RetryRequest>g__wrappedFunc|0>d.MoveNext()
--- End of stack trace from previous location ---
at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxAttempts, Func`2 shouldRetry)
at GitHub.Runner.Common.RunnerService.RetryRequest(Func`1 func, CancellationToken cancellationToken, Int32 maxAttempts, Func`2 shouldRetry)
at GitHub.Runner.Worker.JobRunner.CompleteJobAsync(IRunServer runServer, IExecutionContext jobContext, AgentJobRequestMessage message, Nullable`1 taskResult)
Followed by:
[2026-05-13 19:50:39Z ERR Worker] System.AggregateException: One or more errors occurred. (Job not found: ...)
... same chain wrapped ...
at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
[2026-05-13 19:50:39Z ERR Worker] #####################################################
After this point the Worker process stays alive but does no further work — ps shows it at ~0.1% CPU, no xcodebuild / swift descendants. It will not exit on its own; the Runner.Listener does not spawn a new Worker because the old one is still present.
Expected behavior
When CompleteJobAsync exhausts its configured maxAttempts against an unrecoverable error (job not found on the server is not recoverable — the orchestrator has forgotten this job), the Worker should exit with a non-zero status so the Listener can spawn a fresh Worker and continue serving the queue.
Actual behavior
The Worker logs the AggregateException, prints the ##### separator line, and then ceases progress without terminating. Its parent Listener treats the slot as busy, the GitHub Actions API reports the runner as busy, and the queue stalls indefinitely.
Environment
- Runner version: 2.334.0 (also reproduced on 2.333.1 per older bin/ snapshots)
- OS: macOS Sequoia 15.x, Apple Silicon (M4 Pro / M4)
- Self-hosted, configured via config.sh against a public repo
- useV2Flow: true (per .runner config; RunService path)
- Service-mode (actions.runner.*.plist LaunchAgent)
Workaround we deployed
External watchdog that SIGKILLs any Runner.Worker matching either gate:
- Elapsed > 20 min AND no xcodebuild/swift-driver/swift-frontend/swiftc/xctest/xcrun descendant
- Elapsed > 10 min AND a tar/bsdtar/git/curl/gh descendant has been running > 5 min with < 5 s of cumulative CPU (covers a separate but related actions/cache hang)
Within 4 hours of deployment the watchdog fired 6 times across the 3 hosts — every kill corresponded to this exact workflow instance not found retry-loop pattern. Source: https://github.com/lbgraham/claudine/tree/main/scripts/fleet/runner-watchdog
A watchdog should not be necessary — the runner itself should bail out of the unrecoverable retry loop.
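The two gates can be written as a pure predicate over a Worker's elapsed time and its descendant processes. The sketch below is a simplified illustration of the rules as described above, not the watchdog source linked in this section; the `Proc` type and `should_kill` name are hypothetical, and collecting the process data (for example via ps) is left out.

```python
from dataclasses import dataclass

# Process names from the two gates described above.
BUILD_TOOLS = {"xcodebuild", "swift-driver", "swift-frontend",
               "swiftc", "xctest", "xcrun"}
IO_TOOLS = {"tar", "bsdtar", "git", "curl", "gh"}


@dataclass
class Proc:
    """A descendant process of a Runner.Worker (hypothetical shape)."""
    name: str
    elapsed_s: float  # wall-clock seconds since the process started
    cpu_s: float      # cumulative CPU seconds consumed


def should_kill(worker_elapsed_s: float, descendants: list[Proc]) -> bool:
    """Return True if the Worker matches either watchdog gate."""
    # Gate 1: running > 20 min with no build-tool descendant.
    if worker_elapsed_s > 20 * 60 and not any(
        p.name in BUILD_TOOLS for p in descendants
    ):
        return True
    # Gate 2: running > 10 min with an I/O-tool descendant that has been
    # alive > 5 min but has used < 5 s of CPU in total.
    if worker_elapsed_s > 10 * 60 and any(
        p.name in IO_TOOLS and p.elapsed_s > 5 * 60 and p.cpu_s < 5
        for p in descendants
    ):
        return True
    return False
```

A wedged Worker from this report (over 20 minutes old, no descendants at all) matches the first gate: `should_kill(25 * 60, [])` is true.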
Suggested fixes (in order of preference)
1. Treat TaskOrchestrationJobNotFoundException as terminal in RetryRequest's shouldRetry predicate. The server has lost the workflow; retrying cannot succeed. Exit the Worker immediately with a logged failure.
2. Independent of (1), have Worker.Program.MainAsync ensure Environment.Exit(nonZero) is called when JobRunner.RunAsync throws. Today the Worker apparently catches/swallows the exception somewhere downstream and the process keeps running. Whatever the cleanup path is, it shouldn't leave the process resident.
3. As defense in depth, the Listener should detect a Worker that hasn't sent a heartbeat / log line in > N minutes and kill+respawn it.
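To illustrate the first two suggestions, here is a rough Python stand-in for the retry logic. It is not the runner's C# code: `JobNotFoundError`, `retry_request`, and `complete_job_exit_code` are hypothetical names that only mirror the maxAttempts and shouldRetry parameters visible in the stack trace. The point is that a not-found error short-circuits the retry loop, and any completion failure maps to a non-zero exit code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class JobNotFoundError(Exception):
    """Stand-in for TaskOrchestrationJobNotFoundException."""


def should_retry(exc: Exception) -> bool:
    """Job-not-found is terminal: the server no longer knows the job."""
    return not isinstance(exc, JobNotFoundError)


def retry_request(
    func: Callable[[], T],
    max_attempts: int = 5,
    should_retry: Callable[[Exception], bool] = should_retry,
    delay_s: float = 0.0,
) -> T:
    """Call func, retrying only errors that the predicate allows."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on a terminal error.
            if attempt == max_attempts or not should_retry(exc):
                raise
            time.sleep(delay_s)
    raise AssertionError("unreachable")


def complete_job_exit_code(complete: Callable[[], None]) -> int:
    """Return the exit code the process should terminate with."""
    try:
        retry_request(complete)
        return 0
    except Exception:
        # Whatever went wrong, the process must not stay resident.
        return 1
```

In a real entry point the result would be passed to `sys.exit(...)`, so the parent always sees the child terminate, whether completion succeeded or not.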
Happy to provide additional _diag logs (Listener + Worker) on request — the affected hosts have ~3 weeks of history.
Related