Self-hosted Worker hangs holding runner slot after CompleteJobAsync hits 'workflow instance not found' #4418

@lbgraham

Summary

On a 3-host self-hosted macOS runner pool (Apple Silicon, v2.334.0), Runner.Worker processes regularly become wedged at the end of a job: the build completes, but JobRunner.CompleteJobAsync fails with TaskOrchestrationJobNotFoundException: workflow instance not found, the configured retries exhaust, and the Worker then fails to exit — it keeps the parent Runner.Listener slot in a busy=true state indefinitely. The Listener won't spawn a new Worker until the wedged one exits, so the affected runner's queue stalls until external intervention.

This appears to be in the V2 RunService completion path (useV2Flow: true, RunServiceHttpClient.CompleteJobAsync).

Frequency / Impact

Auditing _diag/Worker_*.log files across one host (ac-pro-01) since April 22, 2026: 50 of 152 Worker invocations affected (32.9%). Comparable rates on the other two hosts (51/?, 43/?). One occurrence today wedged all three Workers simultaneously, blocking CI for 3+ hours.
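The audit above can be approximated with a short script. This is a minimal sketch, not the exact procedure used for the numbers in this report: it assumes each Worker_*.log file corresponds to one Worker invocation, it treats any log containing the exception name from the stack trace as affected, and it does not filter by date. The function names are illustrative.

```python
from pathlib import Path

# Marker taken from the stack trace in this report.
MARKER = "TaskOrchestrationJobNotFoundException"


def audit_worker_logs(diag_dir):
    """Return (affected, total) over Worker_*.log files in diag_dir.

    Assumption: one log file per Worker invocation.
    """
    affected = total = 0
    for log in sorted(Path(diag_dir).glob("Worker_*.log")):
        total += 1
        # errors="replace" keeps the scan going if a log has stray bytes.
        if MARKER in log.read_text(encoding="utf-8", errors="replace"):
            affected += 1
    return affected, total


def affected_rate(affected, total):
    """Percentage of affected invocations, rounded to one decimal place."""
    return round(100.0 * affected / total, 1) if total else 0.0
```

Calling audit_worker_logs on a runner's _diag directory yields the two counts, and affected_rate turns them into the percentage quoted per host.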

These weren't transient — each affected Worker stayed alive holding the runner-slot until either reboot or a watchdog we wrote killed it. Slot starvation is the symptom users see; the underlying API exception is what causes it.

Stack trace

[2026-05-13 19:50:39Z ERR  JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationJobNotFoundException: Job not found: 99e3e9b6-22cb-535b-b919-c97958bc7899. workflow instance not found
   at GitHub.Actions.RunService.WebApi.RunServiceHttpClient.CompleteJobAsync(Uri requestUri, Guid planId, Guid jobId, TaskResult conclusion, Dictionary`2 outputs, IList`1 stepResults, IList`1 jobAnnotations, String environmentUrl, IList`1 telemetry, String billingOwnerId, String infrastructureFailureCategory, CancellationToken cancellationToken)
   at GitHub.Runner.Common.RunServer.<>c__DisplayClass7_0.<<CompleteJobAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at GitHub.Runner.Common.RunnerService.<>c__DisplayClass12_0.<<RetryRequest>g__wrappedFunc|0>d.MoveNext()
--- End of stack trace from previous location ---
   at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxAttempts, Func`2 shouldRetry)
   at GitHub.Runner.Common.RunnerService.RetryRequest(Func`1 func, CancellationToken cancellationToken, Int32 maxAttempts, Func`2 shouldRetry)
   at GitHub.Runner.Worker.JobRunner.CompleteJobAsync(IRunServer runServer, IExecutionContext jobContext, AgentJobRequestMessage message, Nullable`1 taskResult)

Followed by:

[2026-05-13 19:50:39Z ERR  Worker] System.AggregateException: One or more errors occurred. (Job not found: ...)
... same chain wrapped ...
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
[2026-05-13 19:50:39Z ERR  Worker] #####################################################

After this point the Worker process stays alive but does no further work: ps shows it at ~0.1% CPU, with no xcodebuild / swift descendants. It will not exit on its own; the Runner.Listener does not spawn a new Worker because the old one is still present.

Expected behavior

When CompleteJobAsync exhausts its configured maxAttempts against an unrecoverable error (job not found on the server is not recoverable — the orchestrator has forgotten this job), the Worker should exit with a non-zero status so the Listener can spawn a fresh Worker and continue serving the queue.

Actual behavior

The Worker logs the AggregateException, prints the ##### separator line, and then ceases progress without terminating. Its parent Listener treats the slot as busy, the GitHub Actions API reports the runner as busy, and the queue stalls indefinitely.

Environment

  • Runner version: 2.334.0 (also reproduced on 2.333.1 per older bin/ snapshots)
  • OS: macOS Sequoia 15.x, Apple Silicon (M4 Pro / M4)
  • Self-hosted, configured via config.sh against a public repo
  • useV2Flow: true (per .runner config — RunService path)
  • Service-mode (actions.runner.*.plist LaunchAgent)

Workaround we deployed

External watchdog that SIGKILLs any Runner.Worker matching either gate:

  1. Elapsed > 20 min AND no xcodebuild/swift-driver/swift-frontend/swiftc/xctest/xcrun descendant
  2. Elapsed > 10 min AND a tar/bsdtar/git/curl/gh descendant has been running > 5 min with < 5 s of cumulative CPU (covers a separate but related actions/cache hang)
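The two gates can be expressed as a pure decision function. This is a hedged sketch of the logic as described here, not the linked watchdog's actual source: the Proc type and should_kill name are hypothetical, descendants is assumed to be a flat list of every process beneath the Worker, and collecting that data (for example from ps) is left out.

```python
from dataclasses import dataclass, field
from typing import List

# Process names taken from the two gates described above.
BUILD_TOOLS = {"xcodebuild", "swift-driver", "swift-frontend", "swiftc", "xctest", "xcrun"}
IO_TOOLS = {"tar", "bsdtar", "git", "curl", "gh"}


@dataclass
class Proc:
    """Hypothetical process snapshot; times are in seconds."""
    name: str
    elapsed_s: float
    cpu_s: float = 0.0
    descendants: List["Proc"] = field(default_factory=list)


def should_kill(worker: Proc) -> bool:
    """Return True if a Runner.Worker snapshot matches either gate."""
    names = {d.name for d in worker.descendants}
    # Gate 1: Worker older than 20 min with no build tool beneath it.
    if worker.elapsed_s > 20 * 60 and not (names & BUILD_TOOLS):
        return True
    # Gate 2: Worker older than 10 min with an I/O helper that has run
    # for more than 5 min while using under 5 s of cumulative CPU.
    if worker.elapsed_s > 10 * 60:
        for d in worker.descendants:
            if d.name in IO_TOOLS and d.elapsed_s > 5 * 60 and d.cpu_s < 5:
                return True
    return False
```

Keeping the decision separate from process enumeration makes the thresholds easy to test without spawning real processes.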

Within 4 hours of deployment the watchdog fired 6 times across the 3 hosts; every kill corresponded to this exact 'workflow instance not found' retry-loop pattern. Source: https://github.com/lbgraham/claudine/tree/main/scripts/fleet/runner-watchdog

A watchdog should not be necessary — the runner itself should bail out of the unrecoverable retry loop.

Suggested fixes (in order of preference)

  1. Treat TaskOrchestrationJobNotFoundException as terminal in RetryRequest's shouldRetry predicate. The server has lost the workflow; retrying cannot succeed. Exit the Worker immediately with a logged failure.
  2. Independent of (1), have Worker.Program.MainAsync ensure Environment.Exit(nonZero) is called when JobRunner.RunAsync throws. Today the Worker apparently catches or swallows the exception somewhere downstream and the process keeps running. Whatever the cleanup path is, it shouldn't leave the process resident.
  3. As defense in depth, the Listener should detect a Worker that hasn't sent a heartbeat / log line in > N minutes and kill+respawn it.
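To make fixes 1 and 2 concrete, here is a language-neutral sketch in Python. It is not the runner's C# code: JobNotFoundError stands in for TaskOrchestrationJobNotFoundException, retry_request and should_retry only mirror the names seen in the stack trace, max_attempts=5 is an arbitrary placeholder, and backoff between attempts is omitted for brevity.

```python
import sys


class JobNotFoundError(Exception):
    """Stand-in for TaskOrchestrationJobNotFoundException."""


def should_retry(exc):
    # Fix 1: a job the server no longer knows about cannot be
    # completed by retrying, so treat it as terminal.
    return not isinstance(exc, JobNotFoundError)


def retry_request(func, max_attempts=5, should_retry=should_retry):
    """Call func up to max_attempts times; re-raise terminal errors at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts or not should_retry(exc):
                raise


def main(complete_job):
    """Fix 2: map any failure to a non-zero exit code instead of lingering."""
    try:
        retry_request(complete_job)
    except Exception as exc:
        print(f"job completion failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

A process entry point would call sys.exit(main(...)) so that a failed completion always ends the process with a non-zero status, whichever fix catches it first.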

Happy to provide additional _diag logs (Listener + Worker) on request — the affected hosts have ~3 weeks of history.
