Skip to content

fix(okhttp): capture scope at AsyncCall.<init> to keep traces intact under virtual-thread Dispatcher#11479

Open
dougqh wants to merge 2 commits into
masterfrom
fix/okhttp-virtual-thread-dispatcher-trace-contamination
Open

fix(okhttp): capture scope at AsyncCall.<init> to keep traces intact under virtual-thread Dispatcher#11479
dougqh wants to merge 2 commits into
masterfrom
fix/okhttp-virtual-thread-dispatcher-trace-contamination

Conversation

@dougqh
Copy link
Copy Markdown
Contributor

@dougqh dougqh commented May 27, 2026

What Does This Do

  • Adds AsyncCallInstrumentation that captures the active scope at AsyncCall.<init> (i.e. on the user thread that called Call.enqueue(callback)) and stores it in the shared ContextStore<Runnable, State>. RunnableInstrumentation then activates that state on AsyncCall.run() entry, overriding whatever scope TaskRunner inherited from a dispatcher-recursion path.
  • Adds an OkHttp end-to-end concurrent-burst regression test (16 parents × 4 requests through a virtual-thread Dispatcher with capacity 4) plus single-shot baselines. This test fails on master and v1.61.1 without the fix and passes with it.
  • Adds pure-executor regression coverage for Thread.ofVirtual().factory(), the .name(prefix, start) builder variant, recursive worker-thread submission, and propagation across virtual-thread unmount.

Motivation

Reported by DataDog/profiling-backend#8520, which swapped its OkHttp dispatcher from Executors.newCachedThreadPool(...) to Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name(prefix, start).factory()) and started seeing broken traces.

OkHttp's Dispatcher recursively promotes queued AsyncCalls from inside Dispatcher.finished(), which runs on a dispatcher worker thread with a different call's parent scope still active (re-instated by TaskRunner.run()). The existing TaskRunnerInstrumentation captures that wrong scope when the promoted call is submitted to the executor, so under a virtual-thread-per-task dispatcher — or any executor where this recursion exposes itself — okhttp.request spans cross-contaminate between concurrent caller traces. Single-shot requests don't trigger it because nothing is queued.

The right capture point is the moment the user enqueued the call, where the right parent is active. That's exactly AsyncCall.<init>.

Deep dive: failure mode and propagation walk

Why the old code worked

Executors.newCachedThreadPool(...) returns a java.util.concurrent.ThreadPoolExecutor, covered by ThreadPoolExecutorInstrumentation. That instrumentation captures the active continuation into a State keyed by the AsyncCall at execute() time, then beforeExecute activates it on the worker. The okhttp.request span is created inside AsyncCall.run() with the parent scope reliably active.

Why the new code breaks

Executors.newThreadPerTaskExecutor(Thread.ofVirtual()...) returns java.util.concurrent.ThreadPerTaskExecutor, which is not a ThreadPoolExecutor:

  • ThreadPoolExecutorInstrumentation's extendsClass(named("java.util.concurrent.ThreadPoolExecutor")) doesn't match. No advice fires on its execute(...).
  • JavaExecutorInstrumentation doesn't match either — ThreadPerTaskExecutor is not in its knownMatchingTypes and dd.trace.executors.all is off by default.

The only things that fire are:

  • TaskRunnerInstrumentation — advises ThreadPerTaskExecutor$TaskRunner.<init> and run(). The constructor stashes a continuation in a State keyed by the TaskRunner, not the user's AsyncCall.
  • VirtualThreadInstrumentation — captures the whole Context at VirtualThread.<init>, swaps it in/out on each mount/unmount.

In principle this propagates the scope. In practice, both fire on whatever thread calls executor.execute(...). That's the user thread for the first call from a given caller — but it's a dispatcher worker for any call that gets queued.

Dispatcher recursion → cross-trace contamination

  1. Parent_A enqueues 4 calls. Dispatcher capacity is 4. They start.
  2. Parent_B (..., Parent_P) enqueue more calls. They queue in readyAsyncCalls.
  3. A call from Parent_A finishes on its virtual thread. Inside AsyncCall.run()'s finally, dispatcher.finished(this) runs — still on Parent_A's worker thread with Parent_A's scope active (reinstated by TaskRunner.run).
  4. finished() calls promoteAndExecute(), which calls executor.execute(nextCall) for a queued call belonging to some other parent.
  5. The agent's TaskRunnerInstrumentation.Construct captures whatever scope is active on the current worker — which is Parent_A, not the parent that actually called enqueue() for nextCall.
  6. That promoted call runs on a new virtual thread with Parent_A's scope, and its okhttp.request span lands in Parent_A's trace.

Repeat at scale → Parent_A's trace accumulates strays; Parent_B / Parent_C / etc. end up with empty traces. The single-shot tests miss it entirely because nothing is in the ready queue when enqueue() returns.

Empirical confirmation

Concurrent-burst test failures before the fix (same source, just on different branches):

Branch Failure
master trace rooted at parent 1 has wrong child count ==> expected: <4> but was: <0> (that parent's calls were stolen, its trace ended up empty)
v1.61.1 trace rooted at parent 1 has wrong child count ==> expected: <4> but was: <17> (that parent absorbed 13 strays from siblings)

How the fix propagates correctly

With AsyncCallInstrumentation.Construct in place:

  1. User thread, Parent_Y active: realCall.enqueue(cb)new AsyncCall(cb)Construct advice fires → captures Parent_Y's continuation on the AsyncCall in ContextStore<Runnable, State>. ✓
  2. AsyncCall is handed to the dispatcher. May execute immediately or queue.
  3. Later: dispatcher worker thread, Parent_A active. executor.execute(asyncCall_Y)TaskRunner.<init> → captures Parent_A on the TaskRunner. (Irrelevant noise — we don't try to fix TaskRunner.)
  4. Virtual thread runs TaskRunner.run():
    • TaskRunnerInstrumentation.Run activates Parent_A on the worker.
    • TaskRunner.run() calls asyncCall_Y.run().
    • RunnableInstrumentation.RunnableAdvice fires on entry → reads the State stored in step 1 → activates Parent_Y on top of Parent_A.
  5. getResponseWithInterceptorChain() runs the TracingInterceptoractiveSpan() returns Parent_Y → okhttp.request is parented under Parent_Y. ✓
  6. On exit: RunnableAdvice closes Parent_Y's scope, then TaskRunner closes Parent_A's. Stack unwinds cleanly.

Two key properties make this safe:

  • AdviceUtils.capture only writes a continuation if one isn't already there and async propagation is on, so we never overwrite a previously captured state.
  • RunnableInstrumentation's advice runs inside TaskRunner.run(), after TaskRunner's activation. That inner scope wins for the entire duration of asyncCall.run() — which is exactly when the okhttp.request span is created.

Compatibility

Covers both class locations:

  • OkHttp 3.x: okhttp3.RealCall$AsyncCall
  • OkHttp 4.x: okhttp3.internal.connection.RealCall$AsyncCall

Muzzle passes against 24 OkHttp versions in [3.0.0, 5.3.2]. The capture re-uses the existing "okhttp" / "okhttp-3" instrumentation names — no new feature flag, so OkHttp tracing being enabled implies this capture is enabled.

Test plan

  • :dd-java-agent:instrumentation:okhttp:okhttp-3.0:vthread21Test — new suite, 3 tests pass
  • :dd-java-agent:instrumentation:okhttp:okhttp-3.0:test — IAST suite, 2 tests pass
  • :dd-java-agent:instrumentation:okhttp:okhttp-3.0:forkedTest — 168 pre-existing OkHttp forked tests still pass (sync + async, V0 + V1)
  • :dd-java-agent:instrumentation:java:java-concurrent:java-concurrent-21.0:testThreadPerTaskExecutorVirtualThreadTest (4 new) + VirtualThreadPerTaskExecutorTest (9 pre-existing) all pass
  • :dd-java-agent:instrumentation:okhttp:okhttp-3.0:muzzle — passes against 24 OkHttp versions
  • Verified in profiling-backend staging deploy

🤖 Generated with Claude Code

@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 27, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 5 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | test_inst: [tip, 8/8]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). 4 tests failed due to TimeoutException in OkHttpVirtualThreadDispatcherTest.java:292.

DataDog/apm-reliability/dd-trace-java | test_smoke_graalvm: [graalvm25]   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Failed to read workspace metadata from '/go/src/github.com/DataDog/apm-reliability/dd-trace-java/.gradle/caches/8.14.5/groovy-dsl/0129974bef05e596ff523f39885deddc/metadata.bin'.

DataDog/apm-reliability/dd-trace-java | test_inst: [25, 8/8]   View in Datadog   GitLab

See error 4 tests failed in OkHttpVirtualThreadDispatcherTest. Execution failed for task ':dd-java-agent:instrumentation:okhttp:okhttp-3.0:vthread21Test'.

View all 5 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 84979ca | Docs | Datadog PR Page | Give us feedback!

@dougqh dougqh marked this pull request as ready for review May 27, 2026 18:46
@dougqh dougqh requested review from a team as code owners May 27, 2026 18:46
@dougqh dougqh requested review from sarahchen6 and vandonr and removed request for a team May 27, 2026 18:46
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 27, 2026

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

  • Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

@dd-octo-sts dd-octo-sts Bot added the tag: ai generated Largely based on code generated by an AI or LLM label May 27, 2026
@dougqh dougqh requested review from PerfectSlayer and mcculls May 27, 2026 18:46
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f5eb8a9e3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +25 to +28
tasks.named("vthread21Test", Test) {
javaLauncher = javaToolchains.launcherFor {
languageVersion = JavaLanguageVersion.of(21)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add JDK 21 constraints for the virtual-thread suite

This suite is forced to run with a JDK 21 launcher, but it does not declare a task-specific testJvmConstraints { minJavaVersion = JavaVersion.VERSION_21 }. The repo's test-JVM plugin gates tasks from the requested -PtestJvm, not from a later launcher override, so lower-JDK shards such as -PtestJvm=8/11 still consider vthread21Test allowed and then run or resolve this JDK 21-only suite from check. Other per-suite JDK 21 tests in this repo add the constraint on the Test task; doing the same here keeps lower-JDK checks from unexpectedly executing this virtual-thread coverage.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DataDog fix this

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only run on private repositories.

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 27, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite Status
Startup 🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

Scenario This PR master Change
insecure-bank / iast 13,953 ms 13,968 ms -0.1%
insecure-bank / tracing 12,912 ms 12,960 ms -0.4%
petclinic / appsec 16,465 ms 16,265 ms +1.2%
petclinic / iast 15,503 ms 16,510 ms -6.1%
petclinic / profiling 16,199 ms 16,311 ms -0.7%
petclinic / tracing 14,991 ms 15,201 ms -1.4%

Commit: 84979cad · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@dougqh dougqh added type: bug Bug report and fix inst:okhttp labels May 27, 2026
@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 28, 2026

BitsAI applied the CI fix

The fix has been applied to this PR without validation.

Progress

🟢 Investigated · 🟢 Fix prepared · ⚪ Validation skipped · 🟢 Applied


Commit: 8f5eb8a

View session in Datadog

…under virtual-thread Dispatcher load

OkHttp's Dispatcher recursively promotes queued AsyncCalls from inside
finished(), running on a worker thread that has a *different* call's
parent scope active. The existing TaskRunnerInstrumentation captures
that wrong scope when the promoted call is executed, so under a
virtual-thread-per-task Dispatcher (or any executor where the same
recursion exposes the issue) okhttp.request spans cross-contaminate
between concurrent caller traces. Single-shot requests don't trigger
this because nothing is queued.

Fix: add AsyncCallInstrumentation that captures the active scope at
AsyncCall.<init> (i.e. the user thread that ran enqueue(), where the
right parent is active) and stores it in the shared
ContextStore<Runnable, State>. RunnableInstrumentation then activates
that state on AsyncCall.run() entry, overriding whatever scope
TaskRunner inherited from the dispatcher-recursion path.

Covers both class locations (3.x and 4.x). Muzzle passes against 24
OkHttp versions in [3.0.0, 5.3.2]. All 168 pre-existing forked OkHttp
tests still pass.

Tests added:
  * okhttp-3.0/src/vthread21Test/.../OkHttpVirtualThreadDispatcherTest:
    concurrent-burst test (16 parents x 4 requests through a virtual-
    thread Dispatcher with capacity 4) plus two single-shot baselines.
    The concurrent-burst test fails on master and v1.61.1 without the
    fix and passes with it.
  * java-concurrent-21.0/src/test/java/ThreadPerTaskExecutorVirtualThreadTest:
    pure-executor regression coverage for Thread.ofVirtual().factory(),
    the .name(prefix, start) builder variant, recursive submission, and
    propagation across virtual-thread unmount.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh force-pushed the fix/okhttp-virtual-thread-dispatcher-trace-contamination branch from eaf3d86 to 58311a9 Compare May 29, 2026 03:59
Copy link
Copy Markdown
Contributor

@amarziali amarziali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled the advice and I had two tests over 3 still green. Only concurrentVirtualThreadPerTaskDispatcher_keepsEachTraceSeparate failed. I can suggest to review what's tested and also there are okhttp 4.x related classes that are matched but the test are only run on 3.11.0

Comment on lines +31 to +33
tasks.named("check") {
dependsOn "vthread21Test"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is needed? Normally you should use the testJvmConstraints to asses min jvm version for testing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that doesn't quite work here because it is a mix of okhttp and virtual threads that we need to test. The fix is with okhttp, but using virtual threads requires Java 21.


// Separate suite for tests that need the JDK 21+ virtual-thread API. Kept apart from the
// default test suite so that the existing OkHttp 3.0 tests keep their Java 1.8 baseline.
addTestSuite('vthread21Test')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test can simply put under test - no need to have a different test suite

public String[] knownMatchingTypes() {
return new String[] {
"okhttp3.RealCall$AsyncCall", // OkHttp 3.x
"okhttp3.internal.connection.RealCall$AsyncCall", // OkHttp 4.x
Copy link
Copy Markdown
Contributor

@amarziali amarziali May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the gradle build there are no tests at all for 4.x. So how this can be verified?

* {@code TracingInterceptor.intercept()} on the executor thread using {@code activeSpan()}, so "is
* the parent scope active on the worker?" is the entire question.
*/
class ThreadPerTaskExecutorVirtualThreadTest extends AbstractInstrumentationTest {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to okhttp (directly). I would recommend to add in a different PR

…t suite

- Replace javaLauncher override with testJvmConstraints so lower-JDK
  shards skip vthread21Test rather than forcing JDK 21
- Drop OkHttp 4.x from AsyncCallInstrumentation.knownMatchingTypes()
  until test coverage exists; trim stale 4.x javadoc
- Remove ThreadPerTaskExecutorVirtualThreadTest (no OkHttp dependency,
  belongs in a separate PR)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

inst:okhttp tag: ai generated Largely based on code generated by an AI or LLM type: bug Bug report and fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants