Skip to content

KAFKA-20585 : Share Fetch latency metric incorrectly updated during ShareAcknowledgeResponse#22296

Open
muralibasani wants to merge 1 commit into
apache:trunkfrom
muralibasani:KAFKA-20585
Open

KAFKA-20585 : Share Fetch latency metric incorrectly updated during ShareAcknowledgeResponse#22296
muralibasani wants to merge 1 commit into
apache:trunkfrom
muralibasani:KAFKA-20585

Conversation

@muralibasani
Copy link
Copy Markdown
Contributor

Ref : https://issues.apache.org/jira/browse/KAFKA-20585

fetch latency metric is recorded twice. In handleShareFetchSuccess and handleShareAcknowledgeSuccess.

  • Removing from handleShareAcknowledgeSuccess

@github-actions github-actions Bot added triage PRs from the community consumer clients small Small PRs labels May 15, 2026
Copy link
Copy Markdown
Contributor

@nileshkumar3 nileshkumar3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for PR, +1 on the fix.

It looks like there is no dedicated ack latency sensor in ShareFetchMetricsManager today. If ack latency is useful to track, this may be worth a follow-up JIRA rather than reusing fetch metrics.

Also, consider adding a test asserting that fetch-rate/fetch-total do not change after a successful ShareAcknowledge response, to guard against regression.

@smjn smjn requested a review from apoorvmittal10 May 16, 2026 20:16
@github-actions github-actions Bot removed the triage PRs from the community label May 17, 2026
@muralibasani
Copy link
Copy Markdown
Contributor Author

Thanks for PR, +1 on the fix.

It looks like there is no dedicated ack latency sensor in ShareFetchMetricsManager today. If ack latency is useful to track, this may be worth a follow-up JIRA rather than reusing fetch metrics.

Also, consider adding a test asserting that fetch-rate/fetch-total do not change after a successful ShareAcknowledge response, to guard against regression.

@nileshkumar3 thanks for the review. IMO this was copy pasted by mistake and we ended up with this duplicate in ack method. We might have other metrics in these round trip situations. If you still feel it's worth adding, I will.

@muralibasani
Copy link
Copy Markdown
Contributor Author

Thanks for PR, +1 on the fix.

It looks like there is no dedicated ack latency sensor in ShareFetchMetricsManager today. If ack latency is useful to track, this may be worth a follow-up JIRA rather than reusing fetch metrics.

Also, consider adding a test asserting that fetch-rate/fetch-total do not change after a successful ShareAcknowledge response, to guard against regression.

Reg the follow-up jira, I think it makes sense. acknowledge-latency wdyt @chia7712 ?

final ShareFetchResponse response = (ShareFetchResponse) resp.responseBody();
final ShareSessionHandler handler = sessionHandler(fetchTarget.id());

metricsManager.recordLatency(resp.destination(), resp.requestLatencyMs());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move it to finally ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving it to finally will record the metric even if above lines 817 or 818 fail (while getting response and handler), and we don't want that right.

Copy link
Copy Markdown
Contributor

@nileshkumar3 nileshkumar3 May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the JIRA intent of capturing top-level errors too, IMO finally is a cleaner fit. Curious what @chia7712 / @apoorvmittal10 think.

@nileshkumar3
Copy link
Copy Markdown
Contributor

Thanks for PR, +1 on the fix.
It looks like there is no dedicated ack latency sensor in ShareFetchMetricsManager today. If ack latency is useful to track, this may be worth a follow-up JIRA rather than reusing fetch metrics.
Also, consider adding a test asserting that fetch-rate/fetch-total do not change after a successful ShareAcknowledge response, to guard against regression.

@nileshkumar3 thanks for the review. IMO this was copy pasted by mistake and we ended up with this duplicate in ack method. We might have other metrics in these round trip situations. If you still feel it's worth adding, I will.

IMO, adding a regression test for this issue worths, other metrics may be a follow up.

final FetchResponse response = (FetchResponse) resp.responseBody();
final FetchSessionHandler handler = sessionHandler(fetchTarget.id());

metricsManager.recordLatency(resp.destination(), resp.requestLatencyMs());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving it to finally will record the metric even if above lines 155 or 156 fail (while getting response and handler), and we don't want that right., same as above

metrics = new Metrics(metricConfig, time);
shareFetchMetricsRegistry = new ShareFetchMetricsRegistry(metricConfig.tags().keySet(), "consumer-share" + groupId);
metricsManager = new ShareFetchMetricsManager(metrics, shareFetchMetricsRegistry);
metricsManager = spy(new ShareFetchMetricsManager(metrics, shareFetchMetricsRegistry));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using spy here because the existing tests need the real metric registry to work

final FetchResponse response = (FetchResponse) resp.responseBody();
final FetchSessionHandler handler = sessionHandler(fetchTarget.id());

metricsManager.recordLatency(resp.destination(), resp.requestLatencyMs());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line should fix the gap for the Async and Classic consumer I expect.

We would just need to add tests to validate it (same as the one added here for the share consumer, but using an error like FETCH_SESSION_TOPIC_ID_ERROR maybe), added to FetcherTest for the classic, and in FetchRequestManagerTest for the async.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants