Get instance id for desired control-queue(s)#1069

Open
pasaini-microsoft wants to merge 11 commits into
Azure:main from
pasaini-microsoft:users/pasaini/main/InstanceIdForSelectedControlQueue

pasaini-microsoft commented Apr 19, 2024

Motivation

#1079

Issue: There is no way to target an orchestrator instance at a desired control-queue.

  • We have been seeing DTF orchestrations get stuck at random. Because customer load in our service is irregular, it was hard to tell upfront whether an orchestration would be processed or would get stuck.
  • Customers often reached out with incidents complaining that their requests had not completed for a long time.
  • To observe the health of each queue, we needed to be able to target one orchestration instance at each desired control-queue.

Motivation:

  • Reduce the time to detect (TTD) that an orchestration is stuck or waiting forever in a control-queue, irrespective of the cause.

Issue: There is no way to direct load to lightly loaded control-queues.

  • We have faced a few situations where some control-queues were overwhelmed with orchestration instances while others were processing almost nothing.

Motivation:

  • Target new orchestration instances at a set of control-queues that are not heavily loaded.

Proposal

An API to generate an instance id for a set of control-queues.

  • This API receives a set of control-queues and a prefix for the instance id.
  • Implementation detail: allow a special form of instance id that ends with the delimiter '!' followed by an unsigned integer suffix, and explicitly use that value to route to a control-queue (say, suffixNumber % partitionCount). If this pattern is not used, routing falls back to the current default, hash(instance-id) % partition-count.
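
The routing rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class and method names (`PartitionPlacementSketch`, `GetPartitionIndex`, `CreateInstanceId`) are hypothetical, and the FNV-1a function is only a stand-in for whatever hash the library uses on the default path.

```csharp
using System;
using System.Globalization;
using System.Text;

// Illustrative only: names and the hash are assumptions, not the DurableTask.AzureStorage API.
public static class PartitionPlacementSketch
{
    // Stand-in 32-bit FNV-1a hash so the fallback path is deterministic in this sketch.
    private static uint StandInHash(string value)
    {
        uint hash = 2166136261u;
        foreach (byte b in Encoding.UTF8.GetBytes(value))
        {
            unchecked
            {
                hash = (hash ^ b) * 16777619u;
            }
        }
        return hash;
    }

    // Resolves the control-queue index for an instance id.
    // With explicit placement enabled, an id ending in "!nnn" (digits only) routes to nnn % partitionCount.
    // Otherwise the index comes from hash(instanceId) % partitionCount.
    public static int GetPartitionIndex(string instanceId, int partitionCount, bool enableExplicitPlacement)
    {
        if (enableExplicitPlacement)
        {
            int separator = instanceId.LastIndexOf('!');
            if (separator >= 0 &&
                uint.TryParse(instanceId.Substring(separator + 1), NumberStyles.None,
                              CultureInfo.InvariantCulture, out uint suffix))
            {
                return (int)(suffix % (uint)partitionCount);
            }
        }

        return (int)(StandInHash(instanceId) % (uint)partitionCount);
    }

    // Builds an instance id from a prefix that targets one of the desired control-queue indices.
    public static string CreateInstanceId(string prefix, int[] desiredQueueIndices, Random random)
    {
        int queueIndex = desiredQueueIndices[random.Next(desiredQueueIndices.Length)];
        return prefix + "!" + queueIndex.ToString(CultureInfo.InvariantCulture);
    }
}
```

For example, with four partitions, `GetPartitionIndex("order!6", 4, true)` resolves to queue 2 because 6 % 4 is 2, while the same id with the flag disabled is routed by the hash alone.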

Collaborator

davidmrdavid left a comment

Some thoughts

Comment thread src/DurableTask.AzureStorage/AzureStorageOrchestrationServiceSettings.cs Outdated
Comment thread src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
pasaini-microsoft and others added 2 commits July 9, 2024 21:21
…ettings.cs

Co-authored-by: David Justo <david.justo.1996@gmail.com>
Comment on lines +62 to +69
controlQueueNumberToNameMap = new Dictionary<string, int>();

for (int i = 0; i < partitionCount; i++)
{
var controlQueueName = AzureStorageOrchestrationService.GetControlQueueName(settings.TaskHubName, i);
controlQueueNumberToNameMap[controlQueueName] = i;
}
}
Collaborator

Are we still using this in the new tests? No, right?

Collaborator

davidmrdavid left a comment

Have we tested this for external events as well?

Comment on lines +322 to +326
/// <summary>
/// Whether to allow instanceIDs to use special syntax to land on a specific partition.
/// If enabled, when an instanceID ends with suffix '!nnn', where 'nnn' is an unsigned number, the instance will land on the partition/queue corresponding to that number.
/// </summary>
public bool EnableExplicitPartitionPlacement { get; set; } = false;
Collaborator

Something to consider: it is not safe to change this from false to true (or vice versa) while an orchestrator with the special syntax is in flight. If we do that, any pre-existing messages for that orchestrator may now be considered to be "in the wrong queue".

Let's call this out in the IntelliSense documentation.
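
One way the warning might be surfaced is a `<remarks>` element on the property. The wording below is a hypothetical sketch, not the text that was merged, and `PlacementSettingsSketch` is a placeholder class standing in for the real settings type.

```csharp
// Illustrative only: placeholder class; the remarks wording is an assumption.
public class PlacementSettingsSketch
{
    /// <summary>
    /// Whether to allow instanceIDs to use special syntax to land on a specific partition.
    /// </summary>
    /// <remarks>
    /// Do not change this value while orchestrations whose instanceIDs end in '!nnn' are in flight.
    /// Flipping the flag changes how the partition for those instanceIDs is computed, so messages
    /// that were already enqueued may be treated as being in the wrong queue.
    /// </remarks>
    public bool EnableExplicitPartitionPlacement { get; set; } = false;
}
```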


int placementSeparatorPosition = instanceId.LastIndexOf('!');

// if the instance id ends with !nnn, where nnn is an unsigned number, it indicates explicit partition placement
Collaborator

Can we add a test that documents the behavior if the customer uses an instanceID with multiple '!' characters in it? Say, instanceID "A!1!B!3" should probably map to partition 3, right?

Collaborator

Let's also add a test that checks that instanceID myinstanceID!NotANumber does not trigger any errors / that it correctly ignores the explicit placement logic.

Collaborator

Thank you for adding the first test. I think we're just missing the very last one:

Let's also add a test that checks that instanceID myinstanceID!NotANumber does not trigger any errors / that it correctly ignores the explicit placement logic.
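
The two edge cases raised in this thread could be pinned down with checks along these lines. This is a sketch under assumptions: `TryParsePlacementSuffix` is a hypothetical helper that mirrors the `LastIndexOf('!')` rule quoted above and accepts digits only, and the plain `Expect` helper stands in for the repository's test framework.

```csharp
using System;
using System.Globalization;

// Illustrative only: the helper name and its strict digits-only parsing are assumptions.
public static class PlacementSuffixChecks
{
    // Returns true only when the text after the last '!' is a non-empty run of digits that fits in a uint.
    public static bool TryParsePlacementSuffix(string instanceId, out uint suffix)
    {
        suffix = 0;
        int separator = instanceId.LastIndexOf('!');
        if (separator < 0)
        {
            return false;
        }

        return uint.TryParse(instanceId.Substring(separator + 1), NumberStyles.None,
                             CultureInfo.InvariantCulture, out suffix);
    }

    private static void Expect(bool condition, string message)
    {
        if (!condition)
        {
            throw new InvalidOperationException(message);
        }
    }

    public static void RunAll()
    {
        // Multiple '!' characters: only the text after the last one is considered.
        Expect(TryParsePlacementSuffix("A!1!B!3", out uint last) && last == 3,
               "A!1!B!3 should yield suffix 3");

        // A non-numeric suffix must not throw and must not trigger explicit placement.
        Expect(!TryParsePlacementSuffix("myinstanceID!NotANumber", out _),
               "non-numeric suffix should be ignored");

        // No delimiter, or a delimiter with nothing after it, also falls back to the default.
        Expect(!TryParsePlacementSuffix("myinstanceID", out _), "no delimiter should be ignored");
        Expect(!TryParsePlacementSuffix("myinstanceID!", out _), "empty suffix should be ignored");
    }
}
```

Calling `PlacementSuffixChecks.RunAll()` throws on the first expectation that fails and returns normally otherwise.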