In Azure Durable Functions, how do I appropriately determine the progression of a large number of parallel activities?

Question

I have written a Durable Function orchestrator function, whose primary job is to fan out an average of 1,000 parallel activities. Since the completion of these activities is something a front-end user would be waiting on, I would like to be able to query the progress while the activities are still running (to show a progress bar on the front end).

Below is a chunk of the current orchestrator code, but I am having doubts about whether it fits the constraints for orchestrator functions (https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-checkpointing-and-replay#orchestrator-code-constraints).

Basically, if the DF framework is replaying the orchestrator up to each await, this feels like an unreasonable number of awaits for it to process:

var replicationTasks = new List<Task<ReplicationOutput>>();
var replicationResults = new List<ReplicationOutput>();
// start up each simulation
for (int i = 0; i < inputs.NumberOfReplications; i++)
{
    var replicationInput = new ReplicationInput();
    var task = context.CallActivityAsync<ReplicationOutput>("SimulationOrchestrator_SimulateReplication", replicationInput);
    replicationTasks.Add(task);
}
// set initial custom status 
var progress = new Progress();
progress.NumberCompleted = 0;
progress.Total = inputs.NumberOfReplications;
progress.TimeStarted = context.CurrentUtcDateTime;
progress.ElapsedTime = context.CurrentUtcDateTime.Subtract(progress.TimeStarted);
context.SetCustomStatus(progress);

// as each task finishes
while (replicationTasks.Any())
{
    Task<ReplicationOutput> nextFinished = await Task.WhenAny(replicationTasks);

    replicationTasks.Remove(nextFinished);
    replicationResults.Add(await nextFinished);

    // update progress object and custom status
    progress.NumberCompleted++;
    progress.ElapsedTime = context.CurrentUtcDateTime.Subtract(progress.TimeStarted);
    context.SetCustomStatus(progress);
}
// aggregate replications together into a single set of results
return new Results(replicationResults);

This doesn't necessarily fail under simple testing conditions, but the orchestrator documentation warns (fairly aggressively) about keeping the history tables clear, avoiding waiting/blocking, etc.

Is there a documented or "best practices" method for accomplishing the goal of having queryable progress? All fan-out/fan-in examples I've seen just use await Task.WhenAll(replicationTasks) to only continue once all tasks are complete, which I don't think would allow incremental progress checks.

Answer 1

You appear to have two questions here:

Excessive replays caused by a large number of actions

Durable Functions are known to degrade in performance when they have to replay a large number of actions. In the .NET runtime for Durable Functions, DFs automatically abort after 100,000 actions (GitHub, StackOverflow). This 100k limit is configurable, but it is the only guideline I've found for how many actions DFs are intended to handle.

I haven't seen anyone online discussing architectures that pare down how many actions each Durable Function is responsible for. I have the same use case you describe (fan-out with limited concurrency), and I've found a few options:

Use the continueAsNew DF API call (context.ContinueAsNew in C#) to periodically restart the orchestrator with a fresh, empty replay history. That call takes a parameter where you can pass along any state that you want the fresh instance of the DF to have (i.e., the rest of the workload). It's the same concept as function recursion, just with DFs. This is a relatively straightforward way to directly address the replay performance problem without introducing new components to your architecture. The cost is that processing periodically stalls: you have to stop spawning new Activities and let the in-flight ones finish before restarting the Orchestrator.
You can limit excessive replays by batching up tasks and having a separate layer of DFs that process batches of tasks. That makes your top-level Orchestrator responsible for 1/n as many actions, where n is the batch size. I feel this is a clumsy solution, and it limits the granularity at which your top-level DF can supervise the fan-out, but it solves the replay problem.
You can use Extended Sessions (the extendedSessionsEnabled setting in host.json) to keep your Durable Functions in memory for a certain amount of time after they invoke an action. If your DF gets activated during that time, it continues execution without a replay, as if it were an ordinary program doing its thing. You'll pay the cost of keeping it running all that time, but if it has a large replay history, it's probably waking up so often that the extra execution cost is negligible.
If you don't need fine-grained control over concurrency and task scheduling, you might instead consider having your Orchestrator push tasks to a Storage Queue and having the queue messages trigger ordinary Functions. Then your DF can go idle as soon as it has handed off the task data to the Queue. If it needs to handle the task output, it can wait to resume processing until another Function sends an event notifying it that all of the tasks have been processed.
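
To make the batching option above more concrete, here is a minimal sketch, not a definitive implementation. It assumes the Durable Functions 2.x in-process C# API, matching the question's code. The function names BatchedSimulation_Parent and BatchedSimulation_Batch, the BatchSize of 50, and the choice to pass the replication count as the orchestrator input are all hypothetical. ReplicationInput and ReplicationOutput are empty placeholders standing in for the question's own classes. The parent awaits one sub-orchestration per batch, so its history grows with the number of batches rather than the number of activities, and its progress is only as fine-grained as one batch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

// Placeholder types standing in for the question's own classes.
public class ReplicationInput { }
public class ReplicationOutput { }

public static class BatchedSimulation
{
    // Assumed value for illustration; tune it for your workload.
    private const int BatchSize = 50;

    // Parent orchestrator: schedules one sub-orchestration per batch and
    // updates the custom status each time a whole batch completes.
    [FunctionName("BatchedSimulation_Parent")]
    public static async Task<List<ReplicationOutput>> RunParent(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        int total = context.GetInput<int>(); // assumed input: number of replications

        var batches = new List<Task<List<ReplicationOutput>>>();
        for (int start = 0; start < total; start += BatchSize)
        {
            int size = Math.Min(BatchSize, total - start);
            batches.Add(context.CallSubOrchestratorAsync<List<ReplicationOutput>>(
                "BatchedSimulation_Batch", size));
        }

        var results = new List<ReplicationOutput>();
        context.SetCustomStatus(new { NumberCompleted = 0, Total = total });

        while (batches.Count > 0)
        {
            Task<List<ReplicationOutput>> finished = await Task.WhenAny(batches);
            batches.Remove(finished);
            results.AddRange(await finished);
            context.SetCustomStatus(new { NumberCompleted = results.Count, Total = total });
        }

        return results;
    }

    // Batch orchestrator: plain fan-out/fan-in over the activities in one batch.
    [FunctionName("BatchedSimulation_Batch")]
    public static async Task<List<ReplicationOutput>> RunBatch(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        int size = context.GetInput<int>();

        List<Task<ReplicationOutput>> tasks = Enumerable.Range(0, size)
            .Select(_ => context.CallActivityAsync<ReplicationOutput>(
                "SimulationOrchestrator_SimulateReplication", new ReplicationInput()))
            .ToList();

        ReplicationOutput[] outputs = await Task.WhenAll(tasks);
        return outputs.ToList();
    }
}
```

With 1,000 replications and a batch size of 50, the parent in this sketch awaits 20 sub-orchestrations instead of 1,000 activities, at the price of a progress bar that moves in steps of 50.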

Querying the progress of your DF

Durable Functions have an explicit mechanism for conveying their progress to interested parties. Alternatively, your Orchestrator could persist its progress somewhere other parties can access it.

Your first option is using the Durable Function's custom status feature. You can periodically update the custom status of your Durable Function to reflect its progress, and clients can then read that status by querying the orchestration instance.
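
As a rough sketch of the querying side, assuming the same Durable Functions 2.x in-process C# API as the question, an HTTP-triggered client function can read the custom status that the orchestrator set with SetCustomStatus. The function name GetSimulationProgress and the route are hypothetical. Note that the built-in status query endpoint of the Durable Functions HTTP API also returns the custom status, so a custom endpoint like this is only needed if you want to control the response shape.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Newtonsoft.Json.Linq;

public static class ProgressApi
{
    // Hypothetical endpoint: GET /api/simulations/{instanceId}/progress
    [FunctionName("GetSimulationProgress")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get",
            Route = "simulations/{instanceId}/progress")] HttpRequest req,
        [DurableClient] IDurableOrchestrationClient client,
        string instanceId)
    {
        // Returns null when no orchestration instance with this ID exists.
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);
        if (status == null)
        {
            return new NotFoundResult();
        }

        // CustomStatus holds whatever object the orchestrator last passed
        // to SetCustomStatus, as a JSON token.
        var body = new JObject
        {
            ["runtimeStatus"] = status.RuntimeStatus.ToString(),
            ["progress"] = status.CustomStatus ?? JValue.CreateNull()
        };

        return new ContentResult
        {
            Content = body.ToString(),
            ContentType = "application/json",
            StatusCode = 200
        };
    }
}
```

The front end would poll this endpoint with the instance ID it received when the orchestration was started, and stop polling once runtimeStatus reports a terminal state such as Completed or Failed.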

Another option is Durable Entities. They're a convenient way for a DF to store data that persists outside the context and lifecycle of the DF, and crucially, the Client Functions inside your DF's Function App can read the Durable Entities that your DF writes to. Wrap an HTTP trigger around that and bam, you've got progress querying.
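
A minimal sketch of the entity approach, again assuming the Durable Functions 2.x in-process C# API. The entity name ProgressCounter, its "add" and "reset" operations, the function name GetProgressCounter, and the route are all hypothetical names chosen for illustration. The entity stores a single integer, the number of completed activities, and the HTTP function reads it back.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class ProgressEntity
{
    // Function-based entity whose state is one integer:
    // the number of activities completed so far.
    [FunctionName("ProgressCounter")]
    public static void Counter([EntityTrigger] IDurableEntityContext ctx)
    {
        switch (ctx.OperationName)
        {
            case "add":
                ctx.SetState(ctx.GetState<int>() + ctx.GetInput<int>());
                break;
            case "reset":
                ctx.SetState(0);
                break;
        }
    }

    // Hypothetical endpoint: GET /api/progress/{key}
    [FunctionName("GetProgressCounter")]
    public static async Task<IActionResult> Get(
        [HttpTrigger(AuthorizationLevel.Function, "get",
            Route = "progress/{key}")] HttpRequest req,
        [DurableClient] IDurableEntityClient client,
        string key)
    {
        var entityId = new EntityId("ProgressCounter", key);
        EntityStateResponse<int> state = await client.ReadEntityStateAsync<int>(entityId);

        if (!state.EntityExists)
        {
            return new NotFoundResult();
        }

        return new OkObjectResult(new { numberCompleted = state.EntityState });
    }
}
```

Inside the orchestrator, each completed task could then be recorded with a one-way signal such as context.SignalEntity(new EntityId("ProgressCounter", context.InstanceId), "add", 1), using the orchestration's instance ID as the entity key so that the front end can look up progress with the same ID.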

Finally, it's common to handle this task by writing progress into any major database and reading back from the DB in the poll HTTP endpoint.

solution1
0 2020-10-19 18:38:45
