Netflix Conductor HTTP tasks stuck in the scheduled state for a long time

We have Netflix Conductor deployed on GCP, backed by a well-provisioned Postgres database for persistence.

Whenever more than 3k workflows start executing in parallel (each workflow has about 4 HTTP tasks), the time it takes for an HTTP task to start executing grows larger and larger.

The task simply sits in the scheduled state, and under higher loads it can stay there for many minutes.

We checked the workload metrics for the Conductor servers and the Postgres DB, and they are far from reaching their resource limits.

We thought about using isolation groups for these HTTP tasks, but that would not help, since 80% of all executed tasks are these HTTP tasks, and we don't want any of them stuck in the scheduled state.

Which configuration, settings, or setup should I change in order to keep HTTP tasks from getting stuck in the scheduled state?

Thanks

Are some of your HTTP tasks long-running? Those tasks might be occupying all of your available workers, leaving the faster tasks waiting in a queue.
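To see why a few slow tasks can hold up many fast ones, here is a rough Python sketch of a plain fixed-size worker pool. It is not Conductor's actual scheduler; the pool size, the task durations, and the helper names (`simulate_waits`, `worst_fast_wait`) are all made up for illustration. It compares the worst queue wait of the fast tasks when they share workers with slow tasks against the case where the slow tasks get their own worker.

```python
import heapq


def simulate_waits(durations, workers):
    """Queue wait per task: all tasks arrive at t=0, served FIFO by a fixed pool."""
    free_at = [0.0] * workers  # min-heap: time at which each worker is next free
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker takes the task
        waits.append(start)             # the task sat in the queue from t=0 to `start`
        heapq.heappush(free_at, start + duration)
    return waits


def worst_fast_wait(total_workers=4, slow_workers=1, n=100, slow=30.0, fast=0.1):
    """Return (shared, isolated) worst-case queue wait of the fast tasks."""
    # Hypothetical mix: every fifth task is slow, the rest are fast.
    durations = [slow if i % 5 == 0 else fast for i in range(n)]

    # Shared pool: slow and fast tasks compete for the same workers.
    shared = simulate_waits(durations, total_workers)
    shared_worst = max(w for w, d in zip(shared, durations) if d == fast)

    # Isolated: slow tasks get `slow_workers`, fast tasks keep the remainder.
    fast_only = [d for d in durations if d == fast]
    isolated_worst = max(simulate_waits(fast_only, total_workers - slow_workers))
    return shared_worst, isolated_worst


if __name__ == "__main__":
    shared, isolated = worst_fast_wait()
    print(f"worst fast-task wait, shared pool:   {shared:.1f}s")
    print(f"worst fast-task wait, isolated pool: {isolated:.1f}s")
```

In this toy model the fast tasks wait far longer in the shared pool, even though the fast pool has fewer workers once the slow tasks are split off. Whether the same thing is happening in your deployment depends on how long your HTTP calls actually take.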

You might consider isolation groups for these longer HTTP tasks, so that the fast tasks can run through the regular HTTP workers:

https://conductor.netflix.com/configuration/isolationgroups.html
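As a rough illustration of what that looks like, the page linked above describes setting `isolationGroupId` (and optionally `executionNameSpace`) on the task definition. The Python sketch below only builds such a definition as JSON; the task name `slow_http_call` and the group name `slowHttp` are hypothetical, and a real task definition will need its other usual fields too (retries, timeouts, and so on), so please check it against the docs for your Conductor version.

```python
import json


def isolated_task_def(name, isolation_group_id, execution_namespace=None):
    """Build a minimal task definition dict that carries an isolation group.

    Only the isolation-related fields are shown; other task definition
    fields are intentionally left out of this sketch.
    """
    task_def = {
        "name": name,
        "isolationGroupId": isolation_group_id,
    }
    if execution_namespace is not None:
        task_def["executionNameSpace"] = execution_namespace
    return task_def


if __name__ == "__main__":
    # Hypothetical names, chosen only for this example.
    slow_def = isolated_task_def("slow_http_call", "slowHttp")
    print(json.dumps([slow_def], indent=2))
```

The fast HTTP tasks would keep their existing definitions without an isolation group, so they continue to go through the regular HTTP workers.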
