
Does AWS Step Functions have a timeout feature?

Right now I have an AWS Step Functions state machine that creates, runs, and terminates EMR cluster jobs. I want to add a timeout feature that stops the job and terminates the cluster if the cluster gets stuck or takes too long to run (e.g. an input variable "TIMEOUT_AFTER_X_HOURS": 12 passed into the state machine along with the cluster configs, which automatically stops the job and kills the cluster if it is still running after 12 hours). Does anyone know how to accomplish this?

Unfortunately you can't dynamically specify the timeout for a state, but you can dynamically tell a Wait state how long it should wait. With that said, I would recommend a Parallel state with two branches and a Catch block. The first branch contains a Wait state followed by a Fail state (your timeout). The other branch contains your normal state machine logic followed by a Fail state.

Whenever a branch fails inside a Parallel state, all running states in the other branches are aborted. Luckily you can catch these errors on the Parallel state and redirect execution to another state depending on which branch failed. This is also why the worker branch ends in a Fail state: when the work finishes first, its "failure" stops the timeout branch from waiting any longer. Here's an example of what I mean (change the values in the HardCodedInputs state to control which branch fails).

{
"StartAt": "HardCodedInputs",
"States": {
    "HardCodedInputs": {
        "Type": "Pass",
        "Parameters": {
            "WaitBranchInput": {
                "timeout": 5,
                "Comment": "Change the value of timeout"
            },
            "WorkerBranchInput": {
                "SecondsPath": 3,
                "Comment": "SecondsPath is used for testing purposes to simulate how long the worker will run"
            }
        },
        "Next": "Parallel"
    },
    "Parallel": {
        "Type": "Parallel",
        "End": true,
        "Catch": [{
            "ErrorEquals": ["TimeoutExpired"],
            "ResultPath": "$.ParallelStateOutput",
            "Next": "ExecuteIfTimedOut"
        }, {
            "ErrorEquals": ["WorkerSuccess"],
            "ResultPath": "$.ParallelStateOutput",
            "Next": "ExecuteIfWorkerSuccesfull"
        }],
        "Branches": [{
                "StartAt": "DynamicTimeout",
                "States": {
                    "DynamicTimeout": {
                        "Type": "Wait",
                        "InputPath": "$.WaitBranchInput",
                        "SecondsPath": "$.timeout",
                        "Next": "TimeoutExpired"
                    },
                    "TimeoutExpired": {
                        "Type": "Fail",
                        "Cause": "TimeoutExceeded.",
                        "Error": "TimeoutExpired"
                    }
                }
            },
            {
                "StartAt": "WorkerState",
                "States": {
                    "WorkerState": {
                        "Type": "Wait",
                        "InputPath": "$.WorkerBranchInput",
                        "SecondsPath": "$.SecondsPath",
                        "Next": "WorkerSuccessful"
                    },
                    "WorkerSuccessful": {
                        "Type": "Fail",
                        "Cause": "Throw Worker Success Exception",
                        "Error": "WorkerSuccess"
                    }
                }
            }
        ]
    },
    "ExecuteIfTimedOut": {
        "Type": "Pass",
        "End": true
    },
    "ExecuteIfWorkerSuccesfull": {
        "Type": "Pass",
        "End": true
    }
 }
}

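You can pass the path to an input variable (e.g. "$.TIMEOUT_AFTER_X_HOURS" from the original example) to the TimeoutSecondsPath field of any Task state. This lets you set a step timeout dynamically from the state machine input or from the output of a previous step. Note that the referenced value is interpreted as seconds, so a value given in hours has to be converted first (12 hours is 43200 seconds).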

You can find the official Docs for the TimeoutSecondsPath parameter here: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-task-state.html
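
As a rough sketch of how this could look for the EMR case, the fragment below puts TimeoutSecondsPath on a Task that runs an EMR step and catches States.Timeout so the cluster is terminated either way. The input fields TIMEOUT_SECONDS and ClusterId, the step name, and the S3 path are placeholders assumed for illustration; they are not taken from the question.

```json
{
  "Comment": "Sketch only: TIMEOUT_SECONDS and ClusterId are assumed input fields; the step definition is a placeholder",
  "StartAt": "RunEmrStep",
  "States": {
    "RunEmrStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "Step": {
          "Name": "ExampleStep",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/job.py"]
          }
        }
      },
      "TimeoutSecondsPath": "$.TIMEOUT_SECONDS",
      "ResultPath": "$.StepResult",
      "Catch": [{
        "ErrorEquals": ["States.Timeout"],
        "ResultPath": "$.TimeoutError",
        "Next": "TerminateCluster"
      }],
      "Next": "TerminateCluster"
    },
    "TerminateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
      "Parameters": {
        "ClusterId.$": "$.ClusterId"
      },
      "End": true
    }
  }
}
```

Both ResultPath settings keep the original input intact, so ClusterId is still available to TerminateCluster whether the step finished or timed out. An execution input such as {"ClusterId": "j-EXAMPLE", "TIMEOUT_SECONDS": 43200} would give the step 12 hours.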

I'm facing a similar problem, and I'm thinking the solution will be to create an outer state machine that manages EMR and an inner one that performs the work.

So outer would be:

  • create EMR
  • invoke child state machine with TimeoutSeconds (or TimeoutSecondsPath, to read the value from the input) set on the task per your input variable
  • terminate EMR

And inner would contain:

  • perform EMR work

The task that invokes the inner machine therefore ends either when the inner machine completes successfully or when TimeoutSeconds have elapsed, and in the outer machine you can detect which happened (using a Catch field on that task to catch the States.Timeout error) and act accordingly.
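
A minimal sketch of the outer machine under these assumptions might look like the following. It leaves out the cluster creation step and assumes ClusterId and TIMEOUT_SECONDS are already present in the state; the inner state machine ARN is a made-up placeholder. The .sync:2 form of the startExecution integration is used so that the task waits for the child execution to finish.

```json
{
  "Comment": "Sketch only: the ARN and the input field names are placeholders; cluster creation is omitted",
  "StartAt": "RunInnerMachine",
  "States": {
    "RunInnerMachine": {
      "Type": "Task",
      "Resource": "arn:aws:states:::states:startExecution.sync:2",
      "Parameters": {
        "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:InnerEmrWork",
        "Input": {
          "ClusterId.$": "$.ClusterId",
          "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
        }
      },
      "TimeoutSecondsPath": "$.TIMEOUT_SECONDS",
      "ResultPath": "$.InnerResult",
      "Catch": [{
        "ErrorEquals": ["States.Timeout"],
        "ResultPath": "$.TimeoutError",
        "Next": "TerminateAfterTimeout"
      }],
      "Next": "TerminateAfterSuccess"
    },
    "TerminateAfterSuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
      "Parameters": {
        "ClusterId.$": "$.ClusterId"
      },
      "End": true
    },
    "TerminateAfterTimeout": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
      "Parameters": {
        "ClusterId.$": "$.ClusterId"
      },
      "Next": "TimedOut"
    },
    "TimedOut": {
      "Type": "Fail",
      "Error": "TimeoutExpired",
      "Cause": "Inner state machine did not finish within TIMEOUT_SECONDS"
    }
  }
}
```

Splitting the terminate step in two keeps the success and timeout paths distinguishable: the cluster is terminated in both cases, but only the timeout path ends the outer execution as failed.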
