AWS Step Functions: timed-out Fargate task not automatically killed

I have an AWS Step Functions state machine that is configured to run a Fargate task, wait for it to complete, and then do some other work. The Fargate task is a long-running process that can potentially get stuck during execution. To guard against this, I have configured the TimeoutSeconds field on the Task state:

StartAt: FargateWorker
States:
  FargateWorker:
    Type: Task
    Resource: arn:aws:states:::ecs:runTask.waitForTaskToken
    InputPath: $
    ResultPath: $.workerResult
    OutputPath: $
    TimeoutSeconds: 3
    Parameters:
      Cluster: "#{EcsCluster}"
      TaskDefinition: "#{EcsTaskDefinition}"
      LaunchType: FARGATE
      EnableExecuteCommand: true
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - xxx
            - yyy
            - zzz
          AssignPublicIp: DISABLED
      Overrides:
        ContainerOverrides:
          - Name: container-${env:STACK_NAME}
            Environment:
              - Name: TASK_TOKEN
                "Value.$": $$.Task.Token
    Catch:
      - ErrorEquals: ["States.ALL"]
        Next: CatchAllFallback
    Next: Done

I can see that the state machine correctly moves to the CatchAllFallback state once TimeoutSeconds has elapsed, but the Fargate container keeps running: the state machine doesn't kill it. I need the container to be killed when the timeout triggers, so that I don't end up with a lot of zombie containers running until someone intervenes manually. Can AWS handle this automatically in some way? Or is there any other solution?

One way to handle it would be to catch the timeout specifically and run a step that stops the Fargate task, like so:

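// Kill Task Lambda. Based on the ECS stopTask example in the AWS SDK for JavaScript (v2) docs.
const AWS = require('aws-sdk');
const ecs = new AWS.ECS();

var params = {
  task: 'STRING_VALUE', /* required: ID or full ARN of the task to stop */
  cluster: 'STRING_VALUE', /* short name or ARN of the cluster hosting the task */
  reason: 'STRING_VALUE' /* optional message recorded with the stopped task */
};
ecs.stopTask(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else     console.log(data);           // successful response
});

# State Machine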
StartAt: FargateWorker
States:
  FargateWorker:
    Type: Task
    Resource: arn:aws:states:::ecs:runTask.waitForTaskToken
    InputPath: $
    ResultPath: $.workerResult
    OutputPath: $
    TimeoutSeconds: 3
    Parameters:
      Cluster: "#{EcsCluster}"
      TaskDefinition: "#{EcsTaskDefinition}"
      LaunchType: FARGATE
      EnableExecuteCommand: true
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - xxx
            - yyy
            - zzz
          AssignPublicIp: DISABLED
      Overrides:
        ContainerOverrides:
          - Name: container-${env:STACK_NAME}
            Environment:
              - Name: TASK_TOKEN
                "Value.$": $$.Task.Token
    Catch:
      - ErrorEquals: ["States.Timeout"]
        Next: StopTimedOutTask
    Next: Done

  StopTimedOutTask:
    Type: Task
    Resource:
      Fn::GetAtt:
        - initializer
        - Arn
    ResultPath: $.filesInfo
    Next: ArchiveTransformAndSave
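
The Lambda behind the StopTimedOutTask step could look roughly like the sketch below. It is only an illustration under assumptions: the input fields `taskArn`, `cluster` and `reason`, and the helper names `buildStopTaskParams` and `makeHandler`, are hypothetical and not part of the original answer. How the ARN of the running task reaches this step is not shown here, so the sketch simply rejects input that lacks it. The ECS client is passed in so the logic can be tried locally without AWS access.

```javascript
// Hypothetical helper: turns the step input into an ECS stopTask request.
// The field names taskArn, cluster and reason are assumptions of this sketch.
function buildStopTaskParams(event) {
  if (!event || !event.taskArn) {
    throw new Error('taskArn is required to stop a task');
  }
  return {
    task: event.taskArn,
    cluster: event.cluster,
    reason: event.reason || 'Stopped after Step Functions timeout'
  };
}

// Builds a Lambda-style handler around an injected ECS client (AWS SDK v2 style,
// i.e. an object whose stopTask(params) returns a request with .promise()).
function makeHandler(ecs) {
  return async function handler(event) {
    const params = buildStopTaskParams(event);
    const data = await ecs.stopTask(params).promise();
    // Return only the ARN of the stopped task to keep the state output small.
    return { stoppedTaskArn: data.task ? data.task.taskArn : undefined };
  };
}

// In the deployed Lambda the real client would be wired in, for example:
//   const AWS = require('aws-sdk');
//   exports.handler = makeHandler(new AWS.ECS());
```

Keeping the parameter building separate from the SDK call makes it easy to check the request shape before deploying, and the default `reason` makes timed-out tasks easy to spot among the cluster's stopped tasks.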
