
AWS step function does not add next step to EMR cluster when current step fails

I have set up a state machine in AWS Step Functions that creates an EMR cluster, adds a few EMR steps, and then terminates the cluster. This works fine as long as all the steps run to completion without errors. If a step fails, the next step is not executed, despite my adding a Catch to proceed to it. Whenever a step fails, that step is marked as caught (shown in orange in the graph), but the next step is marked as cancelled.

This is my step function definition, in case it helps:

{
  "StartAt": "MyEMR-SMFlowContainer-beta",
  "States": {
    "MyEMR-SMFlowContainer-beta": {
      "Type": "Parallel",
      "End": true,
      "Branches": [
        {
          "StartAt": "CreateClusterStep-feature-generation-cluster-beta",
          "States": {
            "CreateClusterStep-feature-generation-cluster-beta": {
              "Next": "Step-SuccessfulJobOne",
              "Type": "Task",
              "ResultPath": "$.Cluster.1.CreateClusterTask",
              "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
              "Parameters": {
                "Instances": {
                  "Ec2SubnetIds": [
                    "subnet-*******345fd38423"
                  ],
                  "InstanceCount": 2,
                  "KeepJobFlowAliveWhenNoSteps": true,
                  "MasterInstanceType": "m4.xlarge",
                  "SlaveInstanceType": "m4.xlarge"
                },
                "JobFlowRole": "MyEMR-emrInstance-beta-EMRInstanceRole",
                "Name": "emr-step-fail-handle-test-cluster",
                "ServiceRole": "MyEMR-emr-beta-EMRRole",
                "Applications": [
                  {
                    "Name": "Spark"
                  },
                  {
                    "Name": "Hadoop"
                  }
                ],
                "AutoScalingRole": "MyEMR-beta-FeatureG-CreateClusterStepfeature-NJB2UG1J1EWB",
                "Configurations": [
                  {
                    "Classification": "spark-env",
                    "Configurations": [
                      {
                        "Classification": "export",
                        "Properties": {
                          "PYSPARK_PYTHON": "/usr/bin/python3"
                        }
                      }
                    ]
                  }
                ],
                "LogUri": "s3://MyEMR-beta-feature-createclusterstepfeature-1jpp1wp3dfn04/emr/logs/",
                "ReleaseLabel": "emr-5.32.0",
                "VisibleToAllUsers": true
              }
            },
            "Step-SuccessfulJobOne": {
              "Next": "Step-AlwaysFailingJob",
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "Step-AlwaysFailingJob"
                }
              ],
              "Type": "Task",
              "TimeoutSeconds": 7200,
              "ResultPath": "$.ClusterStep.SuccessfulJobOne.AddSparkTask",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId",
                "Step": {
                  "Name": "SuccessfulJobOne",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "spark-submit",
                      "--deploy-mode",
                      "client",
                      "--master",
                      "yarn",
                      "--conf",
                      "spark.logConf=true",
                      "--class",
                      "com.test.sample.core.EMRJobRunner",
                      "s3://my-****-bucket/jars/77/my-****-bucketBundleJar-1.0.jar",
                      "--JOB_NUMBER",
                      "1",
                      "--JOB_KEY",
                      "SuccessfulJobOne"
                    ]
                  }
                }
              }
            },
            "Step-AlwaysFailingJob": {
              "Next": "Step-SuccessfulJobTwo",
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "Step-SuccessfulJobTwo"
                }
              ],
              "Type": "Task",
              "TimeoutSeconds": 7200,
              "ResultPath": "$.ClusterStep.AlwaysFailingJob.AddSparkTask",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId",
                "Step": {
                  "Name": "AlwaysFailingJob",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "spark-submit",
                      "--deploy-mode",
                      "client",
                      "--master",
                      "yarn",
                      "--conf",
                      "spark.logConf=true",
                      "--class",
                      "com.test.sample.core.EMRJobRunner",
                      "s3://my-****-bucket/jars/77/my-****-bucketBundleJar-1.0.jar",
                      "--JOB_NUMBER",
                      "2",
                      "--JOB_KEY",
                      "AlwaysFailingJob"
                    ]
                  }
                }
              }
            },
            "Step-SuccessfulJobTwo": {
              "Next": "TerminateClusterStep-feature-generation-cluster-beta",
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "TerminateClusterStep-feature-generation-cluster-beta"
                }
              ],
              "Type": "Task",
              "TimeoutSeconds": 7200,
              "ResultPath": "$.ClusterStep.SuccessfulJobTwo.AddSparkTask",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId",
                "Step": {
                  "Name": "DeviceJob",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "spark-submit",
                      "--deploy-mode",
                      "client",
                      "--master",
                      "yarn",
                      "--conf",
                      "spark.logConf=true",
                      "--class",
                      "com.test.sample.core.EMRJobRunner",
                      "s3://my-****-bucket/jars/77/my-****-bucketBundleJar-1.0.jar",
                      "--JOB_NUMBER",
                      "3",
                      "--JOB_KEY",
                      "SuccessfulJobTwo"
                    ]
                  }
                }
              }
            },
            "TerminateClusterStep-feature-generation-cluster-beta": {
              "End": true,
              "Type": "Task",
              "ResultPath": null,
              "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId"
              }
            }
          }
        }
      ]
    }
  },
  "TimeoutSeconds": 43200
}

Can somebody please advise how I can catch a failure in a step, ignore it, and add the next step? Thanks in advance.

The issue was that I was not specifying ResultPath in the Catch properties. Since the default value of ResultPath is $, the catch block overwrote the entire state data with the error output. The next step could then no longer read the cluster information, because it had been overwritten, and was therefore cancelled.
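To illustrate, this is roughly what the state data looks like after such a catch fires with the default ResultPath of $ (a sketch; the exact Cause contents vary by failure). The original input, including $.Cluster.1.CreateClusterTask.ClusterId, is gone, so the next addStep task has nothing to reference:

{
  "Error": "States.TaskFailed",
  "Cause": "..."
}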

      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Step-SuccessfulJobTwo"
        }
      ],

Once I updated the Catch to have a proper ResultPath, it worked as expected.

      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Step-SuccessfulJobTwo",
          "ResultPath": "$.ClusterStep.SuccessfulJobOne.AddSparkTask.Error",
        }
      ],
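
With that ResultPath, the error object is merged into the state data instead of replacing it, so the cluster information survives for the following steps. A sketch of the resulting state data after a caught failure (values are illustrative placeholders):

{
  "Cluster": {
    "1": {
      "CreateClusterTask": {
        "ClusterId": "j-XXXXXXXXXXX"
      }
    }
  },
  "ClusterStep": {
    "SuccessfulJobOne": {
      "AddSparkTask": {
        "Error": {
          "Error": "States.TaskFailed",
          "Cause": "..."
        }
      }
    }
  }
}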
