Running an AWS EMR cluster with Step Functions

Question

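I am very new to AWS Step Functions and AWS Lambda Functions and could really use some help getting an EMR cluster running through Step Functions. A sample of my current state machine structure is shown in the following code: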

{
  "Comment": "This is a test for running the structure of the CustomCreate job.",
  "StartAt": "PreStep",
  "States": {
    "PreStep": {
      "Comment": "Check that all the necessary files exist before running the job.",
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:XXXXXXXXXX:function:CustomCreate-PreStep-Function",
      "Next": "Run Job Choice"
    },
    "Run Job Choice": {
      "Comment": "This step chooses whether or not to go forward with running the main job.",
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.FoundNecessaryFiles",
          "BooleanEquals": true,
          "Next": "Spin Up Cluster"
        },
        {
          "Variable": "$.FoundNecessaryFiles",
          "BooleanEquals": false,
          "Next": "Do Not Run Job"
        }
      ]
    },
    "Do Not Run Job": {
      "Comment": "This step triggers if the PreStep fails and the job should not run.",
      "Type": "Fail",
      "Cause": "PreStep unsuccessful"
    },
    "Spin Up Cluster": {
      "Comment": "Spins up the EMR Cluster.",
      "Type": "Pass",
      "Next": "Update Env"
    },
    "Update Env": {
      "Comment": "Update the environment variables in the EMR Cluster.",
      "Type": "Pass",
      "Next": "Run Job"
    },
    "Run Job": {
      "Comment": "Add steps to the EMR Cluster.",
      "Type": "Pass",
      "End": true
    }
  }
}

[Workflow diagram of the state machine above]

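The PreStep and Run Job Choice tasks use a simple Lambda Function to check that the files needed to run this job exist on my S3 bucket, and then go on to spin up the cluster if the necessary files are found. These tasks are working properly.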

What I am not sure about is how to handle the EMR cluster related steps.

In my current structure, the first task is to spin up an EMR cluster. This could be done directly in the Step Function JSON or, preferably, using a JSON cluster config file (titled EMR-cluster-setup.json) that I have on my S3 bucket.

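My next task is to update the EMR cluster environment variables. I have a .sh script on my S3 bucket that does this. I also have a JSON file on my S3 bucket (titled EMR-RUN-Script.json) that adds a first step to the EMR cluster, which runs and sources the .sh script. I just need to run that JSON file against the EMR cluster, and I do not know how to do this using Step Functions. The code for EMR-RUN-SCRIPT.json is shown below: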

[
    {
        "Name": "EMR-RUN-SCRIPT",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [
                "s3://PATH/TO/env_configs.sh"
            ]
        }
    }
]

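My third task is to add a step containing a spark-submit command to the EMR cluster. This command is described in a JSON config file on my S3 bucket (titled EMR-RUN-STEP.json), which can be uploaded to the EMR cluster in the same way as the environment config file in the previous step. The code for EMR-RUN-STEP.json is shown below: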

[
    {
        "Name": "EMR-RUN-STEP",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "bash", "-c",
                "source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
            ]
        }
    }
]

Finally, I would like a task that makes sure the EMR cluster terminates after it finishes running.

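I know this question covers a lot of ground, but I would greatly appreciate help with any of the points above. Whether it follows the structure I outlined or is a different solution you know of, I am open to any form of help. Thank you in advance.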

Answer 1

You need a terminate-cluster step, as the documentation states: https://docs.aws.amazon.com/step-functions/latest/dg/connect-emr.html

createCluster uses the same request syntax as runJobFlow, except for the following:
The field Instances.KeepJobFlowAliveWhenNoSteps is mandatory, 
and must have the Boolean value TRUE.
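As a rough sketch of what the quoted requirement means in practice, the question's "Spin Up Cluster" Pass state could be replaced by a Task state along the following lines. The cluster name, release label, IAM role names, and instance fleet settings are placeholders rather than values from the question, and `$.CreateClusterResult` is just an illustrative place to store the output. The settings are inlined here; as far as I know a Task state cannot read EMR-cluster-setup.json from S3 by itself, so loading that file would probably need something like a Lambda step that returns its contents.

```json
{
  "Spin Up Cluster": {
    "Comment": "Sketch only: name, release label, roles and instance settings are placeholders.",
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
    "Parameters": {
      "Name": "CustomCreate-Cluster",
      "ReleaseLabel": "emr-5.29.0",
      "Applications": [{ "Name": "Spark" }],
      "ServiceRole": "EMR_DefaultRole",
      "JobFlowRole": "EMR_EC2_DefaultRole",
      "VisibleToAllUsers": true,
      "Instances": {
        "KeepJobFlowAliveWhenNoSteps": true,
        "InstanceFleets": [
          {
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{ "InstanceType": "m5.xlarge" }]
          },
          {
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,
            "InstanceTypeConfigs": [{ "InstanceType": "m5.xlarge" }]
          }
        ]
      }
    },
    "ResultPath": "$.CreateClusterResult",
    "Next": "Update Env"
  }
}
```

Because `KeepJobFlowAliveWhenNoSteps` has to be `true` here, the cluster will not shut itself down once its steps finish, which is why an explicit terminate step is needed.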

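So you need a step that terminates the cluster for you: terminateCluster.sync. I prefer this to plain terminateCluster because it waits for the cluster to actually terminate, so you can handle any hangs at this point. You will be using Standard Step Functions, so the extra waiting time is not billed.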

terminateCluster        terminateJobFlows   Shuts down a cluster (job flow).
terminateCluster.sync   terminateJobFlows   The same as terminateCluster, but waits for the cluster to terminate.
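A minimal sketch of how the question's two steps and the terminate step might look as states, assuming an earlier state has stored the cluster ID at `$.CreateClusterResult.ClusterId` (that path, the `ResultPath` names, and the "Terminate Cluster" state name are illustrative, not from the question). The step definitions are copied from EMR-RUN-SCRIPT.json and EMR-RUN-STEP.json and placed under `Step`. These entries would go inside the `States` object of the state machine:

```json
{
  "Update Env": {
    "Comment": "Runs the env script; step body taken from EMR-RUN-SCRIPT.json.",
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {
      "ClusterId.$": "$.CreateClusterResult.ClusterId",
      "Step": {
        "Name": "EMR-RUN-SCRIPT",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
          "Args": ["s3://PATH/TO/env_configs.sh"]
        }
      }
    },
    "ResultPath": "$.UpdateEnvResult",
    "Catch": [
      { "ErrorEquals": ["States.ALL"], "ResultPath": "$.Error", "Next": "Terminate Cluster" }
    ],
    "Next": "Run Job"
  },
  "Run Job": {
    "Comment": "Runs spark-submit; step body taken from EMR-RUN-STEP.json.",
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {
      "ClusterId.$": "$.CreateClusterResult.ClusterId",
      "Step": {
        "Name": "EMR-RUN-STEP",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Jar": "command-runner.jar",
          "Args": [
            "bash", "-c",
            "source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
          ]
        }
      }
    },
    "ResultPath": "$.RunJobResult",
    "Catch": [
      { "ErrorEquals": ["States.ALL"], "ResultPath": "$.Error", "Next": "Terminate Cluster" }
    ],
    "Next": "Terminate Cluster"
  },
  "Terminate Cluster": {
    "Comment": "Waits until the cluster has actually terminated.",
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
    "Parameters": {
      "ClusterId.$": "$.CreateClusterResult.ClusterId"
    },
    "End": true
  }
}
```

The `Catch` blocks are a design choice rather than a requirement: they route a failed step to "Terminate Cluster" so the cluster is not left running, and `ResultPath` keeps the cluster ID in the state input while doing so.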

P.S.: if you are using termination protection, you will need an extra step to turn it off before you can terminate the cluster ;)
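That extra step could be sketched as follows, using the `setClusterTerminationProtection` integration listed on the same documentation page. The state names and the `$.CreateClusterResult.ClusterId` path are assumptions for illustration; the state is meant to sit immediately before the terminate state.

```json
{
  "Disable Termination Protection": {
    "Comment": "Sketch only: turns termination protection off so the cluster can be terminated.",
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
    "Parameters": {
      "ClusterId.$": "$.CreateClusterResult.ClusterId",
      "TerminationProtected": false
    },
    "ResultPath": null,
    "Next": "Terminate Cluster"
  }
}
```

Setting `ResultPath` to `null` discards this state's result and passes its input through unchanged, so the cluster ID is still available to the next state.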

Answer 2

'KeepJobFlowAliveWhenNoSteps': False

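Add the setting above to your EMR cluster creation script. Once all steps have completed, the EMR cluster will terminate automatically. See the EMR boto3 config.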
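For illustration, the request that such a creation script passes to boto3's `run_job_flow` might have roughly the following shape, shown here as JSON. The cluster name, release label, role names, and instance groups are placeholders; the two steps are the ones from the question. Note that this applies when the cluster is created by calling RunJobFlow directly (for example from a Lambda function), since the Step Functions `createCluster` integration requires `KeepJobFlowAliveWhenNoSteps` to be `true`.

```json
{
  "Name": "CustomCreate-Cluster",
  "ReleaseLabel": "emr-5.29.0",
  "Applications": [{ "Name": "Spark" }],
  "ServiceRole": "EMR_DefaultRole",
  "JobFlowRole": "EMR_EC2_DefaultRole",
  "Instances": {
    "KeepJobFlowAliveWhenNoSteps": false,
    "InstanceGroups": [
      { "Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1 },
      { "Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2 }
    ]
  },
  "Steps": [
    {
      "Name": "EMR-RUN-SCRIPT",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": ["s3://PATH/TO/env_configs.sh"]
      }
    },
    {
      "Name": "EMR-RUN-STEP",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
          "bash", "-c",
          "source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
        ]
      }
    }
  ]
}
```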
