Run AWS EMR Cluster Using Step Functions

Question

I am very new to AWS Step Functions and AWS Lambda Functions and could really use some help getting an EMR Cluster running through Step Functions. A sample of my current State Machine structure is shown by the following code

{
  "Comment": "This is a test for running the structure of the CustomCreate job.",
  "StartAt": "PreStep",
  "States": {
    "PreStep": {
      "Comment": "Check that all the necessary files exist before running the job.",
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:XXXXXXXXXX:function:CustomCreate-PreStep-Function",
      "Next": "Run Job Choice"
    },
    "Run Job Choice": {
      "Comment": "This step chooses whether or not to go forward with running the main job.",
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.FoundNecessaryFiles",
          "BooleanEquals": true,
          "Next": "Spin Up Cluster"
        },
        {
          "Variable": "$.FoundNecessaryFiles",
          "BooleanEquals": false,
          "Next": "Do Not Run Job"
        }
      ]
    },
    "Do Not Run Job": {
      "Comment": "This step triggers if the PreStep fails and the job should not run.",
      "Type": "Fail",
      "Cause": "PreStep unsuccessful"
    },
    "Spin Up Cluster": {
      "Comment": "Spins up the EMR Cluster.",
      "Type": "Pass",
      "Next": "Update Env"
    },
    "Update Env": {
      "Comment": "Update the environment variables in the EMR Cluster.",
      "Type": "Pass",
      "Next": "Run Job"
    },
    "Run Job": {
      "Comment": "Add steps to the EMR Cluster.",
      "Type": "Pass",
      "End": true
    }
  }
}

Which is shown by the following workflow diagram

The PreStep and Run Job Choice tasks use a simple Lambda Function to check that the files necessary to run this job exist on my S3 Bucket, then go to spin up the cluster provided that the necessary files are found. These tasks are working properly.

What I am not sure about is how to handle the EMR Cluster related steps.

In my current structure, the first task is to spin up an EMR Cluster. this could be done through directly using the Step Function JSON, or preferably, using a JSON Cluster Config file (titled EMR-cluster-setup.json ) I have located on my S3 Bucket.

My next task is to update the EMR Cluster environment variables. I have a .sh script located on my S3 Bucket that can do this. I also have a JSON file (titled EMR-RUN-Script.json ) located on my S3 Bucket that will add a first step to the EMR Cluster that will run and source the .sh script. I just need to run that JSON file from within the EMR Cluster, which I do not know how to do using the Step Functions. The code for EMR-RUN-SCRIPT.json is shown below

[
    {
        "Name": "EMR-RUN-SCRIPT",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [
                "s3://PATH/TO/env_configs.sh"
            ]
        }
    }
]

My third task is to add a step that contains a spark-submit command to the EMR Cluster. This command is described in a JSON config file (titled EMR-RUN-STEP.json ) located on my S3 Bucket that can be uploaded to the EMR Cluster in a similar manner to uploading the environment configs file in the previous step. The code for EMR-RUN-STEP.json is shown below

[
    {
        "Name": "EMR-RUN-STEP",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "bash", "-c",
                "source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
            ]
        }
    }
]

Finally, I want to have a task that makes sure the EMR Cluster terminates after it completes its run.

I know there may be a lot involved within this question, but I would greatly appreciate any assistance with any of the issues described above. Whether it be following the structure I outlined above, or if you know of another solution, I am open to any form of help. Thank you in advance.

Answer 1

You need a terminate cluster step, as documentation states: https://docs.aws.amazon.com/step-functions/latest/dg/connect-emr.html

createCluster uses the same request syntax as runJobFlow, except for the following:
The field Instances.KeepJobFlowAliveWhenNoSteps is mandatory, 
and must have the Boolean value TRUE.

So, you need a step to do this for you: terminateCluster.sync - for me this is preferable over the simple terminateCluster as it waits for the cluster to actually terminate and you can handle any hangs here - you'll be using Standard step functions so the bit of extra time will not be billed

Shuts down a cluster (job flow).

terminateJobFlows   The same as terminateCluster, but waits for the cluster to terminate.

ps.: if you are using termination protection you'll need an extra step to turn if off before you can terminate your cluster;)

Answer 2

'KeepJobFlowAliveWhenNoSteps': False

add the above configurations to emr cluster creation script. it will auto terminate emr clusters when all the steps are completed emr boto3 config

Run AWS EMR Cluster Using Step Functions

Question

2 answers

solution1
0 2020-05-15 09:54:09

solution2
-1 2019-11-08 15:45:54

Run AWS EMR Cluster Using Step Functions

Question

2 answers

solution1 0 2020-05-15 09:54:09

solution2 -1 2019-11-08 15:45:54

solution1
0 2020-05-15 09:54:09

solution2
-1 2019-11-08 15:45:54