Scenario:
I am running a Spark Scala job on AWS EMR. The job dumps some metadata unique to that application. For dumping, I write to the location "s3://bucket/key/<APPLICATION_ID>", where the application id is
val APPLICATION_ID: String = getSparkSession.sparkContext.getConf.getAppId
Is there a way to write to an s3 location like "s3://bucket/key/<emr_cluster_id>_<emr_step_id>" instead? How can I get the cluster id and step id from inside the Spark Scala application?
Writing this way would help me debug: from the output path alone I could reach the right cluster and its logs.
Is there any way other than reading "/mnt/var/lib/info/job-flow.json"?
PS: I am new to Spark, Scala and EMR. Apologies in advance if this is an obvious question.
With PySpark on EMR, EMR_CLUSTER_ID and EMR_STEP_ID are available as environment variables (confirmed on emr-5.30.1).
They can be used in code as follows:
import os
emr_cluster_id = os.environ.get('EMR_CLUSTER_ID')
emr_step_id = os.environ.get('EMR_STEP_ID')
I can't test it, but the following analogous code should work in Scala.
val emr_cluster_id = sys.env.get("EMR_CLUSTER_ID")
val emr_step_id = sys.env.get("EMR_STEP_ID")
Since sys.env is simply a Map[String, String], its get method returns an Option[String], which doesn't fail if these environment variables don't exist. If you would rather raise an exception for a missing variable, use sys.env("EMR_x_ID") instead: the apply method throws a NoSuchElementException when the key is absent.
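Putting the pieces together, a minimal sketch of the output-path logic (the function name and the fallback to the application id are my own choices, not part of the original answer) could look like this in Python:

```python
import os

def make_output_prefix(bucket, key, app_id):
    """Build s3://<bucket>/<key>/<cluster_id>_<step_id>, falling back to
    the Spark application id when the EMR variables are absent
    (e.g. when running outside EMR)."""
    cluster_id = os.environ.get('EMR_CLUSTER_ID')
    step_id = os.environ.get('EMR_STEP_ID')
    if cluster_id and step_id:
        suffix = f"{cluster_id}_{step_id}"
    else:
        suffix = app_id
    return f"s3://{bucket}/{key}/{suffix}"
```

The fallback keeps the job usable in local testing, where the EMR variables are not set.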
The EMR_CLUSTER_ID and EMR_STEP_ID variables are visible in the Spark History Server UI under the Environment tab, along with other variables that may be of interest.
I was having the same problem recently, getting the cluster id programmatically. I ended up using the listClusters() method of the EMR client. You can use the AWS SDK for Java, or a Scala wrapper on top of it, to call this method.
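For illustration, here is a sketch of the same idea using boto3 (the Python SDK) rather than the Java SDK; the helper function and the lookup-by-name approach are my own assumptions, not part of the original answer. The pure lookup is separated from the API call so it can be tested without an AWS account:

```python
def find_cluster_id(list_clusters_response, cluster_name):
    """Return the Id of the first cluster whose Name matches, or None.
    Expects a dict shaped like the EMR ListClusters API response."""
    for cluster in list_clusters_response.get('Clusters', []):
        if cluster.get('Name') == cluster_name:
            return cluster.get('Id')
    return None

# On a machine with AWS credentials, the response would come from boto3
# (not runnable here without an AWS account):
#   import boto3
#   emr = boto3.client('emr')
#   resp = emr.list_clusters(ClusterStates=['RUNNING', 'WAITING'])
#   cluster_id = find_cluster_id(resp, 'my-cluster-name')
```

Note that on EMR itself the environment variables above are simpler; the SDK route is mainly useful when you need the cluster id from outside the cluster.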