Scenario:
I am running a Spark Scala job on AWS EMR. The job dumps some metadata unique to that application. For dumping, I write to the location "s3://bucket/key/<APPLICATION_ID>", where the application id is
val APPLICATION_ID: String = getSparkSession.sparkContext.getConf.getAppId
Is there a way to write to an s3 location like "s3://bucket/key/<emr_cluster_id>_<emr_step_id>" instead? How can I get the cluster id and step id from inside the Spark Scala application?
Writing this way would help me debug: from the output path alone I could reach the right cluster and its logs.
Is there any way other than reading "/mnt/var/lib/info/job-flow.json"?
PS: I am new to Spark, Scala and EMR. Apologies in advance if this is an obvious question.
With PySpark on EMR, EMR_CLUSTER_ID and EMR_STEP_ID are available as environment variables (confirmed on emr-5.30.1).
They can be used in code as follows:
import os
emr_cluster_id = os.environ.get('EMR_CLUSTER_ID')
emr_step_id = os.environ.get('EMR_STEP_ID')
I can't test it, but the following analogous code should work in Scala.
val emr_cluster_id = sys.env.get("EMR_CLUSTER_ID")
val emr_step_id = sys.env.get("EMR_STEP_ID")
Since sys.env is simply a Map[String, String], its get method returns an Option[String], which doesn't fail if these environment variables don't exist. If you would rather raise an exception for a missing variable, use sys.env("EMR_x_ID") instead: the apply method throws a NoSuchElementException when the key is absent.
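Putting the pieces together, a minimal sketch of the output-path logic (the function name and the fallback to the application id are my own choices, not part of the original answer) could look like this in Python:

```python
import os

def make_output_prefix(bucket, key, app_id):
    """Build s3://<bucket>/<key>/<cluster_id>_<step_id>, falling back to
    the Spark application id when the EMR variables are absent
    (e.g. when running outside EMR)."""
    cluster_id = os.environ.get('EMR_CLUSTER_ID')
    step_id = os.environ.get('EMR_STEP_ID')
    if cluster_id and step_id:
        suffix = f"{cluster_id}_{step_id}"
    else:
        suffix = app_id
    return f"s3://{bucket}/{key}/{suffix}"
```

The fallback keeps the job usable in local testing, where the EMR variables are not set.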
The EMR_CLUSTER_ID and EMR_STEP_ID variables are visible in the Spark History Server UI under the Environment tab, along with other variables that may be of interest.
I was having the same problem recently, getting the cluster id programmatically. I ended up using the listClusters() method of the EMR client. You can use the AWS SDK for Java, or a Scala wrapper on top of it, to call this method.
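For illustration, here is a sketch of the same idea using boto3 (the Python SDK) rather than the Java SDK; the helper function and the lookup-by-name approach are my own assumptions, not part of the original answer. The pure lookup is separated from the API call so it can be tested without an AWS account:

```python
def find_cluster_id(list_clusters_response, cluster_name):
    """Return the Id of the first cluster whose Name matches, or None.
    Expects a dict shaped like the EMR ListClusters API response."""
    for cluster in list_clusters_response.get('Clusters', []):
        if cluster.get('Name') == cluster_name:
            return cluster.get('Id')
    return None

# On a machine with AWS credentials, the response would come from boto3
# (not runnable here without an AWS account):
#   import boto3
#   emr = boto3.client('emr')
#   resp = emr.list_clusters(ClusterStates=['RUNNING', 'WAITING'])
#   cluster_id = find_cluster_id(resp, 'my-cluster-name')
```

Note that on EMR itself the environment variables above are simpler; the SDK route is mainly useful when you need the cluster id from outside the cluster.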