
How do I set an environment variable in a YARN Spark job?

I'm attempting to access Accumulo 1.6 from an Apache Spark job (written in Java) by using an AccumuloInputFormat with newAPIHadoopRDD. In order to do this, I have to tell the AccumuloInputFormat where to locate ZooKeeper by calling the setZooKeeperInstance method. This method takes a ClientConfiguration object which specifies various relevant properties.

I'm creating my ClientConfiguration object by calling the static loadDefault method. This method is supposed to look in various places for a client.conf file to load its defaults from. One of the places it's supposed to look is $ACCUMULO_CONF_DIR/client.conf.
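For reference, the client.conf file it should pick up is a plain properties file; a minimal one might look like the following (the ZooKeeper host and instance name here are placeholders, not my actual values):

```
# client.conf -- read by ClientConfiguration.loadDefault()
instance.zookeeper.host=zk1.example.com:2181
instance.name=my-instance
```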

Therefore, I am attempting to set the ACCUMULO_CONF_DIR environment variable in such a way that it will be visible when Spark runs the job (for reference, I'm attempting to run in the yarn-cluster deployment mode). I have not yet found a way to do that successfully.

So far, I've tried:

  • Calling setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf") on the SparkConf
  • Exporting ACCUMULO_CONF_DIR in spark-env.sh
  • Setting spark.executorEnv.ACCUMULO_CONF_DIR in spark-defaults.conf

None of them have worked. When I print the environment before calling setZooKeeperInstance, ACCUMULO_CONF_DIR does not appear.

If it's relevant, I'm using the CDH5 versions of everything.

Here's an example of what I'm trying to do (exception handling left out for brevity; main simply declares throws Exception):

import java.util.Map;

import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MySparkJob
{
    public static void main(String[] args) throws Exception
    {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MySparkJob");
        sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Job accumuloJob = Job.getInstance(sc.hadoopConfiguration());
        // Print the environment; ACCUMULO_CONF_DIR does not appear in the output.
        for (Map.Entry<String, String> env : System.getenv().entrySet())
            System.out.println(env.getKey() + "=" + env.getValue());
        ClientConfiguration accumuloConfiguration = ClientConfiguration.loadDefault();
        AccumuloInputFormat.setZooKeeperInstance(accumuloJob, accumuloConfiguration);
        // Other calls to AccumuloInputFormat static functions to configure it properly.
        JavaPairRDD<Key, Value> accumuloRDD =
            sc.newAPIHadoopRDD(accumuloJob.getConfiguration(),
                               AccumuloInputFormat.class,
                               Key.class,
                               Value.class);
    }
}

So I discovered the answer to this while writing the question (sorry, reputation seekers). The problem is that CDH5 uses Spark 1.0.0, and that I was running the job via YARN. Apparently, YARN mode does not pay any attention to the executor environment and instead uses the environment variable SPARK_YARN_USER_ENV to control its environment. So ensuring SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and makes ACCUMULO_CONF_DIR visible in the environment at the indicated point in the question's source example.
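Concretely, the workaround on Spark 1.0.0 is to export the variable in the environment of whoever runs spark-submit; SPARK_YARN_USER_ENV takes a comma-separated list of KEY=VALUE pairs. (The class and jar names below are just those from the example above; adjust to taste.)

```shell
# Read by Spark 1.0.x's YARN client and propagated to the containers
export SPARK_YARN_USER_ENV="ACCUMULO_CONF_DIR=/etc/accumulo/conf"

spark-submit --master yarn-cluster \
  --class MySparkJob \
  my-spark-job.jar
```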

This difference in how standalone mode and YARN mode work resulted in SPARK-1680, which is reported as fixed in Spark 1.1.0.
