How do I set an environment variable in a YARN Spark job?
I'm attempting to access Accumulo 1.6 from an Apache Spark job (written in Java) by using an AccumuloInputFormat with newAPIHadoopRDD. In order to do this, I have to tell the AccumuloInputFormat where to locate ZooKeeper by calling the setZooKeeperInstance method. This method takes a ClientConfiguration object which specifies various relevant properties.
I'm creating my ClientConfiguration object by calling the static loadDefault method. This method is supposed to look in various places for a client.conf file to load its defaults from. One of the places it's supposed to look is $ACCUMULO_CONF_DIR/client.conf.
Therefore, I am attempting to set the ACCUMULO_CONF_DIR environment variable in such a way that it will be visible when Spark runs the job (for reference, I'm attempting to run in the yarn-cluster deployment mode). I have not yet found a way to do that successfully.
So far, I've tried:

- Calling setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf") on the SparkConf
- Exporting ACCUMULO_CONF_DIR in spark-env.sh
- Setting spark.executorEnv.ACCUMULO_CONF_DIR in spark-defaults.conf (both file entries are sketched below)
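For reference, the file-based attempts used entries of this form (a sketch; /etc/accumulo/conf is simply where my client.conf lives):

    # In spark-env.sh:
    export ACCUMULO_CONF_DIR=/etc/accumulo/conf

    # In spark-defaults.conf:
    spark.executorEnv.ACCUMULO_CONF_DIR /etc/accumulo/conf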
None of them have worked. When I print the environment before calling setZooKeeperInstance, ACCUMULO_CONF_DIR does not appear.
If it's relevant, I'm using the CDH5 versions of everything.
Here's an example of what I'm trying to do (exception handling left out for brevity):
import java.util.Map;

import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MySparkJob
{
    public static void main(String[] args) throws Exception
    {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MySparkJob");
        sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Job accumuloJob = Job.getInstance(sc.hadoopConfiguration());
        // Foreach loop to print environment, shows no ACCUMULO_CONF_DIR.
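        // A minimal sketch of that loop (not in the original post):
        for (Map.Entry<String, String> envEntry : System.getenv().entrySet())
        {
            System.out.println(envEntry.getKey() + "=" + envEntry.getValue());
        }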
        ClientConfiguration accumuloConfiguration = ClientConfiguration.loadDefault();
        AccumuloInputFormat.setZooKeeperInstance(accumuloJob, accumuloConfiguration);
        // Other calls to AccumuloInputFormat static functions to configure it properly.
        JavaPairRDD<Key, Value> accumuloRDD =
            sc.newAPIHadoopRDD(accumuloJob.getConfiguration(),
                               AccumuloInputFormat.class,
                               Key.class,
                               Value.class);
    }
}
So I discovered the answer to this while writing the question (sorry, reputation seekers). The problem is that CDH5 uses Spark 1.0.0, and that I was running the job via YARN. Apparently, YARN mode does not pay any attention to the executor environment and instead uses the environment variable SPARK_YARN_USER_ENV to control its environment. So ensuring that SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and makes ACCUMULO_CONF_DIR visible in the environment at the indicated point in the question's source example.
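For example, exporting the variable in the shell that runs spark-submit does the trick (a sketch; the jar and class names here are placeholders, not from the original post):

    export SPARK_YARN_USER_ENV="ACCUMULO_CONF_DIR=/etc/accumulo/conf"
    spark-submit --master yarn-cluster --class MySparkJob my-spark-job.jar

Note that SPARK_YARN_USER_ENV takes a comma-separated list of NAME=value pairs, so additional variables can be appended to the same string.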
This difference in how standalone mode and YARN mode work resulted in SPARK-1680, which is reported as fixed in Spark 1.1.0.