How do I set an environment variable in a YARN Spark job?
I'm attempting to access Accumulo 1.6 from an Apache Spark job (written in Java) by using an AccumuloInputFormat with newAPIHadoopRDD. In order to do this, I have to tell the AccumuloInputFormat where to locate ZooKeeper by calling the setZooKeeperInstance method. This method takes a ClientConfiguration object which specifies various relevant properties.
I'm creating my ClientConfiguration object by calling the static loadDefault method. This method is supposed to look in various places for a client.conf file to load its defaults from. One of the places it's supposed to look is $ACCUMULO_CONF_DIR/client.conf.
Therefore, I am attempting to set the ACCUMULO_CONF_DIR environment variable in such a way that it will be visible when Spark runs the job (for reference, I'm attempting to run in the yarn-cluster deployment mode). I have not yet found a way to do that successfully.
So far, I've tried:

- Calling setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf") on the SparkConf
- Exporting ACCUMULO_CONF_DIR in spark-env.sh
- Setting spark.executorEnv.ACCUMULO_CONF_DIR in spark-defaults.conf (both file entries are sketched below)
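For reference, the file-based attempts used entries of this form (a sketch; /etc/accumulo/conf is simply where my client.conf lives):

    # In spark-env.sh:
    export ACCUMULO_CONF_DIR=/etc/accumulo/conf

    # In spark-defaults.conf:
    spark.executorEnv.ACCUMULO_CONF_DIR /etc/accumulo/conf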
None of them have worked. When I print the environment before calling setZooKeeperInstance, ACCUMULO_CONF_DIR does not appear.
If it's relevant, I'm using the CDH5 versions of everything.
Here's an example of what I'm trying to do (exception handling left out for brevity):
import java.util.Map;

import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MySparkJob
{
    public static void main(String[] args) throws Exception
    {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MySparkJob");
        sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Job accumuloJob = Job.getInstance(sc.hadoopConfiguration());
        // Foreach loop to print environment, shows no ACCUMULO_CONF_DIR.
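        // A minimal sketch of that loop (not in the original post):
        for (Map.Entry<String, String> envEntry : System.getenv().entrySet())
        {
            System.out.println(envEntry.getKey() + "=" + envEntry.getValue());
        }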
        ClientConfiguration accumuloConfiguration = ClientConfiguration.loadDefault();
        AccumuloInputFormat.setZooKeeperInstance(accumuloJob, accumuloConfiguration);
        // Other calls to AccumuloInputFormat static functions to configure it properly.
        JavaPairRDD<Key, Value> accumuloRDD =
            sc.newAPIHadoopRDD(accumuloJob.getConfiguration(),
                               AccumuloInputFormat.class,
                               Key.class,
                               Value.class);
    }
}
So I discovered the answer to this while writing the question (sorry, reputation seekers). The problem is that CDH5 uses Spark 1.0.0, and that I was running the job via YARN. Apparently, YARN mode does not pay any attention to the executor environment and instead uses the environment variable SPARK_YARN_USER_ENV to control its environment. So ensuring that SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and makes ACCUMULO_CONF_DIR visible in the environment at the indicated point in the question's source example.
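For example, exporting the variable in the shell that runs spark-submit does the trick (a sketch; the jar and class names here are placeholders, not from the original post):

    export SPARK_YARN_USER_ENV="ACCUMULO_CONF_DIR=/etc/accumulo/conf"
    spark-submit --master yarn-cluster --class MySparkJob my-spark-job.jar

Note that SPARK_YARN_USER_ENV takes a comma-separated list of NAME=value pairs, so additional variables can be appended to the same string.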
This difference in how standalone mode and YARN mode work resulted in SPARK-1680, which is reported as fixed in Spark 1.1.0.