Sharing a config file with spark-submit in cluster mode

Question

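During development I have been running my Spark jobs in "client" mode. I use "--files" to share a config file with the executors, and the driver reads the config file locally. Now I want to deploy the job in "cluster" mode, and I am having trouble sharing the config file with the driver.

For example, I pass the config file name to both the driver and the executors through extraJavaOptions, and I read the file using SparkFiles.get():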

  val configFile = org.apache.spark.SparkFiles.get(System.getProperty("config.file.name"))

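This works fine on the executors but fails on the driver. I think the file is shared only with the executors, not with the container that runs the driver. One option is to keep the config file in S3, but I wanted to check whether this can be achieved with spark-submit alone.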

> spark-submit --deploy-mode cluster --master yarn --driver-cores 2 \
> --driver-memory 4g --num-executors 4 --executor-cores 4 --executor-memory 10g \
> --files /home/hadoop/Streaming.conf,/home/hadoop/log4j.properties \
> --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
> --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
> --class ....

Answer 1

Try the --properties-file option of the spark-submit command.

For example, with a properties file containing:

spark.key1=value1
spark.key2=value2

All keys need to be prefixed with "spark.".

Then pass the properties file to spark-submit like this:

bin/spark-submit --properties-file  propertiesfile.properties

Then, in your code, you can read the values through the SparkContext getConf method:

sc.getConf.get("spark.key1")  // returns value1

Once you have the values, you can use them anywhere.
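For a PySpark job, the same idea can be sketched as below. This is only an illustration, not part of the original answer: the "spark.myapp." prefix and the helper names app_settings and settings_from_spark are hypothetical, and the second function assumes PySpark is installed and the job was launched with --properties-file.

```python
def app_settings(pairs, prefix="spark.myapp."):
    """Collect the settings under `prefix` from (key, value) pairs.

    The prefix is stripped from the returned keys. The default prefix
    "spark.myapp." is a hypothetical namespace chosen for this sketch.
    """
    return {
        key[len(prefix):]: value
        for key, value in pairs
        if key.startswith(prefix)
    }


def settings_from_spark(prefix="spark.myapp."):
    """Read the prefixed settings from the running job's configuration.

    Assumes PySpark is available; the import is kept local so that the
    pure-Python helper above can be used and tested without it.
    """
    from pyspark import SparkConf, SparkContext

    sc = SparkContext.getOrCreate(SparkConf())
    # getAll() returns the configuration as a list of (key, value) tuples.
    return app_settings(sc.getConf().getAll(), prefix)
```

With a properties file that contains spark.myapp.topic=events, calling settings_from_spark() on the driver would be expected to return a dict with the key "topic". Grouping application settings under one prefix keeps them apart from Spark's own keys.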

Answer 2

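I found a solution to this problem in this thread.

You can give a file submitted with --files an alias by appending "#alias" to its path. With this trick, you should be able to access the file through its alias.

For example, the following runs without error: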

spark-submit --master yarn-cluster --files test.conf#testFile.conf test.py

with test.py as:

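path_f = 'testFile.conf'
try:
    # Open the file by the alias given after '#' in --files.
    with open(path_f, 'r') as f:
        contents = f.read()
except OSError as err:
    raise Exception('File not opened', 'EEEEEEE!') from err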

and an empty test.conf.
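To use one lookup on both the driver and the executors, the alias approach could be combined with the SparkFiles.get() call from the question. The following is a minimal sketch under assumptions: resolve_config and its search_dirs parameter are hypothetical names, and it assumes the aliased file is placed in the process's working directory, as the test above suggests for YARN cluster mode.

```python
import os


def resolve_config(name, search_dirs=(".",)):
    """Return an absolute path for a file shipped with --files.

    Looks in `search_dirs` first (by default the working directory, where
    the aliased file is assumed to be placed), then falls back to
    SparkFiles.get() when PySpark is importable and a context is active.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)

    try:
        from pyspark import SparkFiles
        candidate = SparkFiles.get(name)
    except Exception:
        # PySpark is missing or no SparkContext is active: no fallback path.
        candidate = None

    if candidate and os.path.isfile(candidate):
        return candidate
    raise FileNotFoundError("config file not found: " + name)
```

For the submission above, resolve_config('testFile.conf') would replace the hard-coded path in test.py. The explicit FileNotFoundError makes a missing file fail with the file name in the message instead of a generic error.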

Answer 1
3 2016-10-22 08:53:08

Answer 2
1 2018-04-16 21:48:24