
Share config files with spark-submit in cluster mode

I've been running my Spark jobs in "client" mode during development. I use "--files" to share config files with the executors, and the driver was reading the config files locally. Now I want to deploy the job in "cluster" mode, and I'm having difficulty sharing the config files with the driver.

For example, I'm passing the config file name as extraJavaOptions to both the driver and the executors, and I'm reading the file using SparkFiles.get():

  val configFile = org.apache.spark.SparkFiles.get(System.getProperty("config.file.name"))

This works well on the executors but fails on the driver. I think the files are only shared with the executors, not with the container where the driver is running. One option is to keep the config files in S3. I wanted to check whether this can be achieved using spark-submit.

> spark-submit --deploy-mode cluster --master yarn --driver-cores 2
> --driver-memory 4g --num-executors 4 --executor-cores 4 --executor-memory 10g \
> --files /home/hadoop/Streaming.conf,/home/hadoop/log4j.properties \
> --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=log4j.properties
> -Dconfig.file.name=Streaming.conf" \
> --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties
> -Dconfig.file.name=Streaming.conf" \
> --class ....

You need to try the --properties-file option of the spark-submit command.

For example, a properties file with this content:

spark.key1=value1
spark.key2=value2

All the keys need to carry the spark. prefix.

Then use the spark-submit command like this to pass the properties file:

bin/spark-submit --properties-file  propertiesfile.properties

Then in the code you can read the keys using the SparkContext getConf method:

sc.getConf.get("spark.key1")  // returns value1

Once you have the key values, you can use them everywhere.
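For instance, here is a minimal Scala sketch of the driver-side code under this approach (the key names match the example above; the app and object names are just placeholders):

import org.apache.spark.sql.SparkSession

object PropertiesFileDemo {
  def main(args: Array[String]): Unit = {
    // spark-submit --properties-file propertiesfile.properties ... loads the
    // spark.* entries from the file into the application's SparkConf.
    val spark = SparkSession.builder().appName("properties-file-demo").getOrCreate()

    // Read the custom keys back on the driver; the second argument of get()
    // is a default value in case the key was not set.
    val value1 = spark.sparkContext.getConf.get("spark.key1")
    val value2 = spark.sparkContext.getConf.get("spark.key2", "fallback")

    println(s"spark.key1=$value1, spark.key2=$value2")
    spark.stop()
  }
}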

I found a solution to this problem in this thread.

You can give an alias to a file submitted through --files by adding '#alias' at the end. With this trick, you should be able to access the file through its alias.

For example, the following code runs without an error:

spark-submit --master yarn-cluster --files test.conf#testFile.conf test.py

with test.py as:

path_f = 'testFile.conf'
try:
    f = open(path_f, 'r')
except:
    raise Exception('File not opened', 'EEEEEEE!')

and an empty test.conf.
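Since the question's driver code is in Scala, here is the same trick sketched in Scala. It assumes, as the Python example above demonstrates, that a file passed as --files /home/hadoop/Streaming.conf#Streaming.conf is localized into the working directory of the driver container in cluster mode, so it can be opened by its alias as a relative path:

import scala.io.Source

// Submitted with, e.g.:
//   spark-submit --deploy-mode cluster --master yarn \
//     --files /home/hadoop/Streaming.conf#Streaming.conf ... --class ...
val configPath = "Streaming.conf"  // the alias given after '#'
val source = Source.fromFile(configPath)
try {
  val contents = source.getLines().mkString("\n")
  println(s"Read ${contents.length} characters from $configPath")
} finally {
  source.close()
}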
