Spark files not found in cluster deploy mode

Question

I'm trying to run a Spark job in cluster deploy mode by issuing the following command on the EMR cluster master node:

spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties

I'm getting the following error, which states that the file cannot be found in the Spark executor working directory:

//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
        
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
                
//This is me getting the error:
    20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
                java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
                    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
                    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
                    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
                    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
                    at java.nio.file.Files.newByteChannel(Files.java:361)
                    at java.nio.file.Files.newByteChannel(Files.java:407)
                    at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
                    at java.nio.file.Files.newInputStream(Files.java:152)
                    at com.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
                    at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
                    at com.someOrg.somePackage.someClass.main(someClass.scala)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.lang.reflect.Method.invoke(Method.java:498)
                    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)

This is the function I use to attempt to read the properties files passed as arguments:

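import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.Properties

// Loads a java.util.Properties file from a path on the local filesystem.
def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties  = new Properties()
  // Close the stream even if load() throws.
  try properties.load(inputStream) finally inputStream.close()
  properties
}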

Invoked as:

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
// Also tried passing args(1) and args(2) directly, without SparkFiles.get

The program works as expected when submitted in client deploy mode:

spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties

This happens in Spark 2.4.5 / EMR 5.30.1.

Additionally, when I try to configure this job as an EMR step, it does not work even in client mode. Any clue on how the resource files passed through the --files option are managed, persisted, and made available in EMR?

Answer 1

Option 1: Put those files in S3 and pass the S3 paths.
Option 2: Copy those files to a specific location on each node (using a bootstrap action) and pass the absolute paths of the files.
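
Option 2 could be sketched as an EMR bootstrap action along the following lines. This is a minimal sketch rather than a tested deployment script: it assumes the AWS CLI is available on the nodes, and `SRC_PREFIX`, `DEST_DIR`, `AWS_CLI`, `copy_resources`, and the `--run` flag are names made up for this example.

```shell
#!/usr/bin/env bash
# Sketch of a bootstrap action for Option 2: copy the resource files to the
# same local directory on every node.
# Assumptions (not from the question): SRC_PREFIX, DEST_DIR and the AWS_CLI
# override are placeholder names chosen for this example.
SRC_PREFIX="${SRC_PREFIX:-s3://someBucket/resources}"
DEST_DIR="${DEST_DIR:-/home/hadoop/resources}"
AWS_CLI="${AWS_CLI:-aws}"   # set AWS_CLI=echo for a dry run

copy_resources() {
  mkdir -p "$DEST_DIR" || return 1
  local f
  for f in "$@"; do
    # "aws s3 cp <source> <destination>" copies a single object.
    "$AWS_CLI" s3 cp "$SRC_PREFIX/$f" "$DEST_DIR/$f" || return 1
  done
}

# Copy only when the (hypothetical) --run argument is given, so the function
# can be sourced and dry-run first.
if [ "${1:-}" = "--run" ]; then
  copy_resources truststore.jks kafka.properties program.properties
fi
```

With the files in place, the job would presumably be given absolute paths such as /home/hadoop/resources/program.properties as arguments and read them directly, without going through SparkFiles.get.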

Answer 2

Solved with the suggestions from the comments above:

spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties

I was previously assuming that, in cluster deploy mode, the files under --files were also shipped alongside the driver deployed to a worker node (and thereby available in its working directory), as long as they were accessible from the machine where spark-submit is issued.

Bottom line: regardless of where you issue spark-submit from and whether the files are available on that machine, in cluster mode you must ensure that the files are accessible from every worker node.

It is now working after pointing the file locations to S3.
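
The --files value is a comma-separated list with no spaces, which is easy to mistype when every entry carries the same long prefix. As a small convenience sketch, a helper like the hypothetical `build_files_arg` below could assemble it; the function name and the `FILES_ARG` variable are made up for this example, and the bucket path is the placeholder used in this answer.

```shell
#!/usr/bin/env bash
# Build a comma-separated --files value by prefixing each file name with a
# shared location (the S3 prefix below is a placeholder).
build_files_arg() {
  local prefix="${1%/}"   # drop one trailing slash, if any
  shift
  local out="" f
  for f in "$@"; do
    # Add a comma only between entries, never before the first one.
    out="${out:+$out,}$prefix/$f"
  done
  printf '%s\n' "$out"
}

FILES_ARG="$(build_files_arg s3://someBucket/resources truststore.jks kafka.properties program.properties)"
echo "$FILES_ARG"
```

The result can then be passed as --files "$FILES_ARG" in the spark-submit command.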

Thank you all!

Answer 1: score 2, accepted, 2020-07-03 22:53:07
Answer 2: score 0, 2020-07-04 14:17:34
