Spark files not found in cluster deploy mode

I'm trying to run a Spark job in cluster deploy mode by issuing the following on the EMR cluster master node:

spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \ 
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties

I'm getting the following error, which states that the file cannot be found in the Spark executor working directory:

//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
        
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
                
//This is me getting the error:
    20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
                java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
                    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
                    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
                    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
                    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
                    at java.nio.file.Files.newByteChannel(Files.java:361)
                    at java.nio.file.Files.newByteChannel(Files.java:407)
                    at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
                    at java.nio.file.Files.newInputStream(Files.java:152)
                    at com.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
                    at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
                    at com.someOrg.somePackage.someClass.main(someClass.scala)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.lang.reflect.Method.invoke(Method.java:498)
                    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)

This is the function I use to attempt to read the properties files passed as arguments:

import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.Properties

// Loads a java.util.Properties instance from a local filesystem path.
def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties  = new Properties()
  properties.load(inputStream)
  properties
}

Invoked as:

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args(1)) and loadPropertiesFromFile(args(2)) directly

The program works as expected when submitted in client deploy mode:

spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties

This happens in Spark 2.4.5 / EMR 5.30.1.

Additionally, when I try to configure this job as an EMR step, it does not even work in client mode. Any clue as to how the resource files passed through the --files option are managed/persisted/made available in EMR?

Option 1: Put those files in S3 and pass the S3 path.
Option 2: Copy those files to a specific location on each node (using a bootstrap action) and pass the absolute path of the files, as sketched below.
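
For Option 2, a minimal sketch of what such a bootstrap action could look like (the bucket name, file names, and target directory /home/hadoop/resources are assumptions for illustration, not part of the original answer):

#!/bin/bash
# Hypothetical bootstrap action: copy the resource files from S3 to the same
# local path on every node, so the job can reference them by absolute path.
mkdir -p /home/hadoop/resources
aws s3 cp s3://someBucket/resources/truststore.jks     /home/hadoop/resources/
aws s3 cp s3://someBucket/resources/kafka.properties   /home/hadoop/resources/
aws s3 cp s3://someBucket/resources/program.properties /home/hadoop/resources/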

Solved with the suggestions from the comments above:

spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties

I was previously assuming that in cluster deploy mode the files listed under --files were also shipped alongside the driver deployed to a worker node (and thereby available in its working directory), as long as they were accessible from the machine where spark-submit is issued.

Bottom line: regardless of where you issue spark-submit from and whether the files are available on that machine, in cluster mode you must ensure that the files are accessible from every worker node.

It is now working after pointing the file locations to S3.

Thank you all!
