[英]Spark files not found in cluster deploy mode
I'm trying to run a Spark job in cluster deploy mode by issuing in the EMR cluster master node:我正在尝试通过在 EMR 集群主节点中发出以集群部署模式运行 Spark 作业:
spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties
I'm getting the following error, which states that the file can not be found at the Spark executor working directory:我收到以下错误,指出在 Spark 执行程序工作目录中找不到该文件:
//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
//This is me getting the error:
20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at ccom.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
at com.someOrg.somePackage.someClass.main(someClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
This is the function I use to attempt to read the properties files passed as arguments:这是我用来尝试读取作为 arguments 传递的属性文件的 function:
def loadPropertiesFromFile(path: String): Properties = {
val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
val properties = new Properties()
properties.load(inputStream)
properties
}
Invoked as:调用为:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args({1,2}))
The program works as expected when issued with client deploy mode:当以客户端部署模式发布时,该程序按预期工作:
spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties
This happens in Spark 2.4.5 / EMR 5.30.1.这发生在 Spark 2.4.5 / EMR 5.30.1 中。
Additionally, when I try to configure this job as an EMR step it does not even work in client mode.此外,当我尝试将此作业配置为 EMR 步骤时,它甚至无法在客户端模式下工作。 Any clue on how are the resource files passed through
--files
option managed/persisted/available in EMR?有关如何通过
--files
选项在 EMR 中管理/持久/可用的资源文件的任何线索?
Option 1: Put those files in s3 and pass the s3 path.选项 1:将这些文件放在 s3 中并传递 s3 路径。 Option 2: copy those files to each node in a specific location(using bootstrap) and pass the absolute path of files.
选项 2:将这些文件复制到特定位置的每个节点(使用引导程序)并传递文件的绝对路径。
Solved with suggestions from the above comments:解决了上述评论的建议:
spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties
I was previously assuming that in cluster
deploy mode the files under --files
were also shipped alongside the driver deployed to a worker node (and thereby available in the working directory), if accessible from the machine where spark-submit is issued.我之前假设在
cluster
部署模式下,如果可以从发出 spark-submit 的机器访问,-- --files
下的文件也会与部署到工作节点的驱动程序一起发送(因此在工作目录中可用)。
Bottom line: Regardless of where you issue spark-submit from and the availability of the files in that machine, in cluster mode, you must ensure that files are accessible from every worker node.底线:无论您从哪里发出 spark-submit 以及该机器中文件的可用性,在集群模式下,您都必须确保可以从每个工作节点访问文件。
It is now working by pointing files location to S3.它现在通过将文件位置指向 S3 来工作。
Thank you all!谢谢你们!
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.