
Spark job in Java: how to access files from 'resources' when run on a cluster

I wrote a Spark job in Java. The job is packaged as a shaded jar and executed with:

spark-submit my-jar.jar

In the code, there are some files (Freemarker templates) that reside in src/main/resources/templates . When run locally, I'm able to access the files:

File[] files = new File("src/main/resources/templates/").listFiles();

When the job is run on a cluster, a NullPointerException is thrown when the previous line is executed.

If I run jar tf my-jar.jar , I can see that the files are packaged in a templates/ folder:

 [...]
 templates/
 templates/my_template.ftl
 [...]

I'm just unable to read them; I suspect that .listFiles() tries to access the local filesystem on the cluster node, and the files aren't there.

I'm curious to know how I should package files to be used within a self-contained Spark job. I'd rather not copy them to HDFS outside of the job, because that becomes messy to maintain.

Your existing code is referencing them as files, which are not packaged up and shipped to the Spark nodes. But since they're inside your jar file, you should be able to reference them via Foo.class.getResourceAsStream("/templates/my_template.ftl") . More info on Java resource streams here: http://www.javaworld.com/article/2077352/java-se/smartly-load-your-properties.html
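A minimal sketch of that approach (the class and helper names here are illustrative, not from the original post) — reading a packaged resource as a classpath stream instead of a File:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ResourceDemo {

    // Read a classpath resource into a String; a leading slash makes the
    // path absolute relative to the classpath root (i.e. the jar root).
    static String readResource(String path) throws Exception {
        try (InputStream in = ResourceDemo.class.getResourceAsStream(path)) {
            if (in == null) {
                throw new IllegalStateException("Resource not found: " + path);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Self-contained check: a class's own .class file is itself a
        // classpath resource, so the lookup can be demonstrated without
        // packaging extra files.
        boolean found = ResourceDemo.class.getResourceAsStream("ResourceDemo.class") != null;
        System.out.println(found);
        // In the actual job you would call, e.g.:
        // String template = readResource("/templates/my_template.ftl");
    }
}
```

For Freemarker specifically, the library can also load templates from the classpath itself via Configuration.setClassForTemplateLoading(SomeClass.class, "/templates") , which avoids handling streams by hand.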

It appears that running Scala (2.11) code on Spark does not support accessing resources in shaded jars.

Executing this code:

var path = getClass.getResource(fileName)
println("#### Resource: " + path.getPath())

prints the expected string when run outside of Spark.

When run inside Spark, a java.lang.NullPointerException is raised because path is null.
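Before concluding that shading itself is the culprit, it's worth noting how getResource resolves paths: a relative path is resolved against the calling class's package, while a leading slash resolves from the classpath root, so a path that works in one layout can return null in another. A small sketch (PathDemo is an illustrative class in the default package, where both forms happen to succeed):

```java
public class PathDemo {
    public static void main(String[] args) {
        // Relative: resolved against this class's package (the default
        // package here, so effectively the classpath root).
        System.out.println(PathDemo.class.getResource("PathDemo.class") != null);
        // Absolute: a leading slash resolves from the classpath root.
        System.out.println(PathDemo.class.getResource("/PathDemo.class") != null);
    }
}
```

Inside a jar the returned URL is of the form jar:file:/my-jar.jar!/... — not a usable filesystem path — which is another reason to read resources as streams rather than converting them to File objects.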

I have accessed my resource file as shown below in spark-scala. I've shared my code, please check:

val fs = this.getClass().getClassLoader().getResourceAsStream("smoke_test/loadhadoop.txt")

val dataString = scala.io.Source.fromInputStream(fs).mkString
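For the original use case of listing all templates, .listFiles() cannot enumerate entries inside a jar; one alternative is to scan the jar's entries directly. A sketch under the assumption that the code runs either from a jar or from an exploded classes directory (JarLister and the prefix argument are illustrative names, not from the original posts, and getCodeSource() can be null under exotic classloaders):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarLister {

    // Return entry names under `prefix` from wherever `clazz` was loaded:
    // the containing jar when running from a jar, or the classes directory
    // when running exploded (e.g. from an IDE or plain javac output).
    static List<String> listEntries(Class<?> clazz, String prefix) throws Exception {
        File source = new File(
                clazz.getProtectionDomain().getCodeSource().getLocation().toURI());
        List<String> names = new ArrayList<>();
        if (source.isFile()) { // running from a jar: walk its entries
            try (JarFile jar = new JarFile(source)) {
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.startsWith(prefix) && !name.endsWith("/")) {
                        names.add(name);
                    }
                }
            }
        } else { // running from an exploded directory: plain listFiles works
            File[] files = new File(source, prefix).listFiles();
            if (files != null) {
                for (File f : files) {
                    names.add(prefix + f.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Self-contained check: this class's own .class file is always an entry.
        System.out.println(listEntries(JarLister.class, "").contains("JarLister.class"));
    }
}
```

Each discovered name (e.g. templates/my_template.ftl) can then be opened with getResourceAsStream("/" + name), so the same code covers both local runs and the shaded jar on the cluster.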


