
Spark job in Java: how to access files from 'resources' when run on a cluster

I wrote a Spark job in Java. The job is packaged as a shaded jar and executed with:

spark-submit my-jar.jar

In the code, there are some files (Freemarker templates) that reside in src/main/resources/templates . When run locally, I'm able to access the files:

File[] files = new File("src/main/resources/templates/").listFiles();

When the job is run on a cluster, a NullPointerException is thrown when the previous line is executed.

If I run jar tf my-jar.jar , I can see that the files are packaged in a templates/ folder:

 [...]
 templates/
 templates/my_template.ftl
 [...]

I'm just unable to read them; I suspect that .listFiles() tries to access the local filesystem on the cluster node, and the files aren't there.

I'm curious to know how I should package files to be used within a self-contained Spark job. I'd rather not copy them to HDFS outside of the job, because that becomes messy to maintain.

Your existing code is referencing them as files, which are not packaged up and shipped to the Spark nodes. But since they're inside your jar file, you should be able to reference them via Foo.class.getResourceAsStream("/templates/my_template.ftl") . More info on Java resource streams here: http://www.javaworld.com/article/2077352/java-se/smartly-load-your-properties.html
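A minimal sketch of that approach (the class and helper names here are illustrative, not from the original post) — reading a packaged resource as a classpath stream instead of a File:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ResourceDemo {

    // Read a classpath resource into a String; a leading slash makes the
    // path absolute relative to the classpath root (i.e. the jar root).
    static String readResource(String path) throws Exception {
        try (InputStream in = ResourceDemo.class.getResourceAsStream(path)) {
            if (in == null) {
                throw new IllegalStateException("Resource not found: " + path);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Self-contained check: a class's own .class file is itself a
        // classpath resource, so the lookup can be demonstrated without
        // packaging extra files.
        boolean found = ResourceDemo.class.getResourceAsStream("ResourceDemo.class") != null;
        System.out.println(found);
        // In the actual job you would call, e.g.:
        // String template = readResource("/templates/my_template.ftl");
    }
}
```

For Freemarker specifically, the library can also load templates from the classpath itself via Configuration.setClassForTemplateLoading(SomeClass.class, "/templates") , which avoids handling streams by hand.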

It appears that running Scala (2.11) code on Spark does not support accessing resources in shaded jars.

Executing this code:

var path = getClass.getResource(fileName)
println("#### Resource: " + path.getPath())

prints the expected string when run outside of Spark.

When run inside Spark, a java.lang.NullPointerException is raised because path is null.
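Before concluding that shading itself is the culprit, it's worth noting how getResource resolves paths: a relative path is resolved against the calling class's package, while a leading slash resolves from the classpath root, so a path that works in one layout can return null in another. A small sketch (PathDemo is an illustrative class in the default package, where both forms happen to succeed):

```java
public class PathDemo {
    public static void main(String[] args) {
        // Relative: resolved against this class's package (the default
        // package here, so effectively the classpath root).
        System.out.println(PathDemo.class.getResource("PathDemo.class") != null);
        // Absolute: a leading slash resolves from the classpath root.
        System.out.println(PathDemo.class.getResource("/PathDemo.class") != null);
    }
}
```

Inside a jar the returned URL is of the form jar:file:/my-jar.jar!/... — not a usable filesystem path — which is another reason to read resources as streams rather than converting them to File objects.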

I have accessed my resource file as shown below in spark-scala. I've shared my code, please check:

val fs = this.getClass().getClassLoader().getResourceAsStream("smoke_test/loadhadoop.txt")

val dataString = scala.io.Source.fromInputStream(fs).mkString
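For the original use case of listing all templates, .listFiles() cannot enumerate entries inside a jar; one alternative is to scan the jar's entries directly. A sketch under the assumption that the code runs either from a jar or from an exploded classes directory (JarLister and the prefix argument are illustrative names, not from the original posts, and getCodeSource() can be null under exotic classloaders):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarLister {

    // Return entry names under `prefix` from wherever `clazz` was loaded:
    // the containing jar when running from a jar, or the classes directory
    // when running exploded (e.g. from an IDE or plain javac output).
    static List<String> listEntries(Class<?> clazz, String prefix) throws Exception {
        File source = new File(
                clazz.getProtectionDomain().getCodeSource().getLocation().toURI());
        List<String> names = new ArrayList<>();
        if (source.isFile()) { // running from a jar: walk its entries
            try (JarFile jar = new JarFile(source)) {
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.startsWith(prefix) && !name.endsWith("/")) {
                        names.add(name);
                    }
                }
            }
        } else { // running from an exploded directory: plain listFiles works
            File[] files = new File(source, prefix).listFiles();
            if (files != null) {
                for (File f : files) {
                    names.add(prefix + f.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Self-contained check: this class's own .class file is always an entry.
        System.out.println(listEntries(JarLister.class, "").contains("JarLister.class"));
    }
}
```

Each discovered name (e.g. templates/my_template.ftl) can then be opened with getResourceAsStream("/" + name), so the same code covers both local runs and the shaded jar on the cluster.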


