
Spark wholeTextFiles difference between shell and app

I've copy-pasted a line that looks like this

val files = sc.wholeTextFiles("file:///path/to/files/*.csv")

from the Spark shell, where it runs, into an application, where it does not. Instead I get an error saying the pattern matched 0 files, even though in the shell I can see all the files and Spark reads them.

What am I missing? Is this a file permissions problem?

I'm running the app as follows:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /usr/hdp/current/spark/conf/hive-site.xml \
  --num-executors 20 \
  --driver-memory 8G \
  --executor-memory 4G \
  --class com.myorg.pkg.MyApp \
  MyApp-assembly-0.1.jar

In order for this to work, all of your executors need access to this file. If the file is not on the local filesystem of every executor, you will run into issues. Note that with `--deploy-mode cluster` the driver itself also runs on an arbitrary cluster node, so a `file:///` path that exists on the machine you submit from may not exist where the job actually runs.

One option would be to place the files on HDFS and provide the path as hdfs:/path/to/file.csv . This way all of the executors have access to them.
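A minimal sketch of the HDFS approach; the HDFS destination directory here is illustrative, only the local source path comes from the question:

```shell
# Copy the CSVs from the local filesystem into HDFS
# (/data/csv is an illustrative destination directory)
hdfs dfs -mkdir -p /data/csv
hdfs dfs -put /path/to/files/*.csv /data/csv/

# In the application, read them from HDFS instead of file://
#   val files = sc.wholeTextFiles("hdfs:///data/csv/*.csv")
```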

Another option would be to pass the file via the --files parameter. This ships the file out to all the executors so they all have access to it.
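A sketch of the --files approach, reusing the jar and class names from the question; the individual CSV names are illustrative, since --files takes a comma-separated list of files rather than a glob:

```shell
# Ship the listed files to every executor's working directory
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /path/to/files/a.csv,/path/to/files/b.csv \
  --class com.myorg.pkg.MyApp \
  MyApp-assembly-0.1.jar

# Inside the job, each shipped file lands in the container's working
# directory and can be located with SparkFiles.get, e.g.
#   val path = org.apache.spark.SparkFiles.get("a.csv")
```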
