
Spark wholeTextFiles difference between shell and app

I've copy-pasted a line that looks like this

val files = sc.wholeTextFiles("file:///path/to/files/*.csv")

from the Spark shell, where it runs, into an application, where it does not. Instead I get an error saying the pattern matched 0 files, even though in the shell I can see all the files and Spark reads them.

What am I missing? Is this a file permissions problem?

I'm running the app as follows:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /usr/hdp/current/spark/conf/hive-site.xml \
  --num-executors 20 \
  --driver-memory 8G \
  --executor-memory 4G \
  --class com.myorg.pkg.MyApp \
  MyApp-assembly-0.1.jar

In order for this to work, all of your executors need access to this file. If the file is not on the local filesystem of every executor, you will run into issues. Note that with `--deploy-mode cluster` the driver itself also runs on an arbitrary cluster node, so a `file:///` path that exists on the machine you submit from may not exist where the job actually runs.

One option would be to place the files on HDFS and provide the path as hdfs:/path/to/file.csv . This way all of the executors have access to them.
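A minimal sketch of the HDFS approach; the HDFS destination directory here is illustrative, only the local source path comes from the question:

```shell
# Copy the CSVs from the local filesystem into HDFS
# (/data/csv is an illustrative destination directory)
hdfs dfs -mkdir -p /data/csv
hdfs dfs -put /path/to/files/*.csv /data/csv/

# In the application, read them from HDFS instead of file://
#   val files = sc.wholeTextFiles("hdfs:///data/csv/*.csv")
```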

Another option would be to pass the file via the --files parameter. This ships the file out to all the executors so they all have access to it.
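A sketch of the --files approach, reusing the jar and class names from the question; the individual CSV names are illustrative, since --files takes a comma-separated list of files rather than a glob:

```shell
# Ship the listed files to every executor's working directory
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /path/to/files/a.csv,/path/to/files/b.csv \
  --class com.myorg.pkg.MyApp \
  MyApp-assembly-0.1.jar

# Inside the job, each shipped file lands in the container's working
# directory and can be located with SparkFiles.get, e.g.
#   val path = org.apache.spark.SparkFiles.get("a.csv")
```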
