
Spark wholeTextFiles difference between shell and app

I've copy-pasted a line that looks like this

val files = sc.wholeTextFiles("file:///path/to/files/*.csv")

from the Spark shell, where it runs, into an application, where it does not. Instead I get a message that the pattern matches 0 files, even though in the shell I can see all the files and Spark reads them.

What am I missing? Is this a file permissions problem?

I'm running the app as follows:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /usr/hdp/current/spark/conf/hive-site.xml \
  --num-executors 20 \
  --driver-memory 8G \
  --executor-memory 4G \
  --class com.myorg.pkg.MyApp \
  MyApp-assembly-0.1.jar

In order for this to work, every node that runs your code needs access to the file at that path. You are submitting with --deploy-mode cluster, so the driver and the executors run on cluster nodes, and a file:/// path is resolved against each node's local filesystem; if the files only exist on the machine you submit from, the pattern matches nothing.

One option would be to place the file on HDFS and provide the path as hdfs:///path/to/file.csv. This way all of the executors have access to it, as sketched below.
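
A minimal sketch of that approach, assuming the CSVs have already been uploaded to a hypothetical HDFS directory /data/input (for example with hdfs dfs -put /path/to/files/*.csv /data/input/):

// Read whole files from HDFS instead of a node-local path.
// Every executor can resolve this path because HDFS is shared across the cluster.
val files = sc.wholeTextFiles("hdfs:///data/input/*.csv")
files.keys.take(5).foreach(println)  // sanity check: print a few of the matched paths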

Another option would be to pass the file with the --files option of spark-submit. This ships the file to the working directory of every executor, so they all have access to a local copy; see the sketch below.
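
A sketch of the --files approach, assuming a hypothetical file submitted with --files /local/path/sample.csv. On the executors the shipped copy can be located by its bare file name via SparkFiles.get:

import org.apache.spark.SparkFiles

// Run the lookup inside a task so it resolves the copy shipped to each executor.
val lines = sc.parallelize(Seq(1)).flatMap { _ =>
  val localPath = SparkFiles.get("sample.csv") // local path of the shipped copy on this executor
  scala.io.Source.fromFile(localPath).getLines().toList
}
lines.take(5).foreach(println)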
