
Spark yarn-cluster mode - read file passed with --files

I'm running my spark application using yarn-cluster master.

What does the app do?

  1. An external service generates a JSON file based on an HTTP request to a REST service
  2. Spark needs to read this file and do some work after parsing the JSON

The simplest solution that came to mind was to use --files to load that file. In yarn-cluster mode, reading a file means it must be available on HDFS (if I'm right?), and my file is being copied to a path like this:

/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json

There I can of course read it, but I cannot find a way to get this path from any configuration / SparkEnv object. And hardcoding .sparkStaging in the Spark code seemed like a bad idea.

Why does this simple snippet:

val jsonStringData = spark.textFile(myFileName)
sqlContext.read.json(jsonStringData)

fail to read the file passed with --files and throw a FileNotFoundException? Why does Spark look for the file in hadoop_user_folder only?

My solution which works for now:

Just before running Spark, I copy the file to the proper HDFS folder, pass the filename as a Spark argument, process the file from the known path, and after the job is done I delete the file from HDFS.
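
For reference, the pre-copy and clean-up steps around the job could look roughly like this with the Hadoop FileSystem API (a hypothetical sketch; the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// copy the generated file to a known HDFS location before spark-submit
FileSystem fs = FileSystem.get(new Configuration());
Path target = new Path("/user/hadoop_user/tmp/myFile.json");
fs.copyFromLocalFile(new Path("/local/path/myFile.json"), target);

// ... run the Spark job, passing "/user/hadoop_user/tmp/myFile.json" as an argument ...

// remove the file from HDFS once the job has finished
fs.delete(target, false);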

I thought passing the file with --files would let me forget about saving and deleting this file. Something like pass-process-and-forget.

How do you read a file passed with --files then? Is the only solution to build the path by hand, hardcoding the ".sparkStaging" folder path?

The question is written very ambiguously. However, from what I gather, you want to read a file from any location on your local OS file system, and not just from HDFS.

Spark uses URIs to identify paths, and when a valid Hadoop/HDFS environment is available, it will default to HDFS. In that case, to point to your local OS file system on, for example, UNIX/Linux, you can use something like:

file:///home/user/my_file.txt

If you are reading this file into an RDD, running in yarn-cluster mode, or accessing the file within a task, you will need to take care of copying and distributing that file manually to all nodes in your cluster, under the same path. That is what makes it easier to first put it on HDFS, or that is what the --files option is supposed to do for you.
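
For example, a minimal sketch (assuming the path exists on every node that runs a task, or that the job runs in local mode):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Spell out the file:// scheme so Spark does not resolve the path against HDFS.
SparkConf conf = new SparkConf().setAppName("local-file-read");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("file:///home/user/my_file.txt");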

For more info, see the Spark documentation on External Datasets.

For any files that were added through the --files option or through SparkContext.addFile, you can get their location using the SparkFiles helper class.
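
For example, a minimal sketch in Java, assuming the file was submitted as --files /path/to/myFile.json and this code runs inside the driver or an executor:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.SparkFiles;

// SparkFiles.get resolves the local path where the distributed copy of the file was placed
String localPath = SparkFiles.get("myFile.json");

// read it like any ordinary local file; every executor gets its own local copy as well
String jsonString = new String(Files.readAllBytes(Paths.get(localPath)));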

The answer from @hartar worked for me. Here is the complete solution.

Add the required files during spark-submit using --files:

spark-submit --name "my_job" --master yarn --deploy-mode cluster --files /home/xyz/file1.properties,/home/xyz/file2.properties --class test.main /home/xyz/my_test_jar.jar

Get the Spark session inside the main method:

SparkSession ss = new SparkSession.Builder().getOrCreate();

Since I am interested only in .properties files, I filter for them; if instead you know the exact file name you wish to read, it can be used directly in FileInputStream.

spark.yarn.dist.files stores them as file:/home/xyz/file1.properties,file:/home/xyz/file2.properties, hence I split the string by (,) and (/) so that everything except the file names is dropped.

// requires java.io.FileInputStream, java.util.Properties and java.util.regex.Pattern
String[] files = Pattern.compile("/|,")
        .splitAsStream(ss.conf().get("spark.yarn.dist.files"))
        .filter(s -> s.contains(".properties"))
        .toArray(String[]::new);

// load every distributed .properties file into a single Properties object
Properties props = new Properties();
for (String f : files) {
    props.load(new FileInputStream(f));
}

I had the same problem as you. In fact, you should know that when you submit an executable and files, they end up at the same level, so in your executable it is enough to use just the file name to access it, since the executable runs from its own working directory.

You do not need to use SparkFiles or any other class. Just use something like readFile("myFile.json");
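
In other words, something like this (a rough sketch; myFile.json is just the example name from the question):

import java.io.FileInputStream;
import java.io.InputStream;

// The file shipped with --files sits in the container's working directory,
// so the bare file name is enough.
InputStream in = new FileInputStream("myFile.json");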

I have come across an easy way to do it. We are using Spark 2.3.0 on YARN in pseudo-distributed mode. We need to query a Postgres table from Spark whose configuration is defined in a properties file. I passed the properties file using the --files option of spark-submit. To read the file in my code I simply used the java.util.Properties class.

I just need to ensure that the path I specify when loading the file is the same as the one passed in the --files argument.

e.g. if the spark-submit command looked like:

spark-submit --class <main-class> --master yarn --deploy-mode client --files test/metadata.properties myjar.jar

Then my code to read the file looks like:

Properties props = new Properties();
props.load(new FileInputStream(new File("test/metadata.properties")));

Hope you find this helpful.
