
Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How can I access and read local file data in Spark when it is running in YARN cluster mode?

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

Spark code to read csv:

val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")

The above spark-submit fails with a file-not-found error for the local file (/home/test_dir/test_file.csv).

By default, Spark looks for the file on hdfs://, but my file is on the local file system; it should not be copied into HDFS and should be read only from the local file system.

Any suggestions to resolve this error?

Using the file:// prefix will read files from the YARN NodeManager's filesystem, not from the machine where you submitted the code.

To access your --files, use csv("#test_file.csv")
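
A minimal sketch of that idea, assuming the file was shipped with --files exactly as in the spark-submit above: resolve the localized path with SparkFiles.get and read it as an ordinary local file.

import org.apache.spark.SparkFiles
import scala.io.Source

// SparkFiles.get resolves the local path where YARN placed the copy
// shipped via --files (the same mechanism as sc.addFile)
val localPath = SparkFiles.get("test_file.csv")
// read the localized copy with plain Scala I/O; it lives on the
// container's local disk, not in HDFS
val lines = Source.fromFile(localPath).getLines().toList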

Regarding "should not be copied into hdfs": using --files copies the files into a temporary location that is mounted by the YARN executors, and you can see them from the YARN UI.
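
For reference, YARN's localization also lets you rename the shipped copy with a # fragment in --files; a sketch of the submit command (--class is left blank, as in the question):

spark-submit --master yarn --deploy-mode cluster \
  --files /home/test_dir/test_file.csv#data.csv \
  --class "" test.jar

Inside the job, the file is then visible as data.csv in each container's working directory.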

The solution below worked for me:

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

To access the file passed in spark-submit:

import scala.io.Source
// fromFile (not fromPath) reads the localized copy by its bare name
val lines = Source.fromFile("test_file.csv").getLines().toList

Instead of specifying the complete path, specify only the name of the file we want to read. Since Spark has already copied the file to each node, we can access the file's data with just the file name.
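
If the data is then needed as a DataFrame rather than raw lines, a hedged follow-up sketch (assuming Spark 2.2+, where the csv(Dataset[String]) overload exists, and that the driver container also received the --files copy) is to hand the in-memory lines to Spark's CSV parser:

import org.apache.spark.sql.{Encoders, SparkSession}
import scala.io.Source

val spark = SparkSession.builder().getOrCreate()
// read the localized copy shipped via --files by its bare file name
val lines = Source.fromFile("test_file.csv").getLines().toList
// parse the in-memory lines with Spark's CSV reader
val ds = spark.createDataset(lines)(Encoders.STRING)
val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds)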
