Reading a local/Linux file from Spark Scala code running in YARN cluster mode

Question

How can I access and read data from a local file in a Spark job running in YARN cluster mode?

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

Spark code to read the CSV:

val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")

The spark-submit above fails with a local file not found error (/home/test_dir/test_file.csv).

By default Spark looks for the file in hdfs://, but my file is on the local filesystem. It should not be copied into HDFS and should be read only from the local filesystem.

Any suggestions to resolve this error?

Answer 1

Using the file:// prefix reads files from the filesystem of the YARN NodeManager host where the code runs, not from the machine where you submitted the job.
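To make the difference between the two paths in the question concrete, here is a minimal sketch in plain Scala (no Spark needed) that only inspects the URI scheme of each path string. The object name PathSchemes and the helper schemeOf are made up for illustration, and the descriptions it prints reflect my understanding of how Hadoop-style path resolution usually behaves, not output taken from Spark itself.

```scala
import java.net.URI

object PathSchemes {
  // Returns the URI scheme of a path string, or None when the path has no scheme.
  def schemeOf(path: String): Option[String] = Option(new URI(path).getScheme)

  def main(args: Array[String]): Unit = {
    val paths = Seq(
      "/home/test_dir/test_file.csv",
      "file:///home/test_dir/test_file.csv",
      "hdfs:///home/test_dir/test_file.csv"
    )
    paths.foreach { p =>
      // Assumption: a path without a scheme is resolved against the cluster's
      // default filesystem (fs.defaultFS), which is typically HDFS on YARN.
      val where = schemeOf(p) match {
        case Some("file") => "local filesystem of whichever node runs the code"
        case Some(other)  => s"filesystem registered for scheme '$other'"
        case None         => "cluster default filesystem (fs.defaultFS)"
      }
      println(s"$p -> $where")
    }
  }
}
```

This would explain both failures in the question: the first path is looked up on the default filesystem, and the second is looked up on the local disk of a cluster node, where /home/test_dir does not necessarily exist.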

To access a file passed with --files, use csv("#test_file.csv").

"should not be copied into hdfs"

Using --files copies the files into a temporary location that is mounted by the YARN executor, and you can see them from the YARN UI.
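If it is unclear where that temporary location is, one way to check is to log how a bare file name resolves inside the running container. The sketch below is plain Scala and assumes that files passed with --files end up in the container's working directory; the object name LocateShippedFile and the helper describe are hypothetical.

```scala
import java.io.File

object LocateShippedFile {
  // Resolves a file name against the JVM's current working directory
  // and reports whether a file exists at that location.
  def describe(name: String): String = {
    val f = new File(name)
    s"${f.getAbsolutePath} exists=${f.exists()}"
  }

  def main(args: Array[String]): Unit = {
    // In cluster mode this output goes to the container logs, not to the
    // terminal where spark-submit was run.
    println(s"working directory: ${new File(".").getAbsolutePath}")
    println(describe("test_file.csv"))
  }
}
```

Calling describe("test_file.csv") from the driver code and checking the container logs should show whether the file was shipped and which absolute path it received.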

Answer 2

The solution below worked for me:

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

To access the file passed to spark-submit:

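import scala.io.Source
// Open the shipped file by its bare name; it is resolved against the container's working directory.
val source = Source.fromFile("test_file.csv")
val lines = try source.getLines().toList finally source.close()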

Instead of specifying the complete path, specify only the name of the file to read. Because Spark already copies the file to the nodes, the data can be accessed using the file name alone.
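The question asked for a DataFrame rather than raw lines. One possible way to get there from this approach is to read the shipped file on the driver and pass the lines to the spark.read.csv overload that accepts a Dataset[String] (available since Spark 2.2). This is only a sketch under that assumption: the object name ShippedCsvToDataFrame and both helper names are invented for illustration, and it has not been run on the asker's cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

object ShippedCsvToDataFrame {
  // Reads every line of a file from the local filesystem of the current JVM.
  def readLocalLines(path: String): List[String] = {
    val source = Source.fromFile(path)
    try source.getLines().toList finally source.close()
  }

  // Parses in-memory CSV lines into a DataFrame; the first line is the header.
  def csvFromLines(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ShippedCsvToDataFrame").getOrCreate()
    // "test_file.csv" is the bare name of the file passed with --files.
    val df = csvFromLines(spark, readLocalLines("test_file.csv"))
    df.show()
    spark.stop()
  }
}
```

Because every line passes through the driver's memory, this only suits files that are small enough to fit there; larger inputs are better placed on a filesystem that all executors can reach.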
