Reading a local/Linux file in Spark Scala code executed in YARN cluster mode

Question

How can I access and read data from a local file in a Spark job executed in YARN cluster mode?

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

Spark code to read the csv:

val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")

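The spark-submit above fails with a local-file-not-found error (/home/test_dir/test_file.csv).

By default Spark looks for the file in hdfs://, but my file is local. It should not be copied into HDFS and should be read only from the local file system.

Any suggestions for resolving this error?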

Answer 1

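Using the file:// prefix reads the file from the filesystem of the YARN NodeManager hosts, not from the machine you submitted the job from.

To access the file passed with --files, use csv("#test_file.csv")

On the requirement that the file "should not be copied to hdfs":

Using --files copies the file into a temporary location that is mounted by the YARN executors; you can see these files in the YARN UI.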
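
One way to check this from inside the job is to list the container's working directory, which is where files shipped with --files are expected to appear under their bare names. The following is only a minimal sketch using the JVM standard library; the object name LocalizedFiles and its two methods are made up for illustration and are not part of Spark or YARN.

```scala
import java.io.File

// Hypothetical helper (not a Spark API): inspects the current working
// directory, assumed here to be the YARN container directory.
object LocalizedFiles {
  // Names of all entries in the working directory, sorted for stable output.
  def listWorkingDir(): List[String] =
    Option(new File(".").listFiles()) // listFiles() may return null
      .map(_.toList)
      .getOrElse(Nil)
      .map(_.getName)
      .sorted

  // Some(file) if a regular file with this bare name exists in the
  // working directory, None otherwise.
  def find(name: String): Option[File] = {
    val f = new File(name)
    if (f.isFile) Some(f) else None
  }
}
```

Logging LocalizedFiles.listWorkingDir().mkString(", ") at the start of the driver code would show whether test_file.csv is visible there before any read is attempted.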

Answer 2

The following solution worked for me:

local/linux file: /home/test_dir/test_file.csv

spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar

To access the file passed in spark-submit:

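import scala.io.Source
// Source.fromFile is the current API; mkString collects the iterator's contents into one String
val lines = Source.fromFile("test_file.csv").getLines().mkString("\n")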

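Do not specify the full path; specify only the name of the file you want to read. Since Spark has already copied the file across the nodes, the file's data can be accessed using just the file name.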
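
If a DataFrame is needed rather than raw text, the lines read on the driver can be handed to Spark's CSV reader, which accepts a Dataset[String] (available since Spark 2.2). This is a sketch under assumptions, not the answerer's code: it assumes the file is small enough to fit in driver memory, and the names LocalCsv and readLocalizedCsv are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

// Hypothetical helper: the names here are illustrative, not a Spark API.
object LocalCsv {
  // Reads a file from the driver's working directory by its bare name and
  // parses it as CSV. Assumes the whole file fits in driver memory.
  def readLocalizedCsv(spark: SparkSession, fileName: String): DataFrame = {
    import spark.implicits._ // provides the Encoder[String] for createDataset

    // Read all lines on the driver, closing the file handle afterwards.
    val source = Source.fromFile(fileName, "UTF-8")
    val lines: List[String] =
      try source.getLines().toList
      finally source.close()

    // DataFrameReader.csv(Dataset[String]) parses in-memory lines as CSV.
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(spark.createDataset(lines))
  }
}
```

With the spark-submit command above, a call such as LocalCsv.readLocalizedCsv(spark, "test_file.csv") would take the place of the spark.read...csv(path) calls from the question. For large files this approach is a poor fit, since every line passes through the driver.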
