Pyspark - Load file: Path does not exist
I am a newbie to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script I'm using is this one:
spark = SparkSession \
.builder \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)
When I run the script, it raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
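Why the bare path gets rewritten to an hdfs:// URI can be illustrated with a simplified sketch of Hadoop's path qualification (this is illustrative Python, not Spark's actual code; the default-filesystem address is taken from the error message above):

```python
from urllib.parse import urlparse

# Illustrative default filesystem, as reported in the error above.
DEFAULT_FS = "hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020"

def qualify(path, default_fs=DEFAULT_FS):
    """Sketch of how Hadoop qualifies a path: a path without a scheme
    (no file://, s3://, hdfs://, ...) is resolved against fs.defaultFS,
    which on an EMR cluster points at HDFS."""
    if urlparse(path).scheme:
        return path  # already fully qualified, used as-is
    return default_fs + path

print(qualify("/home/hadoop/observations_temp.csv"))
# -> hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
print(qualify("file:///home/hadoop/observations_temp.csv"))
# -> file:///home/hadoop/observations_temp.csv (read from each node's local disk)
```

This is why the bare path is looked up in HDFS, while the file:// form is looked up on each node's local filesystem instead.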
Then, I found out that I have to add file:// to the file path so it can be read locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-41-81.eu-west-1.compute.internal, executor 1): java.io.FileNotFoundException: File file:/home/hadoop/observations_temp.csv does not exist
I think this is because the file:// scheme just reads the file locally and does not distribute it to the other nodes.
Do you know how I can read the CSV file and make it available to all the other nodes?
You are right: your file is missing from your worker nodes, and that is what raises the error you got.
Here is the official documentation reference: External Datasets.
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions:
You copy your file onto each worker before starting the job;
Or you upload it to HDFS with something like this (recommended solution):
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with:
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read the file directly from S3 without downloading it (with the proper credentials, of course).
Some suggest that the --files flag provided with spark-submit uploads the files to the executors' working directories. I don't recommend this approach unless your CSV file is very small, but then you won't need Spark.
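For completeness, the --files approach would look roughly like this (a sketch; the script name my_job.py is illustrative):

```shell
# Ship the CSV to each executor's working directory alongside the job.
spark-submit --files /home/hadoop/observations_temp.csv my_job.py
```

Inside the job, the shipped copy can then be located with pyspark's SparkFiles.get("observations_temp.csv").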
Alternatively, I would stick with HDFS (or any distributed file system).
I think what you are missing is explicitly setting the master node while initializing the SparkSession. Try something like this:
spark = SparkSession \
.builder \
.master("local") \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
and then read the file in the same way you have been doing:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...
This might be useful for someone running Zeppelin on macOS using Docker.
Copy the files to a custom folder: /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')