
Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in /home/hadoop/. The script that I'm using is this one:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error message:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv

Then, I found out that I have to add file:// to the file path so it can read the file locally:

df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)

But this time, the above approach raised a different error:

Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-41-81.eu-west-1.compute.internal, executor 1): java.io.FileNotFoundException: File file:/home/hadoop/observations_temp.csv does not exist

I think this is because the file:// prefix just reads the file locally and does not distribute it across the other nodes.

Do you know how I can read the csv file and make it available to all the other nodes?

You are right: your file is missing from your worker nodes, and that is what raises the error you got.

Here is the official documentation reference: External Datasets.

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

So basically you have two solutions:

You copy your file onto each worker before starting the job (see the sketch after the HDFS example below);

Or you upload it to HDFS with something like this (recommended solution):

hadoop fs -put localfile /user/hadoop/hadoopfile.csv

Now you can read it with:

df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
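
If you go with the first option instead, copying the file to every worker can be a simple loop over the worker hosts. A minimal sketch, assuming passwordless SSH and hypothetical worker hostnames:

for host in ip-172-31-41-81 ip-172-31-42-99; do   # hypothetical worker hosts
    scp /home/hadoop/observations_temp.csv hadoop@"$host":/home/hadoop/
done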

It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it (with the proper credentials, of course).
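
As a rough sketch, assuming the file has been uploaded to a bucket your cluster's role can read (the bucket name below is a hypothetical placeholder):

df = spark.read.csv('s3://my-bucket/observations_temp.csv', header=True)

On EMR, the s3:// scheme is served by EMRFS, so no extra connector setup should be needed.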

Some suggest that the --files flag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you won't need Spark.
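
For completeness, a minimal sketch of that approach (the script name is a placeholder): --files ships the file to each node's working directory, and SparkFiles resolves its local path.

spark-submit --files /home/hadoop/observations_temp.csv my_script.py

and inside my_script.py:

from pyspark import SparkFiles

# Resolve the local copy that --files distributed; this can be fragile
# depending on the deploy mode, which is one reason not to rely on it.
path = 'file://' + SparkFiles.get('observations_temp.csv')
df = spark.read.csv(path, header=True)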

Alternatively, I would stick with HDFS (or any distributed file system).

I think what you are missing is explicitly setting the master node while initializing the SparkSession. Try something like this:

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

and then read the file in the same way you have been doing:

df = spark.read.csv('file:///home/hadoop/observations_temp.csv')

This should solve the problem.

This might be useful for someone running Zeppelin on a Mac using Docker.

  1. Copy the file to a custom folder: /Users/my_user/zeppspark/myjson.txt

  2. docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0

  3. On Zeppelin you can run this to get your file:

%pyspark

json_data = sc.textFile('/zeppelin/notebook/myjson.txt')
