
Spark/Hadoop can't find file on AWS EMR

I'm trying to read in a text file on Amazon EMR using the python Spark libraries. The file is in the home directory (/home/hadoop/wet0), but Spark can't seem to find it.

Line in question:

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

Error:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-19-121.us-west-2.compute.internal:8020/user/hadoop/wet0;'

Does the file have to be in a specific directory? I can't find information about this anywhere on the AWS website.

If it's in the local filesystem, the URL should be file://user/hadoop/wet0. If it's in HDFS, that should be a valid path. Use the hadoop fs command to take a look, e.g.:

hadoop fs -ls /home/hadoop

One thing to look at: you say it's in "/home/hadoop", but the path in the error is "/user/hadoop". Make sure you aren't using ~ on the command line, as bash will do the expansion before Spark sees it. Best to use the full path /home/hadoop.
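As a sketch of that advice (the SparkSession setup and script context here are assumed; the question only shows the read itself), an explicit scheme removes the ambiguity about which filesystem the path resolves against:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wet-reader").getOrCreate()

# A bare path such as /user/hadoop/wet0 resolves against the default
# filesystem, which on EMR is HDFS (hence the hdfs:// error above).
# An explicit file:// scheme forces the local filesystem instead.
lines = spark.read.text("file:///home/hadoop/wet0").rdd.map(lambda r: r[0])
print(lines.take(5))

Note that on a multi-node cluster a file:// path only works if the file exists at that same path on every node; for anything beyond a quick test, putting the file into HDFS (or S3) is the safer route.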

I don't know if it's just me, but when I tried the suggestion above I still got a "path does not exist" error on my EMR cluster. I just added one more "/" before user and it worked (in a file: URI the part after // is parsed as a host name, so an absolute local path needs the third slash):

file:///user/hadoop/wet0
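An alternative, not from the thread: if the file is meant to live in HDFS so that the unqualified path from the error message resolves on its own, it could first be copied in with the hadoop fs command already mentioned above (assuming /user/hadoop, the EMR default HDFS home directory, as the target):

hadoop fs -put /home/hadoop/wet0 /user/hadoop/wet0

After that, a plain spark.read.text("/user/hadoop/wet0") should find it in HDFS without any file:// prefix.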

Thanks for the help!
