
Accessing HDFS from PySpark fails

I have installed Hadoop 2.7.3 and PySpark 2.2.0 on Ubuntu 17.04.

Both Hadoop and PySpark seem to work properly on their own. However, I have not managed to read files from HDFS in PySpark. When I try to read a file from HDFS, I get the following error:

https://imgur.com/j6Dy2u7

I read in another post that the environment variable HADOOP_CONF_DIR needs to be set in order to access HDFS. I did that as well (see the next screenshot), but then I get a different error and PySpark stops working entirely.

https://imgur.com/AMpJ6TB
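For reference, the variable can be set in the shell or from Python before the SparkSession is created. A minimal sketch, assuming the Hadoop configuration lives under /usr/local/hadoop/etc/hadoop (a hypothetical path; adjust it to your install):

    import os

    # Must be set before the first SparkSession is created, because the
    # backing JVM reads HADOOP_CONF_DIR at startup. The path is an assumed
    # location for a manual Hadoop install; adjust to your installation.
    os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-check").getOrCreate()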

If I delete the environment variable, everything works as before.

How can I fix this so that I can open files from HDFS in PySpark? I have spent a long time on this and would greatly appreciate any help!

Although this answer is a bit late: you should use hdfs:///test/PySpark.txt (note the three slashes).
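A minimal sketch of what that looks like in PySpark (the path /test/PySpark.txt is taken from the answer; the app name is arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-hdfs").getOrCreate()

    # With three slashes the authority part of the URI is empty, so the
    # NameNode host and port are resolved from fs.defaultFS in the Hadoop
    # configuration (found via HADOOP_CONF_DIR) rather than spelled out
    # explicitly, as they would be in hdfs://host:port/test/PySpark.txt.
    rdd = spark.sparkContext.textFile("hdfs:///test/PySpark.txt")
    print(rdd.take(5))

Using the three-slash form keeps the code portable: the same path works against any cluster whose configuration is on the classpath, without hard-coding a NameNode address.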
