
Access hdfs cluster from pydoop

I have an HDFS cluster and Python on the same Google Cloud Platform. I want to access the files present in the HDFS cluster from Python. I found that this can be done with pydoop, but I may be struggling to give it the right parameters. Below is the code that I have tried so far:

import pydoop.hdfs as hdfs
import pydoop

pydoop.hdfs.hdfs(host='url of the file system goes here',
                 port=9864, user=None, groups=None)

"""
 class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)

    A handle to an HDFS instance.

    Parameters

            host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.

            port (int) – the port on which the NameNode is listening

            user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.

            groups (list) – ignored. Included for backwards compatibility.


"""

#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))

It gives this error:

RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR

And if I execute this line of code:

print (hdfs.ls("/vs_co2_all_2019_v1.csv"))

nothing happens. But this "vs_co2_all_2019_v1.csv" file does exist in the cluster, although it was not available at the moment I took the screenshot.
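For reference, the HADOOP_CONF_DIR mentioned in the error is the environment variable Pydoop uses to locate the Hadoop client configuration, and the 'default' host mode described in the docstring above relies on that same configuration. A minimal sketch, assuming a hypothetical config directory such as /etc/hadoop/conf:

import os

# Hypothetical path: point this at the directory holding core-site.xml and
# hdfs-site.xml for the cluster (on many Hadoop/Dataproc nodes this is
# /etc/hadoop/conf). Setting it in the shell before starting Python also works.
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"

import pydoop.hdfs as hdfs  # imported after the variable is set

# With a valid configuration, the 'default' filesystem from the docstring can
# be used instead of hard-coding a host and port:
fs = hdfs.hdfs(host="default", port=0)
print(fs.list_directory("/"))
fs.close()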

My HDFS screenshot is shown below:

[screenshot: HDFS structure]

and the credentials that I have are shown below:

[screenshot: credentials]

Can anybody tell me what I am doing wrong? Which credentials do I need to put where in the pydoop API? Or maybe there is another, simpler way around this problem. Any help will be much appreciated!

Have you tried the following?

import pydoop.hdfs as hdfs
import pydoop

hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
                               port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")

or simply:

hdfs_object.list_directory("/")

Keep in mind that the module-level functions in pydoop.hdfs are not directly tied to the hdfs class instance (hdfs_object). Thus, the connection that you established in the first command is not used by hdfs.ls("/vs_co2_all_2019_v1.csv").
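For illustration, here is a minimal sketch of both routes; namenode-host and port 8020 are placeholders for your cluster's NameNode RPC endpoint (9864 is typically a DataNode HTTP port rather than the NameNode RPC port):

import pydoop.hdfs as hdfs

# Placeholder address: replace with the NameNode's RPC host and port
# (commonly 8020 or 9000).
fs = hdfs.hdfs(host="namenode-host", port=8020, user=None, groups=None)

print(fs.list_directory("/"))                # uses the connection held by fs
print(fs.exists("/vs_co2_all_2019_v1.csv"))  # check a single path on that connection

# The module-level helpers resolve the filesystem from the path (or from the
# Hadoop configuration), so a fully qualified URI reaches the same cluster:
print(hdfs.ls("hdfs://namenode-host:8020/"))

fs.close()

Either way, it is the handle returned by the constructor, or a URI that names the cluster explicitly, that determines which filesystem is contacted.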

